tcler's blog --- 其实我是一个程序员
Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

Fedora-28 glibc/regex 库更新导致问题 早有预谋?

问题

Zorro 这两天发现一个测试套件在新的系统上很诡异的安装失败了,调查发现是 glibc 更新了 regex 库导致的; 具体表现就是 grep [a-z] 匹配不能得到预期结果,大写字母也会匹配到;但上游开发确说这是预期结果,又因为信息不对称开发给出的基于权重算法的解释 我们并没有看懂~

[yjh@test nfs]$ echo ABCD | grep -o [a-z]
A
B
C
D

调查分析

先通过猜测: 是否跟 locale 设置有关? 然后添加 LANG=C 验证,果然得到预期结果

[yjh@test nfs]$ echo ABCDa | LANG=C grep -o [a-z]
a

然后翻看 man page 才发现不止最新的 Fedora-28, 甚至 RHEL-3 系统的 man grep 都有相同的描述

[root@bkr-host ~]# lsb_release -sir
CentOS 3.9
[root@bkr-host ~]# man grep | col -bx | grep -i 'A bracket expression' -A 16
       A bracket expression is a list of characters enclosed by [ and ].   It
       matches  any  single character in that list; if the first character of
       the list is the caret ^ then it matches any character not in the list.
       For  example,  the  regular expression [0123456789] matches any single
       digit.

       Within a bracket expression, a range expression consists of two  char-
       acters  separated  by  a hyphen.  It matches any single character that
       sorts between the two characters, inclusive, using the  locale’s  col-
       lating  sequence  and  character  set.   For example, in the default C
       locale, [a-d] is equivalent to [abcd].  Many locales  sort  characters
       in  dictionary  order,  and  in  these  locales [a-d] is typically not
       equivalent to [abcd]; it might be equivalent to [aBbCcDd],  for  exam-
       ple.  To obtain the traditional interpretation of bracket expressions,
       you can use the C locale by setting the LC_ALL environment variable to
       the value C.

原来这次的 glibc 修改是 “早有预谋” 的 … 原来我们一直以来对 [a-z] [A-Z] 的用法是 “靠巧合编程”, 能得到预期结果的前提是必须 LC_ALL=C; 其他 locale 设置,[a-d] 的预期结果可能是 [abcd] 也可能 [aBbCcDd] 还可能 [aAbBcCd] …

[yjh@test nfs]$ echo {a..z} {A..Z} | grep [a-d] -o | xargs
a b c d A B C

近一步调查验证

根据 man page 描述 “Many locales sort characters in dictionary order, “, 所以我们可以 猜测关于这次修改: 至少我们用到的 en_US.utf8 zh_CN.utf8 的字符字典顺序被修改了:

原先是 A .. Z -> a .. z , 而现在改成 aA bB .. zZ

再结合 “Fedora 28 Update: glibc-2.27-30.fc28”[1] 中的 bug 1582229[2] 的描述,我们猜测排序里不止包含 aA 而是更多 比如 “aAÀâ” “bBB`” … , 验证如下:

总结

以后些正则最好使用 [:alnum:], [:alpha:],[:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:] 来代替 [start-end] 用法; 或者保证自己的程序在 LC_ALL=C 环境下执行~

*如果上游想向后兼容 [a-z],并希望匹配 “â” 类的字符,字典顺序估计还得改成 AÀ .. ZZ -> aâ .. zz; 目前大家还在讨论中~


*补充:

letter with breve (带短音符号的字符集) 的测试结果对比(LANG=zh_CN.UTF-8):

[yjh@rhel7 ~]$ echo "Ă㥹ȀȁȂȃaA" | grep -o [a-z] | xargs
ă ą ȁ ȃ a
[yjh@rhel7 ~]$ echo "Ă㥹ȀȁȂȃaA" | grep -o [A-Z] | xargs
Ă Ą Ȁ Ȃ A
[yjh@rhel7 ~]$ echo "Ă㥹ȀȁȂȃaA" | grep -o [[=a=]] | xargs
Ă ă Ą ą Ȁ ȁ Ȃ ȃ a A
[yjh@rhel7 ~]$ echo "Ă㥹ȀȁȂȃaA" | grep -o [[=A=]] | xargs
Ă ă Ą ą Ȁ ȁ Ȃ ȃ a A
# RHEL-6 跟 RHEL-7 测试结果相同

[yjh@F-28 ~]$  echo "Ă㥹ȀȁȂȃaA" | grep -o [a-z] | xargs
Ă ă Ą ą Ȁ ȁ Ȃ ȃ a A
[yjh@F-28 ~]$  echo "Ă㥹ȀȁȂȃaA" | grep -o [A-Z] | xargs
Ă ă Ą ą Ȁ ȁ Ȃ ȃ A
[yjh@F-28 ~]$  grep -o [[=a=]] <<< "Ă㥹ȀȁȂȃaA" | xargs
Ă ă Ą ą Ȁ ȁ Ȃ ȃ a A
[yjh@F-28 ~]$  grep -o [[=A=]] <<< "Ă㥹ȀȁȂȃaA" | xargs
Ă ă Ą ą Ȁ ȁ Ȃ ȃ a A

说了半天 你问我 上游好好的为啥要改字典顺序!? 还是不清楚~ 继续等开发具体解释吧; 但是以后写正则匹配大小写还是用 [[:upper:]] [[:lower:]] 吧~


(2018-07-20)最新进展,经过邮件讨论上游意识到影响面太大,开始尝试向后兼容了: https://www.sourceware.org/ml/libc-alpha/2018-07/msg00620.html

BTW: 为了方便测试直接拷贝运行,测试脚本放这里吧

echo 'grep -o [[=a=]] <<< "Ă㥹ȀȁȂȃaA" | xargs
grep -o [[=A=]] <<< "Ă㥹ȀȁȂȃaA" | xargs
grep -o [a-z] <<< "Ă㥹ȀȁȂȃaA" | xargs
grep -o [A-Z] <<< "Ă㥹ȀȁȂȃaA" | xargs' | LD_LIBRARY_PATH=/usr/lib bash -i

*more infomation

在之前的调查中,曾发现 gawk 并没有受这次修改的影响: 不管什么 Locale 设置,都能得到预期的结果。今天(2018-07-24) yoyang 在 gawk 手册里找到了原因,标题很形象的描述出了这个问题导致的困惑和混乱: A Long .. Sad .. Story :)

A.8 Regexp Ranges and Locales: A Long Sad Story: https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html