Hi all,
I have two Linux distributions: Ubuntu and CentOS.
My problem is that the sort command behaves differently on the two when sorting Chinese strings in a UTF-8 encoded file.
On Ubuntu the result is correct, but on CentOS it is wrong.
I googled the problem, and it seems to be caused by LC_COLLATE.
So I changed "/etc/sysconfig/i18n" on CentOS, and now both systems have the same LC_* settings, like this:
CentOS:
=============================
[root@localhost ~]# locale
LANG=zh_CN.UTF-8
LC_CTYPE=zh_CN.UTF-8
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE=zh_CN.UTF-8
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=

Ubuntu
=============================
peter@ubuntu:~$ locale
LANG=zh_CN.UTF-8
LANGUAGE=zh_CN:zh
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
But the result is still incorrect on CentOS! It is driving me crazy!
PS: The background to this problem is that PostgreSQL's "order by" depends on the sort order provided by the OS.
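
For anyone who wants to reproduce the comparison outside of sort or PostgreSQL, here is a minimal Python sketch. It is only an illustration, not something from the original report: the four sample characters and the function names are made up for the example. It compares plain code point order (which is also UTF-8 byte order, i.e. what LC_COLLATE=C gives) with the order produced by the C library's collation rules for zh_CN.UTF-8. What the locale-aware order looks like depends on the installed locale data, so no expected result is shown for it.

```python
import locale

# Four single-character strings, written as escapes so the code points are explicit.
# These sample values are hypothetical and chosen only for illustration.
WORDS = ["\u56fd", "\u5b89", "\u4e2d", "\u5317"]  # guo, an, zhong, bei


def codepoint_sort(words):
    """Sort by Unicode code point, which matches UTF-8 byte order."""
    return sorted(words)


def locale_sort(words, locale_name="zh_CN.UTF-8"):
    """Sort using the C library's collation rules for locale_name.

    Returns None if that locale is not installed on this machine.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm builds sort keys that compare the way strcoll() would.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    print("code point order:", codepoint_sort(WORDS))
    print("locale order:    ", locale_sort(WORDS))
```

Running this on both machines and comparing the second line should show whether the two C libraries really collate the same strings differently under the same locale name.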
I also tried regenerating the locale:
localedef -f UTF-8 -i zh_CN zh_CN.UTF-8
Nothing happened.
---------- Forwarded message ----------
From: Peter Cai newptcai@gmail.com
Date: Tue, Sep 9, 2008 at 9:42 AM
Subject: Problem of "sort" utf8 file.
To: centos@centos.org
Peter Cai wrote:
PS: the background of this problem is that Postgresql's "order by" command depends on the sort result of the OS.
AFAIK PostgreSQL determines its own locale from the system locale when it is initdb'ed for the first time. That locale is then used for all databases, even if you later change the system locale. Perhaps you need to dump your databases and do a new initdb with the proper locale set before this starts working the way you want.
-tgc
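
As a rough sketch of the re-initialisation step suggested above, the snippet below only builds the initdb command line with an explicit locale; it does not run anything. The helper name and the data directory are hypothetical. The --locale and -D options are standard initdb options, but check the documentation for your PostgreSQL version before relying on this.

```python
import shlex


def build_initdb_command(data_dir, locale_name="zh_CN.UTF-8"):
    """Return the argv list for creating a new cluster with an explicit locale."""
    return ["initdb", "--locale=" + locale_name, "-D", data_dir]


if __name__ == "__main__":
    # Hypothetical data directory, shown only for illustration.
    cmd = build_initdb_command("/tmp/pgdata-example")
    # Print a shell-safe form of the command instead of executing it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the locale explicitly avoids depending on whatever environment happens to be active when initdb runs. The existing databases would still need to be dumped first and restored afterwards.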
I already did that.
The problem is not that the locale is set incorrectly: even with the correct locale, the sort order is wrong. The result returned by pg is the same as that of the "sort" command, which shows that pg uses the OS's facilities to sort strings.
Someone told me it is a glibc bug. I checked and found that Ubuntu uses glibc 2.7 while CentOS 5 uses glibc 2.5. That may be the problem.
But updating glibc is too hard for me. I may have to give up.
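
A quick way to confirm which C library version a machine is running, sketched in Python. The helper names are made up for this example; platform.libc_ver() is from the standard library and may return empty strings where the version cannot be detected.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def local_libc():
    """Return (name, version) for the C library the Python binary is linked against."""
    return platform.libc_ver()


if __name__ == "__main__":
    name, version = local_libc()
    print("C library:", name or "unknown", version or "unknown")
    # The two versions mentioned in this thread, compared numerically.
    print("2.5 older than 2.7:", parse_version("2.5") < parse_version("2.7"))
```

Comparing tuples of integers rather than raw strings keeps the ordering right for versions such as 2.10 versus 2.7.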
On Tue, Sep 9, 2008 at 3:05 PM, Tom G. Christensen tgc@statsbiblioteket.dk wrote: