I have a file which contains non-ASCII characters (umlauts, accented characters, etc.) both in its filename and in its content. The only way I have been able to see these characters is inside vim, where they are displayed correctly no matter what I have LANG set to. My default LANG is en_US.utf8, but I have tried de_DE.utf8, de_DE.iso88591, and various others. In the output of a simple ls, the characters are shown as either a simple "?" or a "?" in a box depending on what I have LANG set to, and the same is true if I use cat to display the contents. I have tried this in both an xterm and a gnome-terminal window on a CentOS 5.3 system.
I am taking stabs in the dark here as I've never had to deal with LANG and locales before. Am I missing something obvious, or is there more to it than simply setting LANG? Since I can see the characters correctly in vim, I'm sure it's not a font issue. I just want to be able to display these characters correctly with simple Linux commands like ls and cat.
Thanks, Alfred
Alfred von Campe wrote:
I am taking stabs in the dark here as I've never had to deal with LANG and locales before.
The 'file' command displays encoding information. If you have to change the encoding, use 'recode'. Example :
$ recode latin1..utf8 file
... from latin 1 (iso-8859-1 or the likes) to UTF-8.
$ recode utf8..latin1 file
... the other way around.
Note the syntax: two dots separate the source and target encodings.
Cheers,
Niki
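Where recode is not installed, the same kind of conversion can be sketched with iconv, which ships with glibc. This is only a sketch under stated assumptions, not Niki's method: the file names sample-latin1.txt and sample-utf8.txt are made up for illustration, and the sample text is created by the script itself. Unlike "recode latin1..utf8 file", which rewrites the file in place, iconv writes to standard output, so the original is left untouched.

```shell
#!/bin/sh
# Sketch: convert a Latin-1 file to UTF-8 with iconv (file names are hypothetical).
workdir=$(mktemp -d)

# Create a sample Latin-1 file: \355 is octal for 0xED, "í" in ISO 8859-1.
printf 'This is an i acute: \355\n' > "$workdir/sample-latin1.txt"

# Convert to a new file; the source file is not modified.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/sample-latin1.txt" > "$workdir/sample-utf8.txt"

# The single byte 0xED becomes the two bytes 0xC3 0xAD, so the file grows by one byte.
latin1_size=$(wc -c < "$workdir/sample-latin1.txt" | tr -d ' ')
utf8_size=$(wc -c < "$workdir/sample-utf8.txt" | tr -d ' ')
echo "Latin-1: $latin1_size bytes, UTF-8: $utf8_size bytes"
```

The one-byte difference in size is a quick sanity check that exactly one character was re-encoded.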
On Oct 27, 2009, at 9:45, Niki Kovacs wrote:
The 'file' command displays encoding information. If you have to change the encoding, use 'recode'. Example :
Thanks for the quick response, Niki, but I don't need to change the encoding (at least I don't think I do). I just want ls to show me the non-ASCII characters in its output and cat to display the characters properly in my xterm or gnome-terminal. Currently, both of these commands display them as "?". My current test case has a file which contains a "ç" (0xE7, c with cedilla) in its filename, and an "í" (0xED, i acute) inside the file.
Alfred
Alfred von Campe wrote:
Thanks for the quick response, Niki, but I don't need to change the encoding (at least I don't think I do). I just want ls to show me the non-ASCII characters in its output and cat to display the characters properly in my xterm or gnome-terminal. Currently, both of these commands display them as "?". My current test case has a file which contains a "ç" (0xE7, c with cedilla) in its filename, and an "í" (0xED, i acute) inside the file.
[kikinovak@babasse:~] $ touch "Fichier encodé en français"
[kikinovak@babasse:~] $ touch "Wie heißt diese Datei denn bloß äh"
[kikinovak@babasse:~] $ ls F* W*
Fichier encodé en français Wie heißt diese Datei denn bloß äh
What's your current system-wide locale ?
[kikinovak@babasse:~] $ echo $LANG
fr_FR.UTF-8
Cheers,
Niki
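On the earlier question of whether there is more to it than setting LANG: under the POSIX locale rules, LC_ALL overrides LC_CTYPE, which in turn overrides LANG, so a stray LC_ALL or LC_CTYPE can make a change to LANG appear to do nothing. A small sketch of that lookup order follows. effective_ctype is a hypothetical helper name, not a standard command; the "locale" command reports the real values.

```shell
#!/bin/sh
# Hypothetical helper: print the locale that governs character handling
# (LC_CTYPE), following the POSIX precedence LC_ALL > LC_CTYPE > LANG.
# Empty variables count as unset, hence the ":-" form.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-POSIX}}}"
}

# Run each case in a subshell so the caller's environment is untouched.
case_lang_only=$(unset LC_ALL LC_CTYPE; LANG=en_US.UTF-8; effective_ctype)
case_lc_all=$(LC_ALL=C; LANG=de_DE.UTF-8; effective_ctype)

echo "LANG only:       $case_lang_only"
echo "LC_ALL=C + LANG: $case_lc_all"
```

In the second case the German LANG setting is ignored entirely, which is the situation worth ruling out before experimenting with different LANG values.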
On Oct 27, 2009, at 10:51, Niki Kovacs wrote:
[kikinovak@babasse:~] $ touch "Fichier encodé en français"
[kikinovak@babasse:~] $ touch "Wie heißt diese Datei denn bloß äh"
[kikinovak@babasse:~] $ ls F* W*
Fichier encodé en français Wie heißt diese Datei denn bloß äh
To be honest, I don't even know how to create those characters on the command line on Linux (I am writing this on a Mac where I know how to generate characters using the option key). However, I have an existing file on Linux that has the problem I described. If you must know, I wrote a small shell script that creates this file by cutting and pasting non-ASCII characters from the iso_8859-1 man page.
What's your current system-wide locale ?
en_US.UTF-8
In case you are wondering why I am asking about this when I don't even know how to type these characters: I have a user who wants to be able to use non-ASCII characters in file names.
Thanks, Alfred
Alfred von Campe wrote:
To be honest, I don't even know how to create those characters on the command line on Linux (I am writing this on a Mac where I know how to generate characters using the option key).
I vaguely remember Mac uses UTF-16 as default encoding. This could be the source of your problem.
On Oct 27, 2009, at 13:40, Niki Kovacs wrote:
I vaguely remember Mac uses UTF-16 as default encoding. This could be the source of your problem.
Forget I said anything about the Mac; I'm only using it to write these emails. The file in question was completely created on Linux. The filename contains the character 0xE7 (c with cedilla) and the file itself contains the character 0xED (i acute). Neither character is displayed correctly using ls (filename) or cat (content), but I can look at the file with vim. Here is some output cut&pasted from my xterm window to illustrate the issue:
bash-3.2$ ls -l XXX*
-rw-r--r-- 1 av16209 GRP-HEPDSW 22 oct 27 14:11 XXX?
bash-3.2$ cat test.sh
#!/bin/sh
echo "This is an i acute: � > XXX�
bash-3.2$
I have also attached a gzip'ed test.sh.
Alfred
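The test case can be reproduced without pasting anything, which also makes the bytes involved explicit. The sketch below assumes a Linux filesystem that accepts arbitrary bytes in file names. It writes the same 22-byte content as in the listing above, using octal escapes for 0xE7 and 0xED, and dumps the last bytes with od; the directory and variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch: recreate the Latin-1 test file from explicit bytes (no copy and paste).
demo_dir=$(mktemp -d)

# \347 = 0xE7 ("ç" in ISO 8859-1), \355 = 0xED ("í" in ISO 8859-1).
name=$(printf 'XXX\347')
printf 'This is an i acute: \355\n' > "$demo_dir/$name"

# Byte count of the content; each accented letter is a single byte here.
content_size=$(wc -c < "$demo_dir/$name" | tr -d ' ')
echo "content size: $content_size bytes"

# Hex dump of the last two bytes: the lone 0xED followed by the newline.
last_bytes=$(tail -c 2 "$demo_dir/$name" | od -An -tx1 | tr -d ' \n')
echo "last bytes: $last_bytes"
```

A hex dump sidesteps the terminal's own decoding, so it shows what is really in the file regardless of the current locale.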
On 10/27/2009 02:16 PM Alfred von Campe wrote:
Forget I said anything about the Mac; I'm only using it to write these emails. The file in question was completely created on Linux. The filename contains the character 0xE7 (c with cedilla) and the file itself contains the character 0xED (i acute). Neither character is displayed correctly using ls (filename) or cat (content), but I can look at the file with vim. Here is some output cut&pasted from my xterm window to illustrate the issue:
bash-3.2$ ls -l XXX*
-rw-r--r-- 1 av16209 GRP-HEPDSW 22 oct 27 14:11 XXX?
bash-3.2$ cat test.sh
#!/bin/sh
echo "This is an i acute: � > XXX�
bash-3.2$
Alfred,
Seems to me that the values of the (badly displaying) characters are being preserved, but they're not being displayed properly. That is, the shell isn't accessing the correct font set(s). Your report also tells us that vim is somehow capable of grabbing the correct fonts. To verify this, make vim create the file names (in the languages in question). E.g., create a file with vi containing just one German/Greek/French word, say, Έντελέχεια (Entelecheia, an ancient Greek word). If the name of the file is "nonenglish", then, after you save in vim, run the shell commands
touch temp; mv temp "$(cat nonenglish)"
This should rename the temp file to the Greek word inside of the nonenglish file. If I'm understanding you correctly, the Greek word should appear fine inside of vim, but will be butchered by the shell. Then do
ls > dirfile; vi dirfile
and see if the Greek word appears correct again in vim.
Tell us what you find.
On Oct 27, 2009, at 19:28, ken wrote:
E.g., create a file with vi containing just one German/Greek/French word, say, Έντελέχεια (Entelecheia, an ancient Greek word). If the name of the file is "nonenglish", then, after you save in vim, run the shell commands
touch temp; mv temp "$(cat nonenglish)"
I guess my issue is how these characters get generated in the first place. By cutting and pasting the word "Έντελέχεια" from your email into a file on Linux (via the Synergy mouse & keyboard sharing utility, no less), I was able to create a file that contains that word and is also named with that word, and display it correctly with cat and ls. So UTF-8 encoding appears to work just fine. It's 8-bit characters in ISO 8859-1 encoding that are causing my problem. Fortunately, I think I don't have to deal with ISO 8859-1 encodings, and my problem was self-created by cutting and pasting characters from the iso_8859-1 man page.
Now I have a follow up question: so far I've only been able to enter non-ASCII characters on my Linux system by cutting & pasting; how do I actually generate any of these characters on a system with a US keyboard?
Thanks to all who have helped me solve this problem.
Alfred
On 10/28/2009 09:10 AM Alfred von Campe wrote:
Now I have a follow up question: so far I've only been able to enter non-ASCII characters on my Linux system by cutting & pasting; how do I actually generate any of these characters on a system with a US keyboard?
There are a lot of keyboard configuration files under /lib/kbd/keymaps/. One of these is loaded at boot time, probably the one you selected when you first set up the system. I don't know all the steps you'll need (I've never tried to do what you're doing), but read the xmodmap manpage and then examine the keycodes in the keymap files mentioned above. For example, mk-utf.map.gz under /lib/kbd/keymaps/i386/qwerty has coding to toggle from one keymap to another. In other words, you'd type in one language, hit a couple of keys to toggle the keyboard into another language, and then hit another two or three hotkeys to get back to English, or whichever your home language is.
Unless there's some app I don't know about, this is going to be a lot of work, especially if you have to figure out how keymaps work. But work it out and you'll be linux-famous.
Document everything.
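One keyboard-independent way to produce such characters in a shell, assuming a UTF-8 locale, is to spell out their UTF-8 bytes with printf octal escapes. This does not replace a proper keymap; it is only a sketch for generating test data. The byte values below are the standard UTF-8 encodings of ç and í, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch: produce non-ASCII characters from their UTF-8 bytes.
# U+00E7 "ç" is 0xC3 0xA7 (octal 303 247); U+00ED "í" is 0xC3 0xAD (octal 303 255).
c_cedilla=$(printf '\303\247')
i_acute=$(printf '\303\255')

# In a UTF-8 terminal this prints the two letters; elsewhere it prints raw bytes.
printf 'c with cedilla: %s\ni acute: %s\n' "$c_cedilla" "$i_acute"

# Confirm the bytes that were generated.
c_hex=$(printf '%s' "$c_cedilla" | od -An -tx1 | tr -d ' \n')
i_hex=$(printf '%s' "$i_acute" | od -An -tx1 | tr -d ' \n')
echo "bytes: $c_hex $i_hex"
```

Octal escapes are used because they are part of POSIX printf and therefore also work in older shells such as the bash 3.2 seen in this thread.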
On 10/27/2009 07:16 PM, Alfred von Campe wrote:
The filename contains the character 0xE7 (c with cedilla) and the file itself contains the character 0xED (i acute). Neither character is displayed correctly using ls (filename) or cat (content), but I can look at the file with vim. Here is some output cut&pasted from my xterm window to illustrate the issue:
...
If your locale is UTF8, íéèæøå would be multibyte characters.
If your characters are one byte only, they are not UTF-8.
vim knows how to handle this correctly:
If you open the file with vi (you would see the text [converted] on the bottom line), and do:
:set fileencoding=utf-8
and write out the file again it should be converted so that cat displays it correctly.
You can use the convmv script to convert filenames into utf-8 (yum install convmv).
Mogens
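The point about single-byte characters can be checked mechanically: a file that contains a lone byte such as 0xED is not valid UTF-8. One way to test this, sketched here on the assumption that iconv is available, is to ask iconv to decode the file as UTF-8 and look at its exit status. is_valid_utf8 is a hypothetical helper name, and the two sample files are created by the script itself.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

check_dir=$(mktemp -d)
printf 'i acute: \355\n' > "$check_dir/latin1.txt"      # single byte 0xED
printf 'i acute: \303\255\n' > "$check_dir/utf8.txt"    # two bytes 0xC3 0xAD

for f in "$check_dir/latin1.txt" "$check_dir/utf8.txt"; do
    if is_valid_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): not valid UTF-8"
    fi
done
```

A file that fails this check in a UTF-8 locale is a likely candidate for the "?" output described earlier, though the check cannot say which legacy encoding the file actually uses.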
On Oct 28, 2009, at 2:59, Mogens Kjaer wrote:
If your locale is UTF8, íéèæøå would be multibyte characters.
If your characters are one byte only, they are not UTF-8.
That was the key: the file was not UTF-8.
vim knows how to handle this correctly:
Yes, it apparently does. It almost appears to be magic how it figures this out...
If you open the file with vi (you would see the text [converted] on the bottom line), and do:
:set fileencoding=utf-8
and write out the file again it should be converted so that cat displays it correctly.
Yes, it certainly does. Thanks for the tip. I guess I have a lot to learn about encodings...
You can use the convmv script to convert filenames into utf-8 (yum install convmv).
I will check this out!
Thanks, Alfred
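convmv is the tool suggested above for renaming files. For a single file, the idea can be approximated by hand, as in this sketch: it re-encodes the name with iconv and then renames the file. The assumptions are that the old name really is ISO 8859-1, that the filesystem accepts arbitrary bytes in names, and that the file is in the current directory. latin1_name_to_utf8 is a hypothetical helper, not part of convmv.

```shell
#!/bin/sh
# Hypothetical helper: rename one file whose name is ISO 8859-1 to UTF-8.
latin1_name_to_utf8() {
    old_name=$1
    new_name=$(printf '%s' "$old_name" | iconv -f ISO-8859-1 -t UTF-8) || return 1
    # Nothing to do for pure-ASCII names; refuse to overwrite an existing file.
    [ "$old_name" = "$new_name" ] && return 0
    [ -e "$new_name" ] && return 1
    mv -- "$old_name" "$new_name"
}

rename_dir=$(mktemp -d)
cd "$rename_dir" || exit 1
touch "$(printf 'XXX\347')"          # "XXXç" with ç as the single byte 0xE7

latin1_name_to_utf8 "$(printf 'XXX\347')"

# The name is now five bytes: "XXX" plus 0xC3 0xA7.
new_hex=$(ls | tr -d '\n' | od -An -tx1 | tr -d ' \n')
echo "name bytes: $new_hex"
```

For whole directory trees, a dedicated tool such as convmv is the safer choice, since this sketch handles only one name at a time and does not detect names that are already UTF-8.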