Greetings,
I am able to get an English word list from <file> by using the following command:
cat <file> | tr -sc A-Za-z '\012'
My question is how to specify Unicode characters as well as ASCII in the tr command. Specifically, the text file contains 3-byte sequences starting with \xe0.
I am able to see the character using:
echo -e '\xe0\xa5\xbf'
What regex incantation would make tr give the results I want?
I am new to Unicode.
Regards,
Rajagopal
You don't say much about what delimits the words. Spaces? Please give more info, but http://www.regular-expressions.info/unicode.html leads to some Perl solutions.
Greetings,
On Wed, Feb 3, 2010 at 11:03 AM, Joseph L. Casale jcasale@activenetwerx.com wrote:
You don't say much as to what bounds the words, spaces? Give more info, but http://www.regular-expressions.info/unicode.html leads to some Perl solutions.
Thanks for the quick reply.
I have started perusing it.
Perl is currently Martian to me :). I hope to gain fluency in it in the very near future.
The said Unicode strings (with multi-byte "points") may be bounded by commas, single quotes, spaces, etc. I am ready to sacrifice all characters except [:alpha:] and the Unicode strings.
Thanks again and Regards,
Rajagopal
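
One tr-only approach that may fit the requirement above, offered as a sketch and not a tested recipe for your data: in UTF-8, every byte of a multi-byte character has its high bit set (octal 200 to 377), so a byte-oriented tr can be told to keep those bytes alongside A-Za-z. This assumes the file is UTF-8 and that tr works on bytes in the C locale (as GNU tr does). The function name `words` is only illustrative. Note that it keeps every non-ASCII character, not just letters, so things like curly quotes would survive as part of a word.

```shell
# Hypothetical helper: split stdin into one word per line, keeping
# ASCII letters and every byte >= 0x80 (all UTF-8 multi-byte sequences).
# Assumes UTF-8 input and a byte-oriented tr; LC_ALL=C forces byte semantics.
words() {
    LC_ALL=C tr -sc 'A-Za-z\200-\377' '\012'
}

# Usage (the file name is a placeholder):
#   words < file.txt
```

If only letters should be kept, a Unicode-aware tool using properties such as \p{L}, as described on the page linked earlier in the thread, is probably the better route.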