On Tue, 25 Jul 2006, Durval Menezes wrote:
> Here's my submission for a new package: catdoc, a nice Word/Excel/Powerpoint -> plaintext conversion utility.
> The URL to the SRPM is: http://www.durval.com.br/RPMS/el4/catdoc/catdoc-0.94.2-2dm.el4.src.rpm
> I've attached the .spec file.
When we built catdoc for SL3x, I found a problem parsing *some* XLS files -- it incorrectly guessed that some fields used 2-byte lengths, then treated the next byte as a charset descriptor, ended up thinking the data used multi-byte chars, and wandered off the end of the data. The code to 'guess' the header data format is really quite ugly.
[ Incidentally, at least some versions of gnumeric seem to have a very similar issue; maybe that code is from a common source... ]
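
To make the failure mode concrete, here is a minimal sketch in Python. It is not catdoc's code and not the patch mentioned in this message. It assumes a simplified string layout (2-byte little-endian character count, then one flags byte whose low bit selects 16-bit characters), and the names `read_string` and `StringParseError` are hypothetical. The point is only that checking the claimed length against the bytes actually remaining turns "wandering off the end" into a clean error:

```python
import struct


class StringParseError(ValueError):
    """Raised when a string header claims more data than the buffer holds."""


def read_string(buf, offset):
    """Read one string at `offset`; return (text, offset_after_string).

    Assumed layout (a simplification, for illustration only):
      2 bytes  little-endian character count
      1 byte   flags, bit 0 set means 16-bit characters
      N bytes  character data
    """
    if offset + 3 > len(buf):
        raise StringParseError("truncated string header")

    (char_count,) = struct.unpack_from("<H", buf, offset)
    flags = buf[offset + 2]
    wide = bool(flags & 0x01)
    width = 2 if wide else 1

    start = offset + 3
    end = start + char_count * width
    # The bounds check: a wrong guess about the header format usually
    # shows up here as a length that cannot fit in the remaining data.
    if end > len(buf):
        raise StringParseError(
            "string claims %d bytes but only %d remain"
            % (char_count * width, len(buf) - start)
        )

    encoding = "utf-16-le" if wide else "latin-1"
    return buf[start:end].decode(encoding, errors="replace"), end


if __name__ == "__main__":
    good = struct.pack("<HB", 3, 0) + b"abc"
    print(read_string(good, 0))

    # Header misread as "3 wide characters" with only 2 bytes of data left.
    bad = bytes([0x03, 0x00, 0x01]) + b"ab"
    try:
        read_string(bad, 0)
    except StringParseError as exc:
        print("rejected:", exc)
```

A caller that catches the error can then fall back to the other header interpretation, or skip the record, instead of reading past the buffer.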
I applied a small patch which seems to work for us, though I got no reply from the author when I offered it.
We also added a patch to make xls2csv quote even *null* values, which solved a problem for us (though that is probably not what everyone wants...).
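
For anyone unsure what quoting null values changes in the output, here is a small illustration using Python's standard `csv` module. It only mimics the effect on the CSV text and does not reproduce xls2csv or the patch; the helper `format_row` is hypothetical, and empty cells are represented as `None`:

```python
import csv
import io


def format_row(values, quote_all=False):
    """Format one row as a CSV line, treating None as an empty cell."""
    out = io.StringIO()
    quoting = csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL
    writer = csv.writer(out, quoting=quoting, lineterminator="\n")
    # Map None to "" explicitly so the empty cell is handled the same
    # way under both quoting modes.
    writer.writerow(["" if v is None else v for v in values])
    return out.getvalue()


if __name__ == "__main__":
    row = ["name", None, 42]
    print(format_row(row), end="")                  # name,,42
    print(format_row(row, quote_all=True), end="")  # "name","","42"
```

With every field quoted, an empty cell shows up as `""` rather than as nothing between two commas, which some downstream importers handle more reliably; the cost is that numeric fields are quoted too.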
Our rpm also installs wordview under a different name to avoid a clash with another app of the same name (:-). The srpm might (just) be worth glancing at:
http://www.damtp.cam.ac.uk/user/jp107/sl3x-updates/SRPMS/catdoc-0.94.2-2.JSP...
If you want, I can possibly arrange to provide a .XLS file which upsets catdoc (all my existing examples contain personal data, so I'd have to get one made specially -- the ones which cause problems have been e-mailed to us from another site).