Does anyone have experience with Linux tools to parse the text from common non-text file formats for searching? I'm trying to use the KinoSearch add-on for TWiki, which is fine as far as the search goes, but it takes forever to generate the index. It uses xpdf to extract strings from PDFs, antiword for .doc, and, since it is Perl, the Spreadsheet::ParseExcel module for .xls. Some documents parse and index quickly, some extremely slowly, and in the .xls case some seem to hang forever. I suspect the real issue arises when the parsers (correctly or incorrectly) detect a wide character set and the indexer gets confused trying to re-encode it. What is the best approach to debugging something that might be in the Perl character set handlers?
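One way to narrow down which files are slow, hang, or produce oddly encoded output is to run each extractor by hand under a timeout, outside the indexer. The following is only a minimal sketch, written in Python rather than Perl; the function name `probe`, the 30-second default, and the choice of UTF-8 as the encoding to check are assumptions for illustration, not anything the KinoSearch add-on does.

```python
import subprocess
import time


def probe(cmd, timeout_s=30.0):
    """Run one extractor command and report how it behaved.

    cmd is an argv list, e.g. ["pdftotext", "some.pdf", "-"].
    Returns a dict with the elapsed time, whether the timeout fired,
    the exit code, and whether stdout decodes cleanly as UTF-8.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        return {
            "timed_out": True,
            "seconds": time.monotonic() - start,
            "exit_code": None,
            "utf8_ok": None,
        }

    # Check only whether the bytes form valid UTF-8; this says nothing
    # about which encoding the extractor actually intended to emit.
    try:
        proc.stdout.decode("utf-8")
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    return {
        "timed_out": False,
        "seconds": time.monotonic() - start,
        "exit_code": proc.returncode,
        "utf8_ok": utf8_ok,
    }
```

For example, `probe(["pdftotext", "report.pdf", "-"])` (with `report.pdf` standing in for a real file) returns the timing and encoding result for a single document. Files that time out, or whose output is not valid UTF-8, would be the first candidates to feed to the indexer individually.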
On Fri, Aug 28, 2009 at 7:20 AM, Les Mikesell <lesmikesell@gmail.com> wrote:
Does anyone have experience with Linux tools to parse the text from common non-text file formats for searching?
http://www.google.com/url?q=http://en.wikipedia.org/wiki/Pdftotext&ei=qs...
?
Greetings,
On Fri, Aug 28, 2009 at 10:50 PM, Les Mikesell <lesmikesell@gmail.com> wrote:
Does anyone have experience with Linux tools to parse the text from common non-text file formats for searching? I'm trying to use the KinoSearch add-on for TWiki, which is fine as far as the search goes, but it takes forever to generate the index.
I am not sure this directly answers your question.
But I have seen the Lucene.NET SDK (with extensions to scour .doc, .odt, .pdf, etc.) used to very good effect, with pretty decent performance.
HTH
Thanks and Regards
Rajagopal
Rajagopal Swaminathan wrote:
But I have seen the Lucene.NET SDK (with extensions to scour .doc, .odt, .pdf, etc.) used to very good effect, with pretty decent performance.
Wouldn't that have to be run under Windows? I think the 'catdoc' package from the EPEL repo, with catdoc for Word and xls2csv for Excel, may be usable. Apache POI might work too, but it would probably be slow to launch a JVM for every file. I'm not sure anything does Visio, though.
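To illustrate the command-line approach, here is a rough Python sketch that picks an extractor by file extension. The mapping itself is an assumption: it pairs pdftotext (from xpdf), catdoc, and xls2csv with the formats mentioned in this thread, and the names `EXTRACTORS` and `extractor_command` are hypothetical. It only builds the command line; running it is left to the caller.

```python
import os

# Assumed mapping from extension to an extractor's argv builder.
# Each tool writes the extracted text to standard output:
#   pdftotext FILE -   ("-" selects stdout as the text file)
#   catdoc FILE
#   xls2csv FILE
EXTRACTORS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".doc": lambda path: ["catdoc", path],
    ".xls": lambda path: ["xls2csv", path],
}


def extractor_command(path):
    """Return the argv list for extracting text from path, or None
    if no extractor is configured for its extension (e.g. Visio)."""
    ext = os.path.splitext(path)[1].lower()
    build = EXTRACTORS.get(ext)
    return build(path) if build else None
```

Since these are small native programs, launching one per file avoids the JVM start-up cost mentioned above for Apache POI, at the price of one process per document.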
Greetings.
On Mon, Aug 31, 2009 at 10:38 PM, Les Mikesell <lesmikesell@gmail.com> wrote:
Wouldn't that have to be run under Windows?
Indeed. That was where that particular requirement was: one app wanted full-text search on a bunch of .doc etc. files.
But I demonstrated the POC using CentOS with the Sun Java stack and the other dependencies needed to make Apache Solr (a wrapper around the Lucene API) work.
I know I am not being precise enough here, but you get the drift.
I'm not sure anything does Visio, though.
I have not tried that.
Thanks and Regards
Rajagopal