I've got about 10,000 docs I'd like to build a search index for. I found a Perl script called Perlfect that can do that on an old P3, but it takes an astronomical 7 hours. Another script (CGI/Perl) at hotscripts can do the same but allows the "rm -rf /" exploit. D'oh!
Is there anything Perl/flat-file based that can index and search faster? This is a nice job for an aging P3 in the corner, so PHP/MySQL is not an option. Please don't suggest Beagle or Windows solutions, as this is a CentOS 4.3 system.
__________________________________________________ Improve the mailing list by performing a simple search before posting and reading the faq/etiquette. Thank you!!
__________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
In reply to Mike Stankovic (Tue, 11.04.2006, 15:55):
Is htdig an option?
Alexander
In reply to Alexander Dalloz (ad+lists@uni-x.org):
I'll give htdig a try. I was confusing it with htsearch, which I inherited from the CentOS 3 days.
Swish is another one I'm also looking at.
In reply to Mike Stankovic (Tue, 2006-04-11 at 06:55 -0700):
Well, at work we have an archive of ~12K PDFs that engineering uses for process documentation, and I use Swish-e (http://swish-e.org/) to index it so that they can search it. The server it sits on is a PIII 733 with 512MB RAM, and it takes about 90 minutes to re-index them every night.
It works well for us: it supports AND and OR operators, phrase searches and other fairly advanced features.
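For readers unfamiliar with how AND and OR queries work against an index, here is a minimal Python sketch of a toy inverted index. It is not how Swish-e is implemented; the function names and the sample documents are made up for illustration, and phrase search is left out.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search_and(index, *words):
    """Return documents that contain every one of the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


def search_or(index, *words):
    """Return documents that contain at least one of the given words."""
    result = set()
    for w in words:
        result |= index.get(w.lower(), set())
    return result


# Hypothetical sample documents, purely for illustration.
docs = {
    "weld.txt": "Welding process for steel frames",
    "paint.txt": "Paint process for steel panels",
    "ship.txt": "Shipping checklist",
}
index = build_index(docs)
```

With this sample data, `search_and(index, "steel", "process")` matches the two process documents, while `search_or(index, "welding", "shipping")` matches `weld.txt` and `ship.txt`. Building the index once and answering queries from it is what makes the nightly re-index worthwhile.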
The main limitation is that you need a filter to convert whatever the document is to one of the following: text, html or xml so it can be indexed.
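To make the filter idea concrete, here is a rough Python sketch of a dispatcher that decides whether a file can be indexed as-is or has to be converted first. It assumes the `pdftotext` command-line tool is installed for PDFs; the `CONVERTERS` table and the function names are hypothetical and are not part of Swish-e.

```python
import os
import subprocess

# Formats that, per the description above, can be indexed without conversion.
NATIVE = {".txt", ".html", ".htm", ".xml"}

# Hypothetical converter table: extension -> command that writes text to stdout.
# Assumes the external pdftotext tool is available on the system.
CONVERTERS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],
}


def filter_command(path):
    """Return the converter command for path, or None if no filter is needed."""
    ext = os.path.splitext(path)[1].lower()
    if ext in NATIVE:
        return None
    make_command = CONVERTERS.get(ext)
    if make_command is None:
        raise ValueError("no filter configured for %r" % ext)
    return make_command(path)


def to_text(path):
    """Return the bytes to hand to the indexer for this file."""
    command = filter_command(path)
    if command is None:
        with open(path, "rb") as handle:
            return handle.read()
    # Run the external converter and capture what it prints.
    return subprocess.run(command, check=True, stdout=subprocess.PIPE).stdout
```

Supporting another document type would then mean adding one entry to the table, provided a converter to text, HTML or XML exists for it.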
Regards, Paul Berger
_______________________________________________ CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
In reply to Paul (subsolar@subsolar.com):
Yes, Swish-e is in dag's repo and appears to be very well supported upstream. I was right about htsearch: it is one of the components of htdig (also available in rpm format).
Does it have issues with charsets that are not Latin-1 (ISO-8859-1) or plain 7-bit ASCII?
In reply to Mike Stankovic (Wed, 2006-04-12 at 05:47 -0700):
I don't know offhand, but I found the following in the Swish-e FAQ (http://swish-e.org/devel/devel_docs/swish-faq.html):
How do I index non-English words?
Swish-e indexes 8-bit characters only. This is the ISO 8859-1 Latin-1 character set, and includes many non-English letters (and symbols). As long as they are listed in WordCharacters they will be indexed.
Actually, you probably can index any 8-bit character set, as long as you don't mix character sets in the same index and don't use libxml2 for parsing (see below).
The TranslateCharacters directive (SWISH-CONFIG) can translate characters while indexing and searching. You may specify the mapping of one character to another character with the TranslateCharacters directive.
TranslateCharacters :ascii7: is a predefined set of characters that will translate eight-bit characters to ascii7 characters. Using the :ascii7: rule will, for example, translate "Ääç" to "aac". This means: searching "Çelik", "çelik" or "celik" will all match the same word.
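As an illustration of what this kind of folding achieves, here is a small Python sketch that strips accents and lowercases a word. It only approximates the effect described for `:ascii7:` and is not Swish-e's actual translation table; `fold_ascii7` is a made-up name, and letters that have no accent decomposition (such as "ß" or "ø") pass through unchanged.

```python
import unicodedata


def fold_ascii7(word):
    """Strip combining accents and lowercase, approximating an ASCII fold."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()
```

Applying the same fold at index time and at query time is what lets "Çelik", "çelik" and "celik" land on the same index entry.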
Note: When using libxml2 for parsing, parsed documents are converted internally (within libxml2) to UTF-8. This is converted to ISO 8859-1 Latin-1 when indexing. In cases where a string cannot be converted from UTF-8 to ISO 8859-1 (because it contains non-8859-1 characters), the string will be sent to Swish-e in UTF-8 encoding. This will result in some words being indexed incorrectly. Setting ParserWarningLevel to 1 or more will display warnings when UTF-8 to 8859-1 conversion fails.
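The fallback described in that note can be mimicked in a few lines of Python, which may help when checking documents before indexing. This is only a sketch of the behaviour as the FAQ describes it, not Swish-e or libxml2 code, and `to_latin1` is a hypothetical helper name.

```python
def to_latin1(text):
    """Encode text as Latin-1, falling back to UTF-8 bytes when that fails.

    Returns (encoded_bytes, converted_ok).
    """
    try:
        return text.encode("latin-1"), True
    except UnicodeEncodeError:
        # The string contains characters outside ISO 8859-1, so it is passed
        # through as UTF-8, which an 8-bit indexer would read incorrectly.
        return text.encode("utf-8"), False
```

A word such as "çelik" fits in Latin-1 and converts cleanly, while "Łódź" does not and comes back flagged, which is the case the ParserWarningLevel warning is meant to surface.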