[CentOS] Robust Search Solution (with CentOS 4.3)

Thu Apr 13 01:53:36 UTC 2006
Paul <subsolar at subsolar.com>

On Wed, 2006-04-12 at 05:47 -0700, Mike Stankovic wrote:
> --- Paul <subsolar at subsolar.com> wrote:
> 
> > On Tue, 2006-04-11 at 06:55 -0700, Mike Stankovic wrote:
> > > I've got about 10,000 docs I'd like to devise a search/index for.
> > > I found a Perl script called Perlfect that can do that on an old
> > > P3, but at the astronomical time of 7 hours. Another script
> > > (CGI/Perl) at hotscripts can do the same but allows the
> > > "rm -rf /" exploit. DoH!?
> > > 
> > > Is there anything Perl/flatfile that can search/index faster?
> > > This is a nice job for an aging P3 in the corner, so PHP/MySQL is
> > > not an option. Don't suggest beagle/Windows solutions, as this is
> > > a CentOS 4.3 system.
> > 
> > Well, at work we have an archive of ~12K PDFs that engineering uses
> > for process documentation, and I use Swish-e (http://swish-e.org/)
> > to index it so that they can search it. The server it sits on is a
> > PIII 733 with 512MB RAM, and it takes about 90 minutes to re-index
> > them every night.
> > 
> > It works well for us, as it supports AND and OR operators, phrase
> > searches, and other fairly advanced features.
> > 
> > The main limitation is that you need a filter to convert each
> > document to one of the following formats: text, HTML or XML, so it
> > can be indexed.
> > 
> > Regards,
> > Paul Berger
> > 
> > > _______________________________________________
> > > CentOS mailing list
> > > CentOS at centos.org
> > > http://lists.centos.org/mailman/listinfo/centos
> 
> Yes, Swish-e is in Dag's repo and appears to be very well supported
> upstream. I was right about htsearch: it is one of the components of
> htdig (also available in RPM format).
> 
> Does it have issues with charsets other than Latin-1 (ISO-8859-1) or
> plain 7-bit ASCII?

I don't know offhand, but I found the following in the Swish-e FAQ:
http://swish-e.org/devel/devel_docs/swish-faq.html

How do I index non-English words?
Swish-e indexes 8-bit characters only. This is the ISO 8859-1 Latin-1
character set, and includes many non-English letters (and symbols). As
long as they are listed in WordCharacters they will be indexed.

Actually, you probably can index any 8-bit character set, as long as you
don't mix character sets in the same index and don't use libxml2 for
parsing (see below).

The TranslateCharacters directive (SWISH-CONFIG) can translate
characters while indexing and searching. You may specify the mapping of
one character to another character with the TranslateCharacters
directive.

TranslateCharacters :ascii7: is a predefined set of characters that will
translate eight-bit characters to ascii7 characters. Using the :ascii7:
rule will, for example, translate "Ääç" to "aac". This means: searching
"Çelik", "çelik" or "celik" will all match the same word.
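(Another aside from me: the folding the FAQ describes can be approximated
in a few lines of Python using Unicode decomposition. This is a sketch of
the general technique under my own assumptions, not Swish-e's actual
:ascii7: table, and fold_accents is a name I made up. Letters with no
decomposition, such as "ß" or "ø", pass through unchanged here.)

```python
import unicodedata

def fold_accents(word):
    """Lowercase a word and strip accents via NFKD decomposition."""
    # NFKD splits "Ä" into "A" plus a combining diaeresis, and so on.
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

for term in ("Ääç", "Çelik", "çelik", "celik"):
    print(term, "->", fold_accents(term))
```

If both the indexed words and the search terms go through the same
folding, all three spellings of "celik" end up as the same key.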

Note: When using libxml2 for parsing, parsed documents are converted
internally (within libxml2) to UTF-8. This is converted to ISO 8859-1
Latin-1 when indexing. In cases where a string cannot be converted from
UTF-8 to ISO 8859-1 (because it contains non-8859-1 characters), the
string will be sent to Swish-e in UTF-8 encoding. This will result in
some words being indexed incorrectly. Setting ParserWarningLevel to 1 or
more will display warnings when UTF-8 to 8859-1 conversion fails.
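
So for your charset question, it comes down to whether your text fits in
Latin-1. If you want to check your documents before indexing, something
like this Python sketch would flag the words that cannot be represented.
It is just a pre-check I am suggesting, not part of Swish-e, and
latin1_failures is a made-up name.

```python
def latin1_failures(words):
    """Return the words that cannot be encoded as ISO 8859-1 (Latin-1)."""
    failures = []
    for word in words:
        try:
            word.encode("latin-1")
        except UnicodeEncodeError:
            # At least one character falls outside the Latin-1 range.
            failures.append(word)
    return failures

sample = ["Çelik", "naïve", "Łódź", "日本語"]
print(latin1_failures(sample))
```

Anything it reports is the kind of word that, per the FAQ note above,
would probably end up indexed incorrectly.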

