[CentOS] Document Scanning and Storage

Wed Sep 12 23:26:00 UTC 2007
Bill Campbell <centos at celestial.com>

On Wed, Sep 12, 2007, Dennis McLeod wrote:
>I'd like to start scanning our boxed up documents. I'd say about 30,000
>files total.
>Mostly to eliminate the boxes of paper we have. 
>I'd like to scan them, store them, have some sort of index, and be able to
>retrieve them on multiple machines. I think PDF would be the desired format.
>I'd like to be able to set some permissions as well. (not a deal breaker...)
>I've searched Sourceforge, and have seen knowledgetree, myDMS, contineo,
>etc, but really would like to hear from someone that is using something

This is not a trivial operation.

I was a principal in a company that developed a Linux-based system to do
this about 8 years ago, with a product good enough that it made national
news when Bill Gates' home town of Medina, Washington bought a system from
us rather than a Windows-based one.

The scanning can be done pretty nicely using a scanner with an ADF
(Automatic Document Feeder), and xsane has the ability to number pages
skipping numbers so one can scan both sides of two-sided documents in two
passes.  The biggest issue is probably doing the OCR conversion to get text
for indexing.  We used proprietary software from Vividata for this, which
worked pretty well.  I haven't looked seriously at gocr or other open
source OCR software for Linux, so I don't know how well it would work.
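
To make the two-pass numbering concrete, here is a rough sketch in Python
of how the page numbers could fall out.  It is not what xsane does
internally; it assumes that after the first pass the whole stack is
flipped over, so the back sides come through the feeder in reverse order.
The function names, the "page" prefix and the ".pnm" extension are made up
for illustration.

```python
def duplex_page_numbers(sheet_count):
    """Return (front_numbers, back_numbers) for a two-pass duplex scan.

    Assumption: pass 1 scans the front sides in order, then the stack is
    flipped, so pass 2 sees the back of the LAST sheet first.
    """
    if sheet_count < 0:
        raise ValueError("sheet_count must be non-negative")
    # Pass 1: fronts get the odd numbers, counting up by 2.
    fronts = list(range(1, 2 * sheet_count, 2))
    # Pass 2: backs get the even numbers, counting down by 2.
    backs = list(range(2 * sheet_count, 0, -2))
    return fronts, backs


def duplex_filenames(sheet_count, prefix="page", ext="pnm"):
    """Build zero-padded file names for each pass (hypothetical naming)."""
    fronts, backs = duplex_page_numbers(sheet_count)
    # Pad so that a plain alphabetical sort gives reading order.
    width = max(4, len(str(2 * sheet_count)))
    first_pass = [f"{prefix}-{n:0{width}d}.{ext}" for n in fronts]
    second_pass = [f"{prefix}-{n:0{width}d}.{ext}" for n in backs]
    return first_pass, second_pass


if __name__ == "__main__":
    first, second = duplex_filenames(3)
    print("pass 1:", first)
    print("pass 2:", second)
    print("reading order:", sorted(first + second))
```

If your feeder delivers the back sides in the same order as the fronts,
the second pass would instead start at 2 and count up by 2.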

I've been using the ReadIris OCR software on Macs recently, which has some
very nice features such as handling multi-page PDF files well.

If I were to tackle this today, I would probably do it using Plone since it
handles things like indexing and organization well.

INTERNET:   bill at celestial.com  Bill Campbell; Celestial Software LLC
URL: http://www.celestial.com/  PO Box 820; 6641 E. Mercer Way
FAX:            (206) 232-9186  Mercer Island, WA 98040-0820; (206) 236-1676

Breathe fire, slay dragons, and take chances. Failure is temporary, regret
is eternal.