On Wed, 3 Aug 2011, Always Learning wrote:

> On Wed, 2011-08-03 at 11:03 -0700, Todd wrote:
>
>> indeed no, but I want to work on some pattern matching, analysis for a
>> piece of software I have wanted to write for years..
>
> Lots of success and good luck. Do let us know how it goes.

umm -- high speed, automated harvesting of email and running regex
against the corpus to yield, say, a list of currently live addresses
seems to fit the problem description. Why would you wish the creation
of yet another such spammer tool good luck? ;)

That said, procmail can do this trivially, and single-pass filtering of
a million pieces a day is trivial, but the bandwidth to get it to a
single machine is rather high for a residential link ... trivial in a
colo

let's do some science:

From my mailspool, I have 6124 pieces taking up 139,083,522 bytes just
now

[herrold@centos-5 ~]$ echo "( 139083522 / 6124 ) " | bc
22711

so 22k bytes per piece x 1 million ~= 22 G per day

86400 seconds in a day, on the simplifying assumption that one has a
level steady state load (which could be done by setting a peripheral MX
unit to handle the inload). I was handling 750k / day with a central
unit and two MX satellites on RHL 7 with 200 MHz Pentiums and perhaps
64M of RAM in them

[herrold@centos-5 ~]$ echo "22000000000 / 86400" | bc
254629

bytes per second, or about 2 Mbit/s, so a little more than a T-1
(1.544 Mbit/s)

A single Linux box on a 386 with 16M RAM running RHL 4.0 a decade ago
had no problem with such loads. Getting an efficient regex algorithm
would be the choke point

-- Russ herrold
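
For anyone who wants to redo the back-of-envelope numbers above with
their own spool, here is a minimal sketch in Python. The spool size,
piece count and one-million-per-day target are the figures quoted in the
message; the function name estimate_load and the constant names are
made up for illustration, and the T-1 rate is the nominal 1.544 Mbit/s.
It assumes the same level, steady-state load as the estimate above.

```python
# Figures below are copied from the message; names are illustrative only.
SPOOL_BYTES = 139_083_522     # size of the sample mail spool
SPOOL_PIECES = 6_124          # number of messages in that spool
PIECES_PER_DAY = 1_000_000    # target load
SECONDS_PER_DAY = 86_400
T1_BITS_PER_SEC = 1_544_000   # nominal T-1 line rate


def estimate_load(spool_bytes, spool_pieces, pieces_per_day):
    """Return (avg bytes per piece, bytes per day, bytes per second).

    Uses integer division throughout, which matches what bc prints
    with its default scale of 0.
    """
    avg_piece = spool_bytes // spool_pieces
    per_day = avg_piece * pieces_per_day
    per_sec = per_day // SECONDS_PER_DAY
    return avg_piece, per_day, per_sec


if __name__ == "__main__":
    avg_piece, per_day, per_sec = estimate_load(
        SPOOL_BYTES, SPOOL_PIECES, PIECES_PER_DAY
    )
    bits_per_sec = per_sec * 8
    print("average piece:", avg_piece, "bytes")
    print("per day:      ", per_day, "bytes")
    print("per second:   ", per_sec, "bytes =", bits_per_sec, "bit/s")
    print("T-1 multiples:", round(bits_per_sec / T1_BITS_PER_SEC, 2))
```

Note that this keeps the unrounded 22711 bytes per piece, so the
per-second figure comes out somewhat higher than the 254629 obtained
above from the rounded 22k. Either way the conclusion is the same: a
couple of Mbit/s sustained, which is colo territory rather than a
residential link.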