On Wed, 3 Aug 2011, Always Learning wrote:
On Wed, 2011-08-03 at 11:03 -0700, Todd wrote:
Indeed no, but I want to work on some pattern matching and analysis for a piece of software I have wanted to write for years.
Lots of success and good luck. Do let us know how it goes.
umm -- high-speed, automated harvesting of email and running regexes against the corpus to yield, say, a list of currently live addresses seems to fit the problem description. Why would you wish good luck to the creation of yet another such spammer tool? ;)
That said, procmail can do such filtering trivially, and single-pass filtering of a million pieces a day is trivial; but the bandwidth to get it all to a single machine is rather high for a residential link ... trivial in a colo
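For what it's worth, the single-pass extraction described above can be sketched in a couple of lines of shell; the sample message, the /tmp paths, and the address regex here are my own illustrative assumptions, not anything from the actual setup:

```shell
# Hypothetical one-pass address extraction over a message file.
# The message content, paths, and regex are illustrative assumptions.
printf 'From: alice@example.com\nTo: bob@example.org\n\nbody text\n' > /tmp/msg.txt
grep -Eoh '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' /tmp/msg.txt | sort -u
```

`grep -Eo` prints only the matched substrings, so piping through `sort -u` yields a deduplicated address list in one pass over the spool.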
let's do some science:
From my mailspool, I have 6124 pieces taking up 139,083,522 bytes just now
[herrold@centos-5 ~]$ echo "( 139083522 / 6124 )" | bc
22711
so 22k bytes per piece x 1 million ~= 22 G per day
There are 86400 seconds in a day, on the simplifying assumption that one has a level steady-state load (which could be arranged by setting a peripheral MX unit to handle the inbound load). I was handling 750k pieces/day with a central unit and two MX satellites on RHL 7, with 200 MHz Pentiums and perhaps 64M of RAM in them
[herrold@centos-5 ~]$ echo "22000000000 / 86400" | bc
254629
so 254,629 bytes per second -- roughly a T-1 (a bit over, actually: about 2 Mbit/s against a T-1's 1.544 Mbit/s)
A single Linux box on a 386 with 16M of RAM running RHL 4.0 a decade ago had no problem with such loads. Getting an efficient regex algorithm would be the choke point
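On that choke point: one standard trick is to match all patterns in a single scan of the corpus rather than re-running the matcher once per pattern, e.g. `grep -f` with a pattern file. The pattern list and corpus below are made-up examples of mine, not real filter data:

```shell
# Single-pass multi-pattern matching: one scan of the corpus for all patterns.
# The pattern file and corpus are illustrative assumptions.
printf 'lottery\nviagra\n' > /tmp/patterns.txt
printf 'claim your lottery prize now\nhello world\n' > /tmp/corpus.txt
grep -F -f /tmp/patterns.txt /tmp/corpus.txt
```

With `-F` the patterns are fixed strings, which lets grep use its fast multi-string matcher instead of compiling each one as a regex.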
-- Russ herrold