I am going to try an experiment with e-mail aggregation where I expect to receive over 1 million e-mails a day from public lists.
Can anyone shed some light on hard disk space (to retain this e-mail for long periods) and system specs to be able to handle the load?
I am looking to buy a low end box, but one that can hold lots of RAM and accommodate a fair number of HD's to store the e-mail while I try my experiments.
Can anyone provide some realistic specs while maintaining a small budget?
-Jason
indeed no, but I want to work on some pattern matching and analysis for a piece of software I have wanted to write for years.
On Wed, Aug 3, 2011 at 10:59 AM, Always Learning centos@u6.u22.net wrote:
On Wed, 2011-08-03 at 10:53 -0700, Todd wrote:
I am going to try an experiment with e-mail aggregation where I expect to receive over 1 million e-mails a day from public lists.
You're surely not going to read all of them ;-)
-- With best regards,
Paul. England, EU.
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
On Wed, 3 Aug 2011, Always Learning wrote:
On Wed, 2011-08-03 at 11:03 -0700, Todd wrote:
indeed no, but I want to work on some pattern matching and analysis for a piece of software I have wanted to write for years.
Lots of success and good luck. Do let us know how it goes.
umm -- high speed, automated harvesting of email and running regex against the corpus to yield, say, a list of currently live addresses seems to fit the problem description. Why would you wish the creation of yet another such spammer tool good luck? ;)
That said, procmail can do such trivially, and single pass filtering a million pieces a day is trivial, but the bandwidth to get it to a single machine is rather high for a residential link ... trivial in a colo
let's do some science:
From my mailspool, I have 6124 pieces taking up 139,083,522
bytes just now
[herrold@centos-5 ~]$ echo "( 139083522 / 6124 ) " | bc
22711
so 22k bytes per piece x 1 million ~= 22 G per day
86400 seconds in a day, on the simplifying assumption that one has a level steady state load (which could be done by setting a peripheral MX unit to handle the inload). I was handling 750k / day with a central unit and two MX satellites on RHL 7 with 200 MHz Pentiums and perhaps 64M of RAM in them
[herrold@centos-5 ~]$ echo "22000000000 / 86400" | bc
254629
bytes per second
so roughly a T-1
A single Linux box on a 386 with 16M ram running RHL 4.0 a decade ago had no problem with such loads. Getting an efficient regex algorithm would be the choke point
-- Russ herrold
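The back-of-envelope arithmetic in the post above can be reproduced with a short Python sketch. The spool figures (6124 pieces, 139,083,522 bytes) and the 1 million per day target come from the thread. The function and key names are made up for illustration, and the T-1 line rate of 1.544 Mbit/s is the standard figure, not something stated in the post.

```python
SECONDS_PER_DAY = 86_400
T1_BITS_PER_SECOND = 1_544_000  # standard T-1 line rate (assumption, not from the thread)


def estimate_load(spool_bytes, spool_messages, messages_per_day):
    """Estimate daily volume and steady-state bandwidth from a sample mail spool."""
    avg_bytes = spool_bytes // spool_messages           # integer division, like bc
    bytes_per_day = avg_bytes * messages_per_day
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY  # assumes a perfectly level load
    return {
        "avg_bytes": avg_bytes,
        "gb_per_day": bytes_per_day / 1e9,
        "bytes_per_second": bytes_per_second,
        "t1_equivalents": bytes_per_second * 8 / T1_BITS_PER_SECOND,
    }


if __name__ == "__main__":
    est = estimate_load(139_083_522, 6_124, 1_000_000)
    for key, value in est.items():
        print(f"{key}: {value:,.2f}")
```

Because this keeps the unrounded 22711-byte average instead of rounding down to 22k, the per-second figure comes out slightly higher than the bc result above, at somewhat more than one T-1. The order of magnitude is the same.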
Always Learning wrote:
On Wed, 2011-08-03 at 10:53 -0700, Todd wrote:
I am going to try an experiment with e-mail aggregation where I expect to receive over 1 million e-mails a day from public lists.
You're surely not going to read all of them ;-)
He's got a copy of carnivore to read them?
mark
On Wed, 2011-08-03 at 14:05 -0400, m.roth@5-cent.us wrote:
Always Learning wrote:
On Wed, 2011-08-03 at 10:53 -0700, Todd wrote:
I am going to try an experiment with e-mail aggregation where I expect to receive over 1 million e-mails a day from public lists.
You're surely not going to read all of them ;-)
He's got a copy of carnivore to read them?
Not this one ? http://en.wikipedia.org/wiki/Carnivore_%28software%29
and http://en.wikipedia.org/wiki/NarusInsight ?
I am going to try an experiment with e-mail aggregation where I expect to receive over 1 million e-mails a day from public lists.
You're surely not going to read all of them ;-)
He's got a copy of carnivore to read them?
Not this one ? http://en.wikipedia.org/wiki/Carnivore_%28software%29
My first thought here is the movie 'SwordFish'....
Always Learning wrote:
On Wed, 2011-08-03 at 14:05 -0400, m.roth@5-cent.us wrote:
Always Learning wrote:
On Wed, 2011-08-03 at 10:53 -0700, Todd wrote:
I am going to try an experiment with e-mail aggregation where I expect to receive over 1 million e-mails a day from public lists.
You're surely not going to read all of them ;-)
He's got a copy of carnivore to read them?
Not this one ? http://en.wikipedia.org/wiki/Carnivore_%28software%29 and http://en.wikipedia.org/wiki/NarusInsight ?
Yup, that's what I was thinking of. Missed the replacement, and it's been a year or three since news came out on, mmmm, was it slashdot, or usenet, about carnivore....
mark
On Wed, 2011-08-03 at 14:29 -0400, m.roth@5-cent.us wrote:
Not this one ? http://en.wikipedia.org/wiki/Carnivore_%28software%29 and http://en.wikipedia.org/wiki/NarusInsight ?
Yup, that's what I was thinking of. Missed the replacement, and it's been a year or three since news came out on, mmmm, was it slashdot, or usenet, about carnivore....
Oh Golly, I am using an unfortunate name of 'Always Learning' which is what the Feds, NSA and all the others are constantly doing with other people's private affairs ..... :-(
I wonder how effective they are per $1 billion spent. 0.0001% ?
On 8/3/2011 1:34 PM, Always Learning wrote:
On Wed, 2011-08-03 at 14:29 -0400, m.roth@5-cent.us wrote:
Not this one ? http://en.wikipedia.org/wiki/Carnivore_%28software%29 and http://en.wikipedia.org/wiki/NarusInsight ?
Yup, that's what I was thinking of. Missed the replacement, and it's been a year or three since news came out on, mmmm, was it slashdot, or usenet, about carnivore....
Oh Golly, I am using an unfortunate name of 'Always Learning' which is what the Feds, NSA and all the others are constantly doing with other people's private affairs ..... :-(
I wonder how effective they are per $1 billion spent. 0.0001% ?
Good enough to keep them in power. How much more would you expect?
Les Mikesell wrote:
On 8/3/2011 1:34 PM, Always Learning wrote:
On Wed, 2011-08-03 at 14:29 -0400, m.roth@5-cent.us wrote:
Not this one ? http://en.wikipedia.org/wiki/Carnivore_%28software%29 and http://en.wikipedia.org/wiki/NarusInsight ?
Yup, that's what I was thinking of. Missed the replacement, and it's been a year or three since news came out on, mmmm, was it slashdot, or usenet, about carnivore....
Oh Golly, I am using an unfortunate name of 'Always Learning' which is what the Feds, NSA and all the others are constantly doing with other people's private affairs ..... :-(
I wonder how effective they are per $1 billion spent. 0.0001% ?
Good enough to keep them in power. How much more would you expect?
Power? Nahhh... good enough to be able to write buzzword-filled reports, and statistics, so as to keep their budgets up, and their jobs.
mark
On 8/3/2011 1:59 PM, Always Learning wrote:
On Wed, 2011-08-03 at 10:53 -0700, Todd wrote:
I am going to try an experiment with e-mail aggregation where I expect to receive over 1 million e-mails a day from public lists.
You're surely not going to read all of them ;-)
That might even be more difficult than keeping up with the CentOS list!!!! (sorry, and here I am adding to the nonsense)
John Hinton
Google is your friend, but things to think about:
1) distribution of receipt. Don't make the mistake of 1,000,000/(24*60) to spec your network and i/o capacity. Depending on your taste, a distributed file system using iSCSI or one of the cluster filesystems may be a good idea...
2) size of emails, which may affect
3) inode configuration on the disk
4) DNS lookup times
5) SPAM processing load, but maybe you want the spam too
Really need to know what you mean by 'small budget' as well. $100's or $1,000's?
On Wed, 3 Aug 2011, Todd wrote:
I am going to try an experiment with e-mail aggregation where I expect to receive over 1 million e-mails a day from public lists. Can anyone shed some light on hard disk space (to retain this e-mail for long periods) and system specs to be able to handle the load?
I am looking to buy a low end box, but one that can hold lots of RAM and accommodate a fair number of HD's to store the e-mail while I try my experiments.
Can anyone provide some realistic specs while maintaining a small budget?
-Jason
----------------------------------------------------------------------
Jim Wildman, CISSP, RHCE jim@rossberry.com http://www.rossberry.net
"Society in every state is a blessing, but Government, even in its best state, is a necessary evil; in its worst state, an intolerable one." Thomas Paine
On 08/03/11 10:53 AM, Todd wrote:
I am looking to buy a low end box, but one that can hold lots of RAM and accommodate a fair number of HD's to store the e-mail while I try my experiments.
the HP DL180G6 is a nice box for those requirements. 2U server that can be configured with up to 2 x 6-core Xeon 5600 series processors, and up to 96GB RAM without using really expensive memory (has 12 memory slots, 6 per CPU socket, so 6 x 8GB gets you 48GB, 12 x 8GB gets you 96GB), and has either 12 x 3.5" SAS/SATA or 25 x 2.5" SAS/SATA hotswap drives.
a million emails/day is an average of 12/second every single second of the day. I'd wager your file system had better be able to handle 3-4 times that so that bursts are handled gracefully. I'd definitely recommend using raid10 with a fair number of disks for this, as that's lots of small file creates.
what are you doing with this email when you receive it, beyond just saving it?
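As a rough illustration of the rate and headroom reasoning above, here is a minimal Python sketch. The 3-4x burst factor is the figure from the post, and the 22 GB/day default reuses the estimate made earlier in the thread. The function names are hypothetical, and the disk figure only accounts for RAID10 mirroring (raw capacity is twice the usable capacity), not for filesystem or inode overhead.

```python
SECONDS_PER_DAY = 86_400


def message_rates(messages_per_day, burst_factors=(3, 4)):
    """Return the average messages/second and the burst rates to plan for."""
    average = messages_per_day / SECONDS_PER_DAY
    return average, [average * factor for factor in burst_factors]


def raid10_raw_tb(retention_days, gb_per_day=22):
    """Raw disk (TB) needed to keep retention_days of mail on RAID10.

    gb_per_day defaults to the thread's 22 GB/day estimate; RAID10 mirrors
    every block, so raw capacity is double the usable capacity.
    """
    usable_tb = retention_days * gb_per_day / 1000
    return usable_tb * 2


if __name__ == "__main__":
    average, bursts = message_rates(1_000_000)
    print(f"average: {average:.1f} msg/s")
    print("plan for bursts of:", ", ".join(f"{b:.0f} msg/s" for b in bursts))
    print(f"one year on RAID10: {raid10_raw_tb(365):.2f} TB raw")
```

The retention period is the main knob: at these assumed rates, each extra month of stored mail adds well over a terabyte of raw disk.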
Hi John,
what are you doing with this email when you receive it, beyond just saving it?
I plan to analyze the mail, group together e-mails on the same topic, and create a comprehensive answer for each topic. Along the lines of a FAQ for topics that are continually being asked over and over, as well as more advanced, obscure topics that people may want to chime in on.
If I had $500 to spend, not counting money for hard disks, could I even get a machine for that? or do I really need to be scraping more cash together?
-Jason
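One very crude way to start on the grouping described above is to bucket messages by a normalized Subject line. The sketch below is only an illustration of that idea, not the analysis Jason has in mind: the function names are hypothetical, and stripping "Re:", "Fwd:" and bracketed list tags is an assumed heuristic that will miss threads whose subject was changed.

```python
import re
from collections import defaultdict
from email import message_from_string

# Leading reply/forward prefixes and bracketed list tags such as "[CentOS]".
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:|\[[^\]]*\])\s*", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply prefixes and list tags, then lowercase and collapse spaces."""
    previous = None
    while subject != previous:          # repeat until nothing more is stripped
        previous = subject
        subject = _PREFIX.sub("", subject)
    return " ".join(subject.lower().split())


def group_by_topic(raw_messages):
    """Group raw RFC 822 message strings by their normalized subject."""
    topics = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        topics[normalize_subject(msg.get("Subject", ""))].append(msg)
    return dict(topics)
```

For example, "Re: [CentOS] Disk space" and "[CentOS] disk space" land in the same bucket, while anything smarter (matching on body text, or on In-Reply-To headers) would need to be layered on top.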
On 08/03/11 11:20 AM, Todd wrote:
Hi John,
what are you doing with this email when you receive it, beyond just saving it?
I plan to analyze the mail, group together e-mails on the same topic, and create a comprehensive answer for each topic. Along the lines of a FAQ for topics that are continually being asked over and over, as well as more advanced, obscure topics that people may want to chime in on.
If I had $500 to spend, not counting money for hard disks, could I even get a machine for that? or do I really need to be scraping more cash together?
That machine I mentioned, configured with 2 x 6-core 2.8GHz X5660's and 48GB RAM, and the 25 bay SFF (2.5") drive chassis, redundant power, and a P411 RAID card with 1GB flash-backed write cache (equivalent to battery backed, but without needing battery replacements every 3-4 years) was about $8000 with a discount, and no disks.
for $500, you could get a low end desktop computer, or an HP MicroServer.
On Wed, 3 Aug 2011, John R Pierce wrote:
for $500, you could get a low end desktop computer, or an HP MicroServer.
Or lots of used servers to choose from on ebay that are much beefier than the one Russ mentioned.
----------------------------------------------------------------------
Jim Wildman, CISSP, RHCE jim@rossberry.com http://www.rossberry.net
On 8/3/2011 1:20 PM, Todd wrote:
Hi John,
what are you doing with this email when you receive it, beyond just saving it?
I plan to analyze the mail, group together e-mails on the same topic, and create a comprehensive answer for each topic. Along the lines of a FAQ for topics that are continually being asked over and over, as well as more advanced, obscure topics that people may want to chime in on.
Couldn't you do that by walking the list archives or a google search of a topic without receiving a copy of everything yourself?
On 08/04/2011 03:20 AM, Todd wrote:
Hi John,
what are you doing with this email when you receive it, beyond just saving it?
I plan to analyze the mail, group together e-mails on the same topic, and create a comprehensive answer for each topic. Along the lines of a FAQ for topics that are continually being asked over and over, as well as more advanced, obscure topics that people may want to chime in on.
If I had $500 to spend, not counting money for hard disks, could I even get a machine for that? or do I really need to be scraping more cash together?
-Jason
From what was stated previously by RPH where he did the breakdown and shared his 750k/day experience, I'd say you could easily afford to build a system yourself minus the drives. Your problem may be affording the bandwidth to sustain the experiment, depending on where you live (here in Japan fat bandwidth is cheap, but we have trouble connecting to some specific places at high speeds sometimes, for example -- but domestically it is really amazing).
Of course, that addresses receipt of the messages. What sort of computer would be required to do the parsing and scanning in realtime, on the other hand, depends entirely on the sort of routines you want to run. The cheap route is to collect cheaply over a period and stock the messages, and then switch to processing the collected data with whatever resources you have available once you've hit the point of diminishing returns on whatever storage solution you wind up building. In this way you can afford cheap processors if you are willing to pay in time instead of cash.
-Iwao
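The collect-first, process-later route could look something like the following Python sketch, which walks an already collected mbox file with the standard library `mailbox` module and counts regex hits per message. The mbox storage format, the function name, and the choice to skip multipart messages are all assumptions made for illustration; nothing in the thread says how the mail would actually be stored.

```python
import mailbox
import re


def scan_mbox(path, pattern):
    """Yield (subject, hit_count) for each stored plain message whose body matches."""
    regex = re.compile(pattern)             # compile once, reuse for every message
    box = mailbox.mbox(path, create=False)  # raise instead of creating an empty file
    try:
        for msg in box:                     # messages are read back one at a time
            if msg.is_multipart():
                continue                    # keep the sketch simple: plain bodies only
            hits = len(regex.findall(msg.get_payload()))
            if hits:
                yield msg.get("Subject", ""), hits
    finally:
        box.close()
```

Since this runs offline against stored mail, it can be as slow as the hardware requires, which is the time-for-cash trade described above.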
PS: Of course, if you don't mind dealing with dodgy Russians you could probably find a sponsor for just such an effort...
On 8/3/2011 1:13 PM, John R Pierce wrote:
the HP DL180G6 is a nice box for those requirements. 2U server that can be configured with up to 2 x 6-core Xeon 5600 series processors, and up to 96GB RAM without using really expensive memory (has 12 memory slots, 6 per CPU socket, so 6 x 8GB gets you 48GB, 12 x 8GB gets you 96GB), and has either 12 x 3.5" SAS/SATA or 25 x 2.5" SAS/SATA hotswap drives.
a million emails/day is an average of 12/second every single second of the day. I'd wager your file system had better be able to handle 3-4 times that so that bursts are handled gracefully. I'd definitely recommend using raid10 with a fair number of disks for this, as that's lots of small file creates.
Your other problem will be dealing with all the spam you'll get if the email addresses are visible anywhere (and maybe even if they aren't).