Les Mikesell wrote:
nate wrote:
I wrote a few scripts that get CPU usage and feed it into SNMP so my cacti systems can retrieve it.
My company used to rely on the built-in Linux SNMP data for CPU usage (before I was hired), and they complained that it always seemed to max out at 50% (on a dual-CPU system).
I've been using my own method of extracting CPU usage with sar for about 6 years now and it works great. The only downside is that sar keeps being rewritten, and every revision makes the output harder to parse (RHEL 3 was the easiest by far).
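As an illustration of the parsing problem, here is a minimal Python sketch that reads `sar -u` style text and finds the `%idle` column by its header name rather than by a fixed position, so that added or reordered columns are less likely to break it. This is not the script described above. The function name `parse_sar_cpu` and the sample text are hypothetical, and the sketch assumes the output contains a header row with a `%idle` field and an `Average:` summary row.

```python
# Hypothetical sample of `sar -u` output; real layouts differ between sysstat versions.
SAMPLE = """\
12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM     all      1.50      0.00      0.50      0.10      0.00     97.90
Average:        all      1.40      0.00      0.60      0.20      0.00     97.80
"""


def parse_sar_cpu(text):
    """Return CPU busy percent (100 - %idle) from the Average: row."""
    idle_col = None
    busy = None
    for line in text.splitlines():
        fields = line.split()
        if "%idle" in fields:
            # Store the position counted from the end of the row, because the
            # header and data rows can have a different number of leading
            # fields (timestamp, AM/PM marker, "Average:").
            idle_col = fields.index("%idle") - len(fields)
            continue
        if idle_col is not None and fields and fields[0] == "Average:":
            busy = 100.0 - float(fields[idle_col])
    if busy is None:
        raise ValueError("no Average: row with a %idle column found")
    return round(busy, 2)
```

The resulting number could then be handed to whatever mechanism exposes it over SNMP. That part is left out here because the original scripts are not shown.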
Sample graph - http://portal.aphroland.org/~aphro/cacti-cpu.png
That particular cacti server is collecting roughly 20 million data points daily (14,500/minute). It is *heavily* customized for higher scalability.
Have you looked at OpenNMS for this? It's Java with a PostgreSQL backend for some data and jrobin (equivalent to rrd) for some. It needs a lot of RAM and has the same I/O bottleneck as anything else updating large numbers of rrd files, but otherwise it is pretty scalable and includes a lot more features than cacti.
Not recently. The main issue, as you mention, is the I/O bottleneck. I've modified my cacti setup so much that it minimizes the amount of I/O required. I average 9.2 data points per RRD; a lot of other systems (including one I wrote several years ago) typically put 1 data point per RRD, which makes for horrible scaling. The downside is that the amount of management overhead required to add a new system to cacti is obscene, but everything has its trade-offs I guess.
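To make the "several data points per RRD" idea concrete, here is a minimal Python sketch that builds `rrdtool create` and `rrdtool update` command lines for one file holding several data sources (DS), so one update writes all values at once. The helper names, the 60-second step, the GAUGE type and the single one-day RRA are assumptions for illustration, not the setup described above.

```python
def rrd_create_args(path, ds_names, step=60):
    """Build an `rrdtool create` command with one DS per metric in one file."""
    heartbeat = step * 2  # tolerate one missed update before a value is unknown
    args = ["rrdtool", "create", path, "--step", str(step)]
    args += [f"DS:{name}:GAUGE:{heartbeat}:0:U" for name in ds_names]
    args.append("RRA:AVERAGE:0.5:1:1440")  # 1440 one-minute rows = one day
    return args


def rrd_update_args(path, values, timestamp="N"):
    """Build a single `rrdtool update` that writes every metric in `values`."""
    names = list(values)
    template = ":".join(names)  # DS names in the order the values follow
    data = ":".join(str(values[name]) for name in names)
    return ["rrdtool", "update", path, "--template", template,
            f"{timestamp}:{data}"]
```

Each list can be passed to `subprocess.run(args, check=True)`. With four CPU metrics in one file, a polling run touches one RRD per host instead of four, which is where the I/O saving comes from.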
For monitoring our storage array I went even further: the only thing I'm using cacti for is to display the data, and all data collection and RRD updates occur outside of cacti. That's mainly because cacti's architecture wouldn't be able to scale gracefully to gather stats from our array, which would be represented by a single host but has more than 6,000 points of data to collect per minute. Cacti's spine distributes the load on a per-host basis, and a host can't span more than one thread. My system also detects new things automatically as they are added to the array and creates/updates RRDs for them (though the data isn't visible in cacti until they are manually added to the UI).
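One way around a one-thread-per-host limit in an external collector is to split a single host's data points into batches and fetch the batches in parallel. The Python sketch below shows that idea only; it is not the collector described above. `collect_all`, `chunked`, the batch size, the worker count and the caller-supplied `fetch_batch` function are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def collect_all(point_names, fetch_batch, workers=8, batch_size=500):
    """Fetch one host's data points in parallel batches.

    `fetch_batch` is supplied by the caller: it takes a list of point names
    and returns a dict mapping each name to its current value.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs the batches on the worker threads; results are merged
        # here in the calling thread, so no locking is needed.
        for batch_result in pool.map(fetch_batch, chunked(point_names, batch_size)):
            results.update(batch_result)
    return results
```

With 6,000 points and a batch size of 500 this yields 12 batches spread over the workers, so a single large device no longer ties its whole collection run to one thread.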
Even with 14,500 data point updates per minute, the amount of I/O required is trivial; each run takes about 20 seconds (much of that time is data collection). I used to host it on NFS, but the NFS cluster software wasn't optimized for the type of I/O rrdtool does, so it was quite a bit slower. Once I moved to iSCSI (same back end storage as the NFS cluster), performance went up 6x and I/O wait went almost to 0.
At some point I'll get some time to check out other solutions again. For now, at least for my needs, cacti sucks the least (not denying that it does suck), and their roadmap doesn't inspire confidence long term. But as long as things are in RRDs they are portable.
I just wish there was an easier way to provide a UI to rrdtool directly. I used rrdcgi several years ago, but many people are spoiled by the cacti UI, so that's one reason I've gone to it. I'm not a programmer, so my own ability to provide a UI is really limited, but I can make a pretty scalable back end system without much trouble (been using RRD for 6 years now).
nate