nate wrote:
That particular cacti server is collecting roughly 20 million data points daily (14,500/minute). *Heavily* customized for higher scalability.
Have you looked at OpenNMS for this? It's Java with a PostgreSQL backend for some data and jrobin (equivalent to rrd) for some. It needs a lot of RAM and has the same I/O bottleneck as anything else updating large numbers of rrd files, but otherwise it is pretty scalable and includes a lot more features than cacti.
Not recently; the main issue, as you mention, is the I/O bottleneck. I've modified my cacti stuff so much that it minimizes the amount of I/O required. I average 9.2 data points per RRD. A lot of other systems (including one I wrote several years ago) typically put 1 data point per RRD, which makes for horrible scaling.
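To make the scaling argument concrete, here is a minimal sketch of why packing several data sources into one RRD cuts I/O: one `rrdtool update` call writes every value in the file at once. The file name and sample values are made up for illustration, the helper names are hypothetical, and nothing here is taken from the poster's actual setup. The commands are only built, not executed.

```python
import math


def rrd_update_args(rrd_path, values, timestamp="N"):
    """Build the argv for one `rrdtool update` call that writes all values at once.

    Values must be listed in the same order as the data sources were defined
    when the RRD was created (unless a --template is supplied).
    """
    # rrdtool accepts "U" for an unknown reading.
    fields = ["U" if v is None else str(v) for v in values]
    return ["rrdtool", "update", rrd_path, ":".join([str(timestamp)] + fields)]


def file_updates_per_minute(points_per_minute, points_per_rrd):
    """Rough count of RRD files that must be touched each minute."""
    return math.ceil(points_per_minute / points_per_rrd)


if __name__ == "__main__":
    # Hypothetical file with three data sources; the third reading is missing.
    print(rrd_update_args("example_port0.rrd", [1200, 3400, None]))
    # Figures from the thread: 14,500 points/minute, 1 vs. 9.2 points per RRD.
    print(file_updates_per_minute(14500, 1))
    print(file_updates_per_minute(14500, 9.2))
    # Daily total implied by the per-minute rate.
    print(14500 * 60 * 24)
```

With the numbers quoted above, that works out to about 1,600 file updates per minute instead of 14,500, which is where the I/O saving comes from.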
OpenNMS has a 'store-by-group' option that is supposed to help but I haven't tried it because you have to start over with the history. It would be as tunable as anything else as far as what is stored and how often.
The downside is that the amount of management overhead required to add a new system to cacti is obscene, but everything has its trade-offs I guess.
That's one of the beauties of OpenNMS - it will autodiscover ranges and pretty much take care of itself except for grouping related machines for graph pages.
For monitoring our storage array I went even further: the only thing I'm using cacti for is to display the data; all data collection and RRD updates occur outside of cacti. That's mainly because cacti's architecture wouldn't scale gracefully to gather stats from our array, which would be represented by a single host but has more than 6,000 points of data to collect per minute. Cacti's spine distributes the load on a per-host basis, and a host can't span more than one thread. Also, my system automatically detects new things as they are added to the array and creates/updates RRDs for them (though the data isn't visible in cacti until they are manually added to the UI).
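As a rough illustration of getting around the one-thread-per-host limit, the sketch below splits a single host's data points into batches and polls the batches in parallel. This is not the poster's collector: `collect_batch` is a hypothetical stand-in that returns dummy values, and the point names, batch size, and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def collect_batch(batch):
    """Hypothetical collector: a real one would query the array here."""
    return {name: 0.0 for name in batch}


def collect_all(point_names, batch_size=500, workers=8, collector=collect_batch):
    """Poll one host's points in parallel, one batch per worker task."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in batch order; merge them into one dict.
        for partial in pool.map(collector, chunk(point_names, batch_size)):
            results.update(partial)
    return results


if __name__ == "__main__":
    # 6,000 made-up point names standing in for the array's statistics.
    points = ["point%d" % i for i in range(6000)]
    readings = collect_all(points)
    print(len(readings))
```

The unit of work here is a batch of data points rather than a host, so a single large device can keep several threads busy.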
I think OpenNMS defaults to re-probing for new things daily, but that would be tunable.
At some point I'll get some time to check out other solutions again; for now, at least for my needs, cacti sucks the least (not denying that it does suck). And their roadmap doesn't inspire confidence long term. But as long as things are in RRDs they are portable.
OpenNMS has some fairly serious ongoing work.
I just wish that there was an easier way to provide a UI to rrdtool directly. I used to use rrdcgi several years ago, though many people are spoiled by the cacti UI, so that's one reason I've gone to it. I'm not a programmer, so my own abilities to provide a UI are really limited, but I can make a pretty scalable back-end system without much trouble (been using RRD for 6 years now).
It has the option of using rrdtool or jrobin, with the tradeoffs that jrobin is in-process and uses a portable (to Java) file format and rrdtool is an external process with non-portable files - but ones that other tools know something about. There's not much difference in their capabilities or output. The web UI is sort-of separate, running as JSP pages either under Tomcat or embedded Jetty. Part of the ongoing development is aimed at some sort of API around it to make it easier to customise the UI though. Plus it can collect WMI, JMX, and some other stuff that cacti can't, and it integrates with some other things like RT and RANCID.