[CentOS] CPU usage over estimated?

Fri Jun 5 18:16:54 UTC 2009
nate <centos at linuxpowered.net>

Les Mikesell wrote:
> nate wrote:
>>>
>> I wrote a few scripts that get CPU usage and feed it into SNMP for
>> retrieval for my cacti systems.
>>
>> My company used to rely on the built in linux SNMP stuff for cpu
>> usage(before I was hired) and they complained how it always seemed
>> to max out at 50%(on a dual cpu system).
>>
>> I've been using my own methods of CPU usage extraction using sar
>> for about 6 years now and it works great, only downside is sar
>> keeps being re-written and with every revision they make it harder
>> and harder to parse it(RHEL 3 was the easiest by far).
>>
>> Sample graph -
>> http://portal.aphroland.org/~aphro/cacti-cpu.png
>>
>> That particular cacti server is collecting roughly 20 million data
>> points daily(14,500/minute). *Heavily* customized for higher
>> scalability.
>
> Have you looked at OpenNMS for this?  It's java with a postgresql
> backend for some data and jrobin (equivalent to rrd) for some.  It needs
> a lot of RAM and has the same i/o bottleneck as anything else updating
> large numbers of rrd files but otherwise is pretty scalable and includes
> a lot more features than cacti.

Not recently. The main issue, as you mention, is the I/O bottleneck.
I've modified my cacti setup so heavily that it minimizes the amount
of I/O required. I average 9.2 data points per RRD; a lot of other
systems (including one I wrote several years ago) typically put 1
data point per RRD, which makes for horrible scaling. The downside
is that the amount of management overhead required to add a new
system to cacti is obscene, but everything has its trade-offs,
I guess.
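
To illustrate the "several data points per RRD" idea, here is a rough
sketch, not the scripts described in this post. It assumes samples
arrive as (rrd_path, ds_name, value) tuples; the file names, data
source names and timestamp are made up for the example. It only builds
the `rrdtool update` command lines (using the real `--template`
option) and does not run them.

```python
from collections import defaultdict


def build_update_commands(samples, timestamp):
    """Group samples by RRD file and build one `rrdtool update` per file.

    samples: iterable of (rrd_path, ds_name, value) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    grouped = defaultdict(list)
    for rrd_path, ds_name, value in samples:
        grouped[rrd_path].append((ds_name, value))

    commands = []
    for rrd_path, pairs in sorted(grouped.items()):
        # One write per file carries every data source stored in it.
        template = ":".join(name for name, _ in pairs)
        values = ":".join(str(value) for _, value in pairs)
        commands.append(
            ["rrdtool", "update", rrd_path,
             "--template", template,
             f"{timestamp}:{values}"]
        )
    return commands


# Hypothetical samples: four data points landing in two RRD files.
samples = [
    ("web01_cpu.rrd", "user", 12.5),
    ("web01_cpu.rrd", "system", 3.1),
    ("web01_cpu.rrd", "iowait", 0.4),
    ("web01_load.rrd", "load1", 0.72),
]
commands = build_update_commands(samples, 1244225700)
for command in commands:
    print(" ".join(command))
```

Four data points turn into two file updates here instead of four,
which is where the I/O saving comes from.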

For monitoring our storage array I went even further: the only
thing I'm using cacti for is to display the data. All data
collection and RRD updates occur outside of cacti, mainly because
cacti's architecture wouldn't be able to scale gracefully to
gather stats from our array, which would be represented by a single
host but has more than 6,000 points of data to collect per
minute. Cacti's spine distributes the load on a per-host
basis, and a host can't span more than one thread. My system also
detects new things automatically as they are added to the array
and creates/updates RRDs for them (though the data isn't visible in
cacti until they are manually added to the UI).
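
One possible way around a one-thread-per-host limit is to split a
single host's data points into chunks and collect the chunks in
parallel. The sketch below is only an assumption about how that could
look; the post does not say how the external collector is built.
`collect_chunk` is a hypothetical stand-in for whatever actually
queries the array, and the object names, worker count and chunk size
are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def collect_chunk(object_ids):
    """Hypothetical collector: a real one would query the array here."""
    return {object_id: 0.0 for object_id in object_ids}


def collect_host(object_ids, workers=8, chunk_size=500):
    """Collect all data points for ONE host using several threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(collect_chunk, chunk(object_ids, chunk_size)):
            results.update(partial)
    return results


# Hypothetical object names standing in for 6,000 array data points.
object_ids = [f"lun{n}" for n in range(6000)]
stats = collect_host(object_ids)
print(len(stats), "data points collected")
```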

Even with 14,500 data point updates per minute, the amount of I/O
required is trivial; each run takes about 20 seconds (much of
that time is data collection). I used to host it on NFS, but
the NFS cluster software wasn't optimized for the type of I/O
rrdtool does, so it was quite a bit slower. Once I moved to
iSCSI (same back-end storage as the NFS cluster), performance
went up 6x, and I/O wait went almost to 0.
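
For a sense of scale, the figures quoted in this thread can be worked
through as below. The RRD file count is an estimate only: it assumes
every data point is updated once per minute and that the 9.2 average
applies to this same server, neither of which is stated outright.

```python
points_per_minute = 14_500
points_per_rrd = 9.2          # average quoted in this post
run_seconds = 20              # approximate duration of one polling run

# 1,440 minutes in a day.
points_per_day = points_per_minute * 60 * 24

# Estimated number of RRD files written per run (assumption, see above).
rrd_files = points_per_minute / points_per_rrd

# Average collection-plus-update rate while a run is in progress.
points_per_second = points_per_minute / run_seconds

print(f"{points_per_day:,} data points per day")
print(f"about {rrd_files:,.0f} RRD files per run")
print(f"{points_per_second:.0f} data points per second during a run")
```

That works out to 20,880,000 points a day, which matches the
"roughly 20 million" quoted earlier.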

At some point I'll get some time to check out other solutions
again. For now, at least for my needs, cacti sucks the least
(not denying that it does suck), and their road map doesn't
inspire confidence long term. But as long as things are in
RRDs they are portable.

I just wish there were an easier way to provide a UI to
rrdtool directly. I used to use rrdcgi several years ago, but
many people are spoiled by the cacti UI, so that's one reason I've
gone to it. I'm not a programmer, so my own abilities to provide
a UI are really limited, but I can make a pretty scalable back-end
system without much trouble (been using RRD for 6 years now).

nate