Les Mikesell wrote:
Can you be more specific about how snmp is wrong and what you do to get a more accurate value? Is it just that the snmp value needs to be scaled by the number of processors?
Seems like the SNMPD included in CentOS 5.x has improved somewhat vs v4.
From the FAQ
What do the CPU statistics mean - is this the load average?
----------------------------------------------------------
No. Unfortunately, the original definition of the various CPU statistics was a little vague. It referred to a "percentage", without specifying what period this should be calculated over. It was therefore implemented slightly differently on different architectures.
Recent releases includes "raw counters", which can be used to calculate the percentage usage over any desired period. This is the "right" way to handle things in the SNMP model. The original flawed percentage objects should not be used, and will be removed in a future release of the agent.
Note that this is different from the Unix load average, which is available via the loadTable, and is supported on all architectures.
---
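To make the "raw counters" point concrete, here is a minimal sketch of turning two samples of the raw tick counters into percentages over whatever polling interval you choose. It assumes the counters are 32-bit and wrap around, and that you have already fetched them somehow. The dictionary keys and the sample numbers are made up for illustration and are not real agent output.

```python
COUNTER_MODULUS = 2 ** 32  # assumption: the raw counters are 32-bit and wrap


def cpu_percentages(prev, curr):
    """Return per-state CPU percentages between two raw counter samples."""
    # Modular subtraction keeps the delta correct across a counter wrap.
    deltas = {k: (curr[k] - prev[k]) % COUNTER_MODULUS for k in curr}
    total = sum(deltas.values())
    if total == 0:
        # No ticks elapsed between samples; report zero rather than divide by 0.
        return {k: 0.0 for k in deltas}
    return {k: 100.0 * d / total for k, d in deltas.items()}


# Hypothetical samples taken one polling interval apart.
first = {"user": 100, "nice": 0, "system": 50, "idle": 850}
second = {"user": 150, "nice": 0, "system": 75, "idle": 1775}
print(cpu_percentages(first, second))
```

The percentages always cover exactly the interval between the two polls, which is the property the old "percentage" objects could not guarantee.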
Older versions would basically spit out random values for CPU usage. For about the past 5 years I have used scripts, run from cron, that run sar, parse the output, and write the results to a file. SNMP is then configured to tail that file when a particular OID is queried. This has given me really dependable results over the years.
[root@us-cfe002:/home/monitor/stats]# tail -n 1 *
==> disk.usage <==
DISK_T:60707 DISK_U:9567

==> mem.usage <==
RAM_T:3950 RAM_F:2732 RAM_B:58 RAM_C:731 SWAP_T:8189 SWAP_U:0

==> sar.usage <==
USER:0.01 NICE:0.00 SYS:0.01 IO:0.00 FAULT:41.16 TCPSOCK:21
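For anyone wanting to try the same approach, the cron-side parsing might look roughly like the sketch below. This is not the actual script. It assumes sar prints a header row containing column names such as %user and %iowait followed by an "Average:" row, and the sample text is invented for illustration. It only covers the CPU fields, not FAULT or TCPSOCK. Looking columns up by header name rather than by position is meant to cope with columns being added between sar revisions.

```python
def parse_sar_cpu(text):
    """Map sar CPU column names (without '%') to floats from the Average: row."""
    names = []
    for line in text.splitlines():
        fields = line.split()
        if "%user" in fields:
            # Header row: remember the percentage columns in order.
            names = [f.lstrip("%") for f in fields if f.startswith("%")]
        elif names and fields and fields[0] == "Average:":
            # Values line up with the header from the right-hand side.
            values = fields[-len(names):]
            return dict(zip(names, (float(v) for v in values)))
    raise ValueError("no CPU average row found")


def format_usage(cpu):
    """Render the CPU part of a line in the KEY:value style shown above."""
    return "USER:{user:.2f} NICE:{nice:.2f} SYS:{system:.2f} IO:{iowait:.2f}".format(**cpu)


# Invented sample, laid out the way the parser assumes.
SAMPLE = """\
12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:00:02 AM     all      0.01      0.00      0.01      0.00      0.00     99.98
Average:        all      0.01      0.00      0.01      0.00      0.00     99.98
"""

print(format_usage(parse_sar_cpu(SAMPLE)))
```

In a real cron job the text would come from running sar and the formatted line would be appended to the stats file.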
Last I checked, the SNMP daemon also didn't return CPU I/O wait values, which are pretty handy to have.
Then I have a script that queries that data (along with other data) and feeds it into cacti as a single set of results (stored in 1 RRD file), which really helps cacti scale.
[cacti@dc1-mon002:~/bin]$ ./linux-basics-net.pl us-cfe002 public
USER:0.01 NICE:0.00 SYS:0.02 IO:0.00 FAULT:61.78 TCPSOCK:21 RAM_T:3950 RAM_F:2732 RAM_B:58 RAM_C:731 SWAP_T:8189 SWAP_U:0 DISK_T:60707 DISK_U:9567 1MIN:0.00 5MIN:0.00 15MIN:0.00 E0_IN:747203652 E0_OUT:520021358 E1_IN:0 E1_OUT:0
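The idea of collapsing several stat lines into one result line can be sketched as below. This is only an illustration of the merging step, not what linux-basics-net.pl actually does. The function name is hypothetical, and it assumes every field is a whitespace-separated KEY:value token.

```python
def merge_stat_lines(lines):
    """Combine several KEY:value lines into one space-separated line."""
    merged = {}
    for line in lines:
        for token in line.split():
            key, sep, value = token.partition(":")
            if not sep or not key:
                continue  # skip anything that is not KEY:value
            # A later line overrides an earlier one for the same key.
            merged[key] = value
    # Values stay as strings so formatting such as "0.00" is preserved.
    return " ".join(f"{k}:{v}" for k, v in merged.items())


print(merge_stat_lines([
    "USER:0.01 NICE:0.00 SYS:0.02 IO:0.00",
    "RAM_T:3950 RAM_F:2732",
    "DISK_T:60707 DISK_U:9567",
]))
```

Returning everything in one line means a single poll per host, with all fields landing in the same RRD file.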
Unfortunately, with every passing revision of sar the output becomes more and more difficult to parse. I really miss the version from the RHEL 3 days. That one was great: it had a special human-readable output option, since taken out, that would spit out each stat on its own line, making it easy to parse.
nate