[CentOS] Monitoring IO -- vmstat doesn't match snmp

Mon Nov 7 16:53:08 UTC 2011

I made the mistake of looking at disk IO numbers in two different ways --
now I'm confused, because they give inconsistent answers.

First way was using 'vmstat 10'.  This gave me:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 2162944 4071928 162444 4218456    0    0     0   286 1103  528  3  2 95  0  0
 1  0 2162944 4071976 162448 4218440    0    0     0   301 1102  548  2  4 95  0  0
 2  0 2162944 4074488 162456 4218448    0    0     0   252 1097  501  1  4 96  0  0
 2  0 2162944 4081572 162480 4218508    0    0     0   430 1145 1006  2  3 95  0  0
 2  0 2162944 4079340 162488 4218508    0    0     0   354 1148  604  2  3 95  0  0
 1  0 2162944 4082604 162492 4218512    0    0     0   258 1105  446  1  4 96  0  0
 1  0 2162944 4084052 162500 4218520    0    0     0   300 1101  482  1  4 95  0  0
 1  0 2162944 4080652 162500 4218536    0    0     0   393 1118  585  1  3 95  0  0
 1  0 2162944 4081160 162500 4218536    0    0     0   304 1100  462  0  4 95  0  0
 1  0 2162944 4075636 162508 4218536    0    0     0   214 1132  397  0  4 96  0  0
 3  0 2162944 4081640 162516 4218540    0    0     0   332 1111  554  2  3 94  0  0
 1  0 2162944 4075104 162516 4218552    0    0     0   382 1179  566  2  3 95  0  0

The "bo" column, block out, is described in the man page as being blocks
per second.  I believe the blocks are 512 bytes.
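
For reference, here is a minimal sketch of the conversion I have in mind for
the "bo" column. The helper name bo_to_bytes is made up for illustration, and
the block size is an argument because 512 bytes is only my assumption; pass a
different size as the second argument if that turns out to be wrong.

```shell
#!/bin/bash
# Hypothetical helper: convert a vmstat "bo" reading (blocks/sec) to bytes/sec.
# The block size defaults to 512, which is an assumption, not a confirmed fact.
bo_to_bytes() {
    local blocks=$1 blocksize=${2:-512}
    echo $(( blocks * blocksize ))
}

# First three "bo" readings copied from the vmstat output above.
for bo in 286 301 252; do
    printf "%5d blocks/s = %8d bytes/s\n" "$bo" "$( bo_to_bytes "$bo" )"
done
```

With a 512-byte block, the first reading of 286 blocks/sec works out to
146432 bytes/sec.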

Okay; but then I used SNMP to fetch 1.3.6.1.4.1.2021.11.57.0
(ssIORawSent).  That's an incrementing counter of blocks sent.  I'm
fetching it every 10 seconds, same as before.

3,204,124,952   1,603    820,736
3,204,139,960   1,500    768,000
3,204,155,848   1,588    813,056
3,204,164,600     875    448,000
3,204,184,536   1,993  1,020,416
3,204,194,184     964    493,568
3,204,204,040     896    458,752
3,204,218,696   1,465    750,080
3,204,235,224   1,652    845,824

The first column is the counter.  The second column is the difference
between successive readings divided by the actual number of seconds
elapsed (this corrects for imprecision in the sleep, though when I
monitored that, it was hitting the exact second consistently), i.e. the
second column is blocks per second.  The third column is bytes per
second, assuming a 512-byte block.
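
To make the arithmetic concrete, here is a small sketch that recomputes one
row of the table from two successive counter readings. The function name
rate is made up for illustration. The wraparound handling assumes the
counter is 32 bits wide, which I have not verified for this agent; it does
not matter for the readings shown here.

```shell
#!/bin/bash
# Hypothetical helper: per-second rate between two counter samples.
# Arguments: previous counter, current counter, elapsed seconds.
rate() {
    local prev=$1 cur=$2 elapsed=$3
    local delta=$(( cur - prev ))
    # Assumption: a 32-bit counter, so a negative delta means it wrapped.
    if (( delta < 0 )); then
        delta=$(( delta + 4294967296 ))
    fi
    # Integer division, the same as in the table.
    echo $(( delta / elapsed ))
}

# First two counter readings from the table, taken 10 seconds apart.
bsec=$( rate 3204124952 3204139960 10 )
echo "$bsec blocks/sec, $(( bsec * 512 )) bytes/sec"
```

This prints "1500 blocks/sec, 768000 bytes/sec", which matches the second
row of the table.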

You'll note that the two sets of blocks-per-second figures don't agree:
SNMP reports roughly 900 to 2,000, while vmstat reports roughly 200 to 430.

The two sets of samples overlap in time, and the readings before and after
them are similar.

So what's up with that?

(Here's my monitoring code that produced the second set of figures, in
case I did something dumb-ass:

#! /bin/bash
set -e

HOST=prcapp01
secs=10
lbc=0
lts=0

echo "blockcount bl/sec bytes/sec"
while true; do
    bc=$( snmpget -v 2c -c xxx $HOST 1.3.6.1.4.1.2021.11.57.0 | cut -d' ' -f4 )
    ts=$( date +%s )
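    # Numeric test; [[ $lbc > 0 ]] would compare as strings.
    # The ts > lts check avoids a division by zero.
    if (( lbc > 0 && ts > lts )); then
	# Plain assignments: under set -e, (( bsec = ... )) aborts the
	# script whenever the result is 0.
	bsec=$(( ( bc - lbc ) / ( ts - lts ) ))
	bytes=$(( bsec * 512 ))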
	#echo $lbc $bc $bsec $bytes $lts $ts
	printf "%'12d %'7d %'10d\n" $bc $bsec $bytes
    fi
    lbc="$bc"
    lts="$ts"
    sleep $secs
done

I have obfuscated the read-only community name.)

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info