[CentOS] bizarre performance problem

Wed May 11 22:43:08 UTC 2011
Mag Gam <magawake at gmail.com>

I had a rather strange problem last week with one of our 8-core
servers. The users complained that performance was "slow", so I
checked the basic things: processes in top, vmstat for memory and
context switching, I/O stats for the internal disks, netstat for any
network issues, and network throughput by copying a large file (1 GB)
across the network.

It turned out I had an NMI-related issue on the processor. I figured
this out by checking /var/log/messages, but it was a real mystery to
me at first. My question: is there a way to test or benchmark a system
and all of its processors to make sure I don't miss this type of
error again? I am not necessarily looking for monitoring tools, but
more for techniques such as running a while loop on every
processor/core to check that they all take a consistent amount of
time.
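
The per-core timing idea could be sketched roughly as follows. This is
only an illustration, not a tested diagnostic: it assumes Linux and
Python 3.3 or later (for os.sched_setaffinity), and the function names,
the iteration count and the 1.5x tolerance are all made up for the
example. It pins the process to each allowed CPU in turn, times the
same fixed loop, and flags any CPU that is much slower than the middle
value.

```python
import os
import time


def timed_busy_loop(iterations):
    """Run a fixed amount of integer work and return the elapsed seconds."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i
    return time.perf_counter() - start


def time_each_cpu(iterations=2_000_000, repeats=3):
    """Pin this process to each allowed CPU in turn and time the same loop.

    Returns a dict mapping CPU number to the best (smallest) time seen.
    Linux only: relies on os.sched_getaffinity / os.sched_setaffinity.
    """
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})
            results[cpu] = min(
                timed_busy_loop(iterations) for _ in range(repeats)
            )
    finally:
        # Restore the original CPU mask even if something fails.
        os.sched_setaffinity(0, original)
    return results


def find_outliers(results, tolerance=1.5):
    """Return CPUs whose best time exceeds tolerance times the middle value.

    The tolerance is an arbitrary illustrative threshold, not a standard.
    """
    times = sorted(results.values())
    middle = times[len(times) // 2]
    return [cpu for cpu, t in sorted(results.items()) if t > tolerance * middle]


if __name__ == "__main__":
    results = time_each_cpu()
    for cpu, seconds in sorted(results.items()):
        print(f"cpu{cpu}: {seconds:.4f} s")
    print("suspect CPUs:", find_outliers(results))
```

A loop like this only shows relative slowness under whatever else the
machine is doing at the time, so it would have to be run on an
otherwise idle system, and a clean result would not rule out a
hardware fault.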

TIA