hi all,
below is the output of top off a machine i have, sorted by CPU utilization.
my question is, how can all 8 CPUs say they're pegged at 100% user utilization, when there are just these 8 processes saying they're using 25% of a CPU each, and no other processes showing up taking any resources? i have run this test on other machines and they act as expected. i found this b/c this node was running a lot slower for some reason, and i can't figure out why. there are no error messages anywhere...
top - 00:16:34 up 8 days, 2:05, 1 user, load average: 8.40, 8.31, 8.24
Tasks: 156 total, 9 running, 147 sleeping, 0 stopped, 0 zombie
Cpu0 : 97.2%us, 1.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.4%si, 0.0%st
Cpu1 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 98.6%us, 1.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 98.6%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 1.4%si, 0.0%st
Cpu6 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16424052k total, 13900392k used, 2523660k free, 10560k buffers
Swap: 499988k total, 152k used, 499836k free, 28700k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 8568 jgreen    25   0 2089m 1.7g 4184 R   23 10.5 109:06.90 xhpl
 8572 jgreen    25   0 2080m 1.6g 4184 R   23 10.5 109:12.40 xhpl
 8574 jgreen    25   0 2079m 1.6g 4220 R   23 10.5 109:07.43 xhpl
 8575 jgreen    25   0 2063m 1.6g 4168 R   23 10.4 109:08.12 xhpl
 8570 jgreen    25   0 2084m 1.6g 4188 R   23 10.5 109:07.52 xhpl
 8571 jgreen    25   0 2067m 1.6g 4188 R   23 10.4 109:08.51 xhpl
 8569 jgreen    25   0 2072m 1.6g 4196 R   22 10.4 109:07.77 xhpl
 8573 jgreen    25   0 2062m 1.6g 4204 R   22 10.4 109:08.23 xhpl
 4457 root      15   0 12056 1424  992 S    0  0.0   8:01.62 pbs_mom
    1 root      15   0 10316  792  660 S    0  0.0   0:02.74 init
    2 root      RT  -5     0    0    0 S    0  0.0   0:00.01 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.01 ksoftirqd/0
    4 root      RT  -5     0    0    0 S    0  0.0   0:00.02 watchdog/0
    5 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/1
    6 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1
thanks, --Joe
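To put a number on the mismatch described above, here is a minimal Python sketch that adds up the figures copied from the posted top output. It assumes the %CPU column is in top's default Irix mode (percent of a single CPU), which the post does not state; the function name `accounting_gap` is made up for illustration.

```python
# Figures copied from the top output in the post.
# Assumption: %CPU is in Irix mode, i.e. percent of ONE CPU.
cpu_user_pct = [97.2, 100.0, 100.0, 98.6, 100.0, 98.6, 100.0, 100.0]
xhpl_pct = [23, 23, 23, 23, 23, 23, 22, 22]


def accounting_gap(per_cpu_user, per_process):
    """Return (total user time reported per CPU,
    total attributed to the listed processes, difference),
    all in percent-of-one-CPU units."""
    busy = sum(per_cpu_user)
    attributed = sum(per_process)
    return busy, attributed, busy - attributed


busy, attributed, gap = accounting_gap(cpu_user_pct, xhpl_pct)
print(f"user time across all CPUs: {busy:.1f}%")
print(f"attributed to xhpl:        {attributed}%")
print(f"unaccounted for:           {gap:.1f}%")
```

Under that assumption the eight xhpl processes account for roughly 182 of about 794 percentage points of user time, so most of the reported load is not attributed to any listed process.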
Greenseid, Joseph M (IS) wrote:
> hi all,
> below is the output of top off a machine i have, sorted by CPU utilization.
> my question is, how can all 8 CPUs say they're pegged at 100% user utilization, when there are just these 8 processes saying they're using 25% of a CPU each, and no other processes showing up taking any resources? i have run this test on other machines and they act as expected. i found this b/c this node was running a lot slower for some reason, and i can't figure out why. there are no error messages anywhere...
What happens if you stop, say half, of the xhpl processes?
Mogens
> > below is the output of top off a machine i have, sorted by CPU utilization.
> > my question is, how can all 8 CPUs say they're pegged at 100% user utilization, when there are just these 8 processes saying they're using 25% of a CPU each, and no other processes showing up taking any resources? i have run this test on other machines and they act as expected. i found this b/c this node was running a lot slower for some reason, and i can't figure out why. there are no error messages anywhere...
> What happens if you stop, say half, of the xhpl processes?
The way they interact, I can't really stop them while they're running, but I can run a configuration where there are only 4 of them running on the node. Would that give you similar information to what you wanted to see, or not be worth it?
--Joe
Greenseid, Joseph M (IS) wrote: ...
> The way they interact, I can't really stop them while they're running, but I can run a configuration where there are only 4 of them running on the node. Would that give you similar information to what you wanted to see, or not be worth it?
Hm, another thing: Have you used the I toggle in top (IRIX vs. Sun)?
Mogens
> Hm, another thing: Have you used the I toggle in top (IRIX vs. Sun)?
toggling "I" doesn't give me anything that would explain my confusion.
I have a few dozen systems with exactly matching hardware and exactly matching OS installations (image based installs). This one system is the only one that is showing these results -- when toggling the I switch, all the others act the same, while this one still shows anomalous results.
From what I can see, this one node is running slow. I don't have any MCE entries or anything to indicate a hardware fault; it just started running slow one day, and when I was investigating I saw the top output and was baffled by it.
--Joe
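For reference, the "I" toggle mentioned above only rescales the per-process %CPU column: in Irix mode a task's usage is shown as a percentage of one CPU, and in Solaris mode the same figure is divided by the number of CPUs. The Python sketch below works through that arithmetic for the 8-CPU node in this thread; the helper names are hypothetical and are not part of top.

```python
N_CPUS = 8  # CPU count of the node in the thread


def irix_to_solaris(pct_of_one_cpu, n_cpus):
    """Convert an Irix-mode %CPU (percent of one CPU)
    to Solaris mode (percent of the whole machine)."""
    return pct_of_one_cpu / n_cpus


def solaris_to_irix(pct_of_machine, n_cpus):
    """Convert a Solaris-mode %CPU back to percent of one CPU."""
    return pct_of_machine * n_cpus


# A single-threaded process saturating one core would read:
full_core_irix = 100.0
full_core_solaris = irix_to_solaris(full_core_irix, N_CPUS)
print(f"one busy core: {full_core_irix}% (Irix), {full_core_solaris}% (Solaris)")

# The value actually reported for each xhpl process was about 23.
observed = 23
print(f"if {observed}% were Solaris mode: "
      f"{solaris_to_irix(observed, N_CPUS)}% of one CPU")
```

A process keeping one core fully busy would read 100% in one mode or 12.5% in the other, so the reported 23% matches neither, which is consistent with the toggle not explaining the output.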
Greenseid, Joseph M (IS) wrote: ...
> From what I can see, this one node is running slow. I don't have any MCE entries or anything to indicate a hardware fault; it just started running slow one day, and when I was investigating I saw the top output and was baffled by it.
Can you swap the disk drives with a node that works OK?
Mogens
On Mon, 2009-03-23 at 23:22 +0100, Mogens Kjaer wrote:
> Greenseid, Joseph M (IS) wrote: ...
> > From what I can see, this one node is running slow. I don't have any MCE entries or anything to indicate a hardware fault; it just started running slow one day, and when I was investigating I saw the top output and was baffled by it.
Have you compared BIOS setups? Something tells me this is likely interrupt related. Could even be poorly seated hardware: memory, power or data cables, video, etc.
IIRC, BIOS settings can also have an effect there.
> Can you swap the disk drives with a node that works OK?
> Mogens
HTH
> Can you swap the disk drives with a node that works OK?
i will try that when i get back to the data center and can swap the nodes.
> Have you compared BIOS setups? Something tells me this is likely interrupt related. Could even be poorly seated hardware: memory, power or data cables, video, etc.
> IIRC, BIOS settings can also have an effect there.
all the bios were flashed to the same version and configured the same before we started with these systems.
something must be wrong with the node for it to run slow, but i just didn't understand how top was showing such weird output.
thanks for the suggestions.
--Joe
on 3-24-2009 7:30 AM Greenseid, Joseph M (IS) spake the following:
> > Can you swap the disk drives with a node that works OK?
> i will try that when i get back to the data center and can swap the nodes.
> > Have you compared BIOS setups? Something tells me this is likely interrupt related. Could even be poorly seated hardware: memory, power or data cables, video, etc.
> > IIRC, BIOS settings can also have an effect there.
> all the bios were flashed to the same version and configured the same before we started with these systems.
> something must be wrong with the node for it to run slow, but i just didn't understand how top was showing such weird output.
> thanks for the suggestions.
Could it be something simpler, like a memory module in the wrong slot causing the system to not page-interleave the RAM? I have had this happen when I depended on others to set up multiple systems.