I have a C6 server acting as a kvm-host.
When connecting with ssh the console is extremely slow and hangs for minutes at a time. Connecting to this server is not the problem.
If I use: ssh root@host "whatever" I get an immediate response, even while interactive consoles opened with ssh are hanging.
Linux [...] 2.6.32-504.3.3.el6.x86_64 #1 SMP Wed Dec 17 01:55:02 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
             total       used       free     shared    buffers     cached
Mem:            47         35         11          0          0          0
-/+ buffers/cache:         35         11
Swap:            7          0          7

Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/vg-lv_root   50G  6,4G   41G  14% /
tmpfs                    24G     0   24G   0% /dev/shm
/dev/sda1               477M  123M  329M  28% /boot
13:33:34 up 1 day, 18:30, 2 users, load average: 3.39, 2.53, 2.36
(it's an 8-core)
Nothing particular in log/messages.
The vm's are running normally and they are not showing the same behaviour.
Can anybody give me a pointer?
Thanks Patrick
On 28-01-2015 11:15, Patrick Bervoets wrote:
I have a C6 server acting as a kvm-host.
When connecting with ssh the console is extremely slow and hangs for minutes at a time. Connecting to this server is not the problem.
If I use: ssh root@host "whatever" I got immediate response even when interactive consoles opened with ssh are hanging.
Sorry, is it hanging during the session or while attempting to establish a new one? If the latter, it may be DNS, and ssh -v may help. The former is weird; I don't think I have ever seen it.
Marcelo
On 28-01-15 at 17:20, Marcelo Ricardo Leitner wrote:
Sorry, is it hanging during the session or while attempting to establish a new one? If this last, it may be dns and ssh -v may help. The former is weird, I don't think I ever saw it.
Marcelo
Marcelo,
It hangs during the session. Once I'm logged in and beginning to type it displays 3-5 chars and then hangs for up to 15 minutes, a few more chars, wait, and so on. Checked my resolv.conf; added 'options single-request-reopen' though I don't know if that is helping.
Yes, it is weird; even more so because individual commands sent with ssh give an immediate response.
Thanks Patrick
Hi Patrick, have you tried to find out on which side the hang is, the client's or the server's, using tcpumg or the like? That might help a bit further on.
suomi
On 01/28/2015 05:41 PM, Patrick Bervoets wrote:
Marcelo,
It hangs during the session. Once I'm logged in and beginning to type it displays 3-5 chars and then hangs for up to 15 minutes, a few more chars, wait, and so on. Checked my resolv.conf; added 'options single-request-reopen' though I don't know if that is helping.
Yes it is weird; even more that individual commands sent with ssh gives immediate respons.
Thanks Patrick
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
On 28-01-15 at 17:51, anax wrote:
Hi Patrick have you ever tried to find out on which side the hanger is: on the client's or on the server's, using tcpumg or the like? That migth help a bit further on, that might.
suomi
I'm not sure what you mean by tcpumg. But after testing with a physical console I'm experiencing the same problem, so I guess it's the server.
Thanks
On Thu, Jan 29, 2015 at 2:34 AM, Patrick Bervoets <patrick.bervoets@psc-elsene.be> wrote:
I'm not sure what you mean with tcpumg.
But after testing with a physical console I'm experiencing the same problem. So I guess its the server.
Thanks
Probably meant tcpdump.
On 28-01-15 at 20:17, Gordon Messmer wrote:
On 01/28/2015 05:15 AM, Patrick Bervoets wrote:
When connecting with ssh the console is extremely slow and hangs for minutes at a time
Check for IP address conflicts in the server's network.
For IPv4: # arping -D -I <interface> <address>
ARPING 192.168.1.15 from 0.0.0.0 br0
Unicast reply from 192.168.1.15 [AC:16:2D:72:67:D4]  0.723ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
Thanks anyway Patrick
On 01/28/2015 12:12 PM, Patrick Bervoets wrote:
ARPING 192.168.1.15 from 0.0.0.0 br0
Unicast reply from 192.168.1.15 [AC:16:2D:72:67:D4]  0.723ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
Thanks anyway
I'm not sure what you mean by "thanks anyway".
You got a response. There's an IPv4 conflict on your network. That's why you're seeing those delays. If there's no conflict, you should see 0 responses.
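As a rough illustration of the check described above, the summary line of the arping output can be parsed to decide whether any reply came back. This is only a sketch: the function names arping_replies and dad_verdict are made up for this example, the -c 2 count and the br0 interface in the usage comment are placeholders, and the test is only meaningful when run on the machine that owns the address.

```shell
# Print N from the "Received N response(s)" summary line of arping output.
# Reads the arping output on stdin; prints nothing if no summary is found.
arping_replies() {
    sed -n 's/.*Received \([0-9][0-9]*\) response.*/\1/p' | head -n 1
}

# Turn the reply count into a verdict for a duplicate-address-detection run.
# Zero replies means nobody else answered for the address.
dad_verdict() {
    replies=$(arping_replies)
    if [ -z "$replies" ]; then
        echo "no arping summary found"
    elif [ "$replies" -eq 0 ]; then
        echo "no conflict detected"
    else
        echo "possible conflict: $replies reply(ies) received"
    fi
}

# Usage (as root, on the host that owns the address; values are placeholders):
#   arping -D -c 2 -I br0 192.168.1.15 | dad_verdict
```

Run against the output quoted earlier in the thread, this would report a possible conflict, which is only a correct reading if the probe was sent from the host that holds 192.168.1.15.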
On 29-01-15 at 00:00, Gordon Messmer wrote:
I'm not sure what you mean by "thanks anyway".
You got a response. There's an IPv4 conflict on your network. That's why you're seeing those delays. If there's no conflict, you should see 0 responses.
Gordon,
I'm sorry, I misunderstood you (and arping -D). This was the result of arping on another host; I thought I should see 2 responses in case of an IP conflict.
Arping on the troublesome server gives 0 responses.
I just tried with a physical console on that server and there I got the same unresponsive behaviour. Does this rule out network related problems?
Mark (m.roth) suggested that the VMs (2 VMs with an Oracle database) might be eating up the video bus. But I'm not sure how I could test that.
Patrick
On 01/28/2015 11:28 PM, Patrick Bervoets wrote:
Arping on the troublesome server gives 0 responses.
I just tried with a physical console on that server and there I got the same unresponsive behaviour.
Well, that's a different story, then. :)
I haven't seen delays anywhere near that long before, even with heavy swapping. But I guess I'd look at that sort of thing first.
Run "iostat -x 2" and see if your disks are being fully utilized during the pauses. Run "top" and see if there's anything useful there. Check swap use with "free". Try decreasing swappiness with "echo 10 >/proc/sys/vm/swappiness"
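For the swap part of those checks, a small helper can make the number easier to log repeatedly during a pause. The sketch below assumes the usual SwapTotal and SwapFree fields of /proc/meminfo; the function name swap_used_kb is invented for this example.

```shell
# Print swap currently in use, in kB, from /proc/meminfo-style input on stdin.
# Relies on the SwapTotal and SwapFree lines, both reported in kB.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { print total - free }'
}

# Usage on the affected host:
#   swap_used_kb < /proc/meminfo        # swap in use right now
#   cat /proc/sys/vm/swappiness         # current swappiness setting
```

Logging this value with a timestamp every few seconds would show whether swap use changes around a hang, which a single "free" call cannot.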
On 29-01-15 at 21:21, Gordon Messmer wrote:
I haven't seen delays anywhere near that long before, even with heavy swapping. But I guess I'd look at that sort of thing first.
Run "iostat -x 2" and see if your disks are being fully utilized during the pauses. Run "top" and see if there's anything useful there. Check swap use with "free". Try decreasing swappiness with "echo 10 >/proc/sys/vm/swappiness"
iostat random sample

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3,77    0,00    1,45    0,00    0,00   94,78

Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0,00    0,50   0,00  11,00    0,00  136,00    12,36     0,00   0,00   0,00   0,00
sdb        0,00    0,00   0,00  11,50    0,00  148,00    12,87     0,00   0,09   0,09   0,10
sdc        0,00    0,00   0,00   0,00    0,00    0,00     0,00     0,00   0,00   0,00   0,00
dm-0       0,00    0,00   0,00   4,00    0,00   32,00     8,00     0,00   0,00   0,00   0,00
dm-1       0,00    0,00   0,00   0,00    0,00    0,00     0,00     0,00   0,00   0,00   0,00
dm-2       0,00    0,00   0,00   0,00    0,00    0,00     0,00     0,00   0,00   0,00   0,00
dm-3       0,00    0,00   0,00  11,50    0,00  148,00    12,87     0,00   0,13   0,13   0,15
dm-4       0,00    0,00   0,00   0,00    0,00    0,00     0,00     0,00   0,00   0,00   0,00
dm-5       0,00    0,00   0,00   7,50    0,00  104,00    13,87     0,00   0,07   0,07   0,05
atop

ATOP - 2015/01/30 10:18:14 --------- 10s elapsed
PRC | sys 3.87s | user 14.93s | #proc 197 | #zombie 0 | #exit 0 |
CPU | sys 30% | user 119% | irq 1% | idle 533% | wait 0% |
cpu | sys 2% | user 21% | irq 0% | idle 56% | cpu000 w 0% |
cpu | sys 3% | user 19% | irq 0% | idle 59% | cpu001 w 0% |
cpu | sys 8% | user 15% | irq 0% | idle 62% | cpu003 w 0% |
cpu | sys 3% | user 13% | irq 0% | idle 73% | cpu002 w 0% |
cpu | sys 3% | user 14% | irq 0% | idle 70% | cpu006 w 0% |
cpu | sys 4% | user 15% | irq 0% | idle 66% | cpu005 w 0% |
cpu | sys 2% | user 11% | irq 0% | idle 77% | cpu007 w 0% |
cpu | sys 5% | user 11% | irq 0% | idle 73% | cpu004 w 0% |
CPL | avg1 1.92 | avg5 1.97 | avg15 1.61 | csw 229508 | intr 191786 |
MEM | tot 47.1G | free 15.9G | cache 519.3M | buff 109.3M | slab 353.3M |
SWP | tot 7.8G | free 7.3G | | vmcom 31.8G | vmlim 31.3G |
LVM | g_15k-lv_15k | busy 0% | read 1 | write 98 | avio 0.15 ms |
LVM | to-lv_oracle | busy 0% | read 0 | write 66 | avio 0.06 ms |
LVM | v_oracletest | busy 0% | read 0 | write 79 | avio 0.05 ms |
LVM | uito-lv_root | busy 0% | read 0 | write 1 | avio 3.00 ms |
DSK | sdb | busy 0% | read 1 | write 98 | avio 0.16 ms |
DSK | sda | busy 0% | read 0 | write 146 | avio 0.08 ms |
NET | transport | tcpi 12 | tcpo 12 | udpi 0 | udpo 0 |
NET | network | ipi 13 | ipo 12 | ipfrw 0 | deliv 12 |
NET | vnet0 8% | pcki 2273 | pcko 2581 | si 850 Kbps | so 458 Kbps |
NET | vnet1 4% | pcki 2186 | pcko 2075 | si 391 Kbps | so 422 Kbps |
NET | eth0 0% | pcki 1330 | pcko 1432 | si 159 Kbps | so 537 Kbps |
NET | br0 ---- | pcki 43 | pcko 22 | si 1 Kbps | so 4 Kbps |

  PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S  CPU CMD
 1960  2.37s  9.23s    0K    0K    8K 2520K --   - S 101% qemu-kvm
 1990  0.69s  5.65s    0K    0K    0K 1196K --   - S  55% qemu-kvm
 1975  0.50s  0.00s    0K    0K    0K    0K --   - S   4% kvm-pit-wq
 2009  0.20s  0.00s    0K    0K    0K    0K --   - S   2% kvm-pit-wq
23321  0.05s  0.02s    0K    0K    0K    0K --   - R   1% atop
18384  0.05s  0.01s    0K    0K    0K    0K --   - S   1% atop
 1719  0.00s  0.01s    0K    0K    0K    0K --   - S   0% hpasmlited
 1746  0.00s  0.01s    0K    0K    0K    0K --   - S   0% hp-asrd
   35  0.01s  0.00s    0K    0K    0K    0K --   - D   0% events/0
10707  0.00s  0.00s    0K    0K    0K    0K --   - S   0% arping
10740  0.00s  0.00s    0K    0K    0K    0K --   - S   0% arping
   58  0.00s  0.00s    0K    0K    0K    0K --   - S   0% kblockd/0
18425  0.00s  0.00s    0K    0K    0K    0K --   - S   0% flush-253:0
free

             total       used       free     shared    buffers     cached
Mem:         48218      31895      16323          0        108        519
-/+ buffers/cache:      31267      16951
Swap:         7951        476       7475
But I had the same pauses when free gave zero swap.
If swap is the problem: would it matter if a command is run with ssh (ssh @ "command") or in a shell?
When running atop in a shell I observed pauses between screen updates of longer than 10 seconds, but atop displayed the time as "10 seconds later", so its clock kept drifting behind, while a date command sent at the same time gave the correct date.
So it seems like the screens are buffered and are being displayed with a delay.
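One way to tell a stalled process from delayed display would be to log timestamps to a file from inside the hanging shell and look for gaps afterwards. The following is a minimal sketch of that idea, not something taken from the thread: the function names, the 2-second threshold, and the log path in the usage comment are all assumptions.

```shell
# Print a warning if two epoch timestamps are further apart than the limit.
# Arguments: previous timestamp, current timestamp, limit in seconds.
check_gap() {
    prev=$1; now=$2; limit=$3
    gap=$((now - prev))
    if [ "$gap" -gt "$limit" ]; then
        echo "stall: ${gap}s between samples"
    fi
}

# Sample the clock once per second for the given number of rounds
# (default 60) and report any gap longer than 2 seconds.
watch_stalls() {
    rounds=${1:-60}
    prev=$(date +%s)
    i=0
    while [ "$i" -lt "$rounds" ]; do
        sleep 1
        now=$(date +%s)
        check_gap "$prev" "$now" 2
        prev=$now
        i=$((i + 1))
    done
}

# Usage, from the interactive shell that hangs (log path is a placeholder):
#   watch_stalls 600 >> /tmp/stalls.log
```

If the log shows no stalls while the screen visibly freezes, the loop kept running and only the terminal output was held back; gaps in the log would instead point at the process itself being stopped.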
On 1/30/2015 1:21 AM, Patrick Bervoets wrote:
free

             total       used       free     shared    buffers     cached
Mem:         48218      31895      16323          0        108        519
-/+ buffers/cache:      31267      16951
Swap:         7951        476       7475
That's an unusually small amount of 'cached'... I usually see the disk cache at 30-50% of total memory. Does this system not use much disk I/O?
On 30-01-15 at 10:29, John R Pierce wrote:
thats an unusually small amount of 'cached'... I usually see the disk cache as 30-50% of the total memory. does this system not use much disk IO
It's a kvm-host with LVM; the VMs all have their own LVs (some on a different PV). Would that explain the small cache?
On 01/30/2015 01:21 AM, Patrick Bervoets wrote:
iostat random sample
"Random" is difficult to evaluate. Is that representative? Are sda, sdb, and sdc typically less than 1% utilized? Or are there large utilization values right after a hang?
If swap is the problem: would it matter if a command is run with ssh (ssh @ "command") or in a shell?
Let's assume it's not, but I would say "no" to the question. I'd expect the same delays regardless, if the system were swapping heavily.
When running atop in a shell I observed pauses between screen updates longer than 10 seconds but atop displayed the time as "10 seconds later". So drifting away in time. While a date command sent a the same time gave the correct date.
That's really weird.
Does the time displayed by "atop" eventually catch up?
Does the problem persist across reboots?
Is this system running ntpd?
Does the problem persist if you turn ntpd off and reboot?
On 30-01-15 at 19:40, Gordon Messmer wrote:
"Random" is difficult to evaluate. Is that representative? Are sda, sdb, and sdc typically less than 1% utilized? Or are there large utilization values right after a hang?
All the output was on the same scale and was captured during a hang in another shell.
Does the time displayed by "atop" eventually catch up?
Not that I know. But I gave up :-)
Does the problem persist across reboots?
Alas, one of the VMs is our production database. My next update/reboot window is next Saturday. I did have the problem just before the last reboot (mid-January), but I hadn't monitored it closely afterwards. Before that, in December, I never experienced it. But it's a server I tend to leave alone, so I'm never very busy in a shell on it.
Is this system running ntpd?
yes
Does the problem persist if you turn ntpd off and reboot?
I'll check that next week.
On 01/30/2015 12:32 PM, Patrick Bervoets wrote:
Before - in december - I never experienced it. But it's a server I tend do leave alone, so I'm never very busy on a shell.
Do you know what kernel you were running at the time? It might be useful to see if reverting to that revision changes the symptoms.
On 30-01-15 at 21:51, Gordon Messmer wrote:
Do you know what kernel you were running at the time? It might be useful to see if reverting to that revision changes the symptoms.
IIRC:
before the problem: kernel.x86_64 0:2.6.32-504.el6
problem occurred during: kernel.x86_64 0:2.6.32-504.1.3.el6
current: kernel.x86_64 0:2.6.32-504.3.3.el6
But since there is already a new kernel waiting, I'm not sure what to do. I think I'll first upgrade and test. If my maintenance window permits I'll test downgrading (but that's 3 updates...)
BTW, I've got 3 other kvm-servers without this behavior (but they are completely different machines, so there is not much to compare).
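Before deciding between upgrading and downgrading, it may help to see which kernel packages are still installed and which one is actually running. The sketch below assumes that "rpm -q kernel" prints one "kernel-<version>" line per installed package and that "uname -r" prints the matching version string; the function name mark_running_kernel is invented for this example.

```shell
# Read `rpm -q kernel` style lines on stdin and print each version,
# marking the one equal to the running release passed as $1.
mark_running_kernel() {
    running=$1
    while IFS= read -r pkg; do
        ver=${pkg#kernel-}          # strip the package-name prefix
        if [ "$ver" = "$running" ]; then
            echo "* $ver (running)"
        else
            echo "  $ver"
        fi
    done
}

# Usage on the affected host:
#   rpm -q kernel | mark_running_kernel "$(uname -r)"
```

If an older kernel such as 2.6.32-504.el6 still shows up in the list, it could be selected at the boot menu for a test without removing any package.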