We have a CentOS 6.2 server running KVM. It hosts two virtual machines, both also running CentOS 6.2.
Regularly, one or both of the virtual machines switch to the "paused" state for no apparent reason. After resuming, I find messages like the following in the guest's /var/log/messages:
Feb 28 21:50:45 achernar fcoemon: Failed to connect to lldpad
Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] Unhandled error code
Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 06 db 70 78 00 00 38 00
Feb 29 08:23:56 achernar kernel: end_request: I/O error, dev sda, sector 115044472
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252047
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252048
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252049
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252050
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252051
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252052
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252053
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:57 achernar fcoemon: error 111 Connection refused
I could not find any relevant messages on the physical host, neither in /var/log/messages nor in /var/log/libvirt.
We have an almost identical server (same hardware, same software) that does not have this problem.
How can I better diagnose the cause of this problem?
Regards,
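One possible starting point for the diagnosis asked about above is to let libvirt report why it paused a guest. The following is a minimal sketch, assuming the guests are managed through libvirt and that the installed virsh supports `domstate --reason`. The function names are made up for illustration, and the domain names are whatever `virsh list` reports on the host.

```shell
#!/bin/sh
# Sketch only: assumes the guests are managed by libvirt and virsh is installed.

# Read `virsh list --all` output on stdin and print the names of paused domains.
paused_domains() {
    awk '$3 == "paused" { print $2 }'
}

# For every paused domain, ask libvirt for the reason it recorded.
report_paused() {
    virsh list --all | paused_domains | while read -r dom; do
        printf '%s: ' "$dom"
        virsh domstate "$dom" --reason
    done
}

# Usage (as root on the host): report_paused
```

If the reported reason relates to I/O, that would point at the storage path on the host rather than at the guest; the per-guest QEMU log under /var/log/libvirt/qemu/ may hold further detail.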
Hi Peter,
From your log messages it looks like the server has lost its connection to the storage device it uses. I suggest examining the connection between the server and the storage. CMIIW.
Rgds, Wahyu
-----Original Message-----
From: Peter Hopfgartner <peter.hopfgartner@r3-gis.com>
Sender: centos-virt-bounces@centos.org
Date: Wed, 29 Feb 2012 08:53:09
To: Discussion about the virtualization on CentOS <centos-virt@centos.org>
Reply-To: Discussion about the virtualization on CentOS <centos-virt@centos.org>
Subject: [CentOS-virt] Guests pausing suddenly
Hi Wahyu,
I guess that the FCoE-related warnings are not that important. I forgot to mention that the virtual machine images are stored locally on the physical server.
Thanks & Regards,
Peter
On 03/01/2012 01:02 AM, Wahyu Darmawan wrote:
Hi Peter,
From your log messages it looks like the server has lost its connection to the storage device it uses. I suggest examining the connection between the server and the storage. CMIIW.
Rgds, Wahyu
The problem improved somewhat after I upgraded all kernels, on host and guests: the "MTBF" went from 3-4 days to approx. 50. Still, the problem is not solved yet. A possibly naive question: if the kernel in the guest sees an I/O error on sda, could this be a real error on the physical disk even though nothing shows up in the physical host's log files, or is it more likely a software problem?
As the next step, I'll try updating the physical server's firmware.
Any suggestions on this topic are welcome, even more than before.
Regards,
Peter
On 04/26/2012 02:29 AM, Peter Hopfgartner wrote:
The problem improved somewhat after I upgraded all kernels, on host and guests: the "MTBF" went from 3-4 days to approx. 50. Still, the problem is not solved yet. A possibly naive question: if the kernel in the guest sees an I/O error on sda, could this be a real error on the physical disk even though nothing shows up in the physical host's log files, or is it more likely a software problem?
As the next step, I'll try updating the physical server's firmware.
Any suggestions on this topic are welcome, even more than before.
This could be caused by failing areas on the underlying disk drive, particularly if you are using consumer-grade hard drives instead of enterprise drives. The most relevant difference here is that a consumer-grade drive can keep retrying for up to a couple of minutes to read a bad sector, and might eventually succeed if the error isn't too egregious, while an enterprise drive will quickly report the sector as unreadable and move on.
I would install smartmontools on the physical server and check the SMART status of the drive after running a 'long' test.
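The check suggested above could look roughly like this. It is a sketch only: /dev/sdX is a placeholder device node, not one taken from this thread, and disks behind a RAID controller may need a different device node or a `-d` option. The function names are made up for illustration; `-t long`, `-l selftest` and `-H` are standard smartctl options.

```shell
#!/bin/sh
# Sketch only: the device node passed in is a placeholder, not from this thread.

# Start an extended ("long") self-test; it runs in the background on the drive.
start_long_test() {
    smartctl -t long "$1"
}

# Show the self-test log and the overall health once the test has finished.
show_results() {
    smartctl -l selftest "$1"
    smartctl -H "$1"
}

# Pull the health verdict out of smartctl output read from stdin
# (SCSI/SAS drives report a line of the form "SMART Health Status: OK").
health_status() {
    sed -n 's/^SMART Health Status:[[:space:]]*//p'
}

# Usage (as root), e.g.:
#   start_long_test /dev/sdX
#   ...wait for the test duration that smartctl reports...
#   show_results /dev/sdX
#   smartctl -H /dev/sdX | health_status
```

The helper `health_status` only covers the SCSI/SAS output format; ATA drives word the health line differently.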
On 04/26/2012 03:32 PM, Benjamin Franz wrote:
This could be caused by failing areas on the underlying disk drive, particularly if you are using consumer-grade hard drives instead of enterprise drives. The most relevant difference here is that a consumer-grade drive can keep retrying for up to a couple of minutes to read a bad sector, and might eventually succeed if the error isn't too egregious, while an enterprise drive will quickly report the sector as unreadable and move on.
Hello Benjamin,
Thanks for your reply.
Isn't it strange that the log entries appear only on the guest VMs and not on the physical server? I can't answer this myself, due to my inexperience with this topic. Can I call Dell support and ask them to send me two new disks, on the grounds that it is reasonably clear that one of the disks in the server is faulty?
For reference, the machine is a Dell PowerEdge R410 with a PERC H200 hardware RAID adapter and two 600 GB SAS disks in RAID 1. The "twin" of this machine, which we purchased together with this one, does not show the same behaviour. It does, however, carry a lighter load.
I would install smartmontools on the physical server and check the SMART status of the drive after running a 'long' test.
After some googling I've found how to do this with this RAID controller:
[root@xxx ~]# smartctl -a -T permissive /dev/sg1
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE ST3600057SS Version: ES64
Serial number: xxxxxxx
Device type: disk
Transport protocol: SAS
Local Time is: Fri Apr 27 08:52:50 2012 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     37 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 1806551142
  Blocks received from initiator = 1325078948
  Blocks read from cache and sent to initiator = 281977973
  Number of read and write commands whose size <= segment size = 82709392
  Number of read and write commands whose size > segment size = 183965
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 5526.70
  number of minutes until next internal SMART test = 47

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   77855443        0         0  77855443   77855443       5989.204           0
write:         0        0         0         0          0      66665.246           0
verify: 35799949        0         0  35799949   35799949       3727.548           0

Non-medium error count:        3

SMART Self-test log
Num  Test              Status       segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                    number   (hours)
# 1  Background long   Completed        16       3                 - [-   -    -]
# 2  Background long   Completed        16       1                 - [-   -    -]
# 3  Background short  Completed        16       0                 - [-   -    -]

Long (extended) Self Test duration: 6400 seconds [106.7 minutes]
[root@xxx ~]# smartctl -a -T permissive /dev/sg2
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: SEAGATE ST3600057SS Version: ES64
Serial number: xxxxxxx
Device type: disk
Transport protocol: SAS
Local Time is: Fri Apr 27 08:57:10 2012 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     36 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 2858579160
  Blocks received from initiator = 163698761
  Blocks read from cache and sent to initiator = 3391810210
  Number of read and write commands whose size <= segment size = 97415598
  Number of read and write commands whose size > segment size = 183976
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 5526.82
  number of minutes until next internal SMART test = 40

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   265118506       1         0  265118507  265118507      50649.094           0
write:          0       0         0          0          0      66071.078           0
verify:  19656379       0         0   19656379   19656379       3586.762           0

Non-medium error count:       22

SMART Self-test log
Num  Test              Status       segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                    number   (hours)
# 1  Background long   Completed        16       3                 - [-   -    -]
# 2  Background long   Completed        16       1                 - [-   -    -]
# 3  Background short  Completed        16       0                 - [-   -    -]

Long (extended) Self Test duration: 6400 seconds [106.7 minutes]
How do I interpret these numbers? To me, they look quite good.
Thanks,
Peter
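When reading smartctl output for SAS drives like the above, the fields usually watched first are the "Total uncorrected errors" column of the error counter log and the grown defect list, since counts of ECC-corrected errors are commonly large on healthy drives. A minimal sketch for pulling those two values out of saved smartctl output follows; the function names are made up for illustration, and it assumes each table row sits on its own line, as smartctl prints it.

```shell
#!/bin/sh
# Print "<operation> <total uncorrected errors>" for each row of the
# "Error counter log" table in smartctl output read from stdin.
# The uncorrected-error count is the last column of each row.
uncorrected_errors() {
    awk '$1 == "read:" || $1 == "write:" || $1 == "verify:" {
        sub(":", "", $1); print $1, $NF
    }'
}

# Print the value of the "Elements in grown defect list" line.
grown_defects() {
    sed -n 's/^Elements in grown defect list:[[:space:]]*//p'
}

# Usage (as root), e.g.:
#   smartctl -a -T permissive /dev/sg1 | uncorrected_errors
#   smartctl -a -T permissive /dev/sg1 | grown_defects
```

Non-zero values in either field would be a stronger argument for a disk replacement than the corrected-error counts alone.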