The problem got somewhat better when I upgraded all kernels, on host and guests, so that the "MTBF" went from 3-4 days to approx. 50 days. Still, the problem is not solved yet. A maybe stupid question: if the kernel in the guest sees an I/O error on sda, could this be a real error on the physical disk, even if there are no messages in the physical host's log files, or is this more likely a software problem?
As the next step, I'll try to update the physical server's firmware.
Any suggestion on this topic is welcome, even more than before.
Regards,
Peter
On 02/29/2012 08:53 AM, Peter Hopfgartner wrote:
We have a CentOS 6.2 server with KVM. That server hosts 2 virtual machines, both running CentOS 6.2, too.
Regularly, one or both of the virtual machines go into the "paused" state without apparent reason. On resume, I get messages like the following in /var/log/messages:
Feb 28 21:50:45 achernar fcoemon: Failed to connect to lldpad
Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] Unhandled error code
Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 06 db 70 78 00 00 38 00
Feb 29 08:23:56 achernar kernel: end_request: I/O error, dev sda, sector 115044472
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252047
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252048
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252049
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252050
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252051
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252052
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252053
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:57 achernar fcoemon: error 111 Connection refused
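As a sanity check on the log above, the CDB bytes of the failed Write(10) can be decoded by hand. The following is only a minimal sketch, assuming the standard SCSI WRITE(10) layout (opcode 0x2a, 4-byte big-endian LBA in bytes 2-5, 2-byte transfer length in bytes 7-8), 512-byte sectors and 4 KiB pages in the guest; the function name decode_write10 is made up for illustration.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as a hex string, e.g. from a kernel log.

    Returns (lba, blocks): the starting logical block address and the
    number of blocks the command tried to write.
    """
    cdb = bytes.fromhex(cdb_hex)  # spaces between the hex bytes are skipped
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: starting LBA
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


if __name__ == "__main__":
    lba, blocks = decode_write10("2a 00 06 db 70 78 00 00 38 00")
    # Assumption: 512-byte sectors and 4 KiB pages inside the guest.
    pages = blocks * 512 // 4096
    print("start sector:", lba)
    print("blocks:", blocks, "-> 4 KiB pages:", pages)
```

Under these assumptions the decoded start sector is the same 115044472 that end_request reports, and the 56-sector write covers 7 pages, which matches the 7 "lost page write" lines. So the whole burst of messages seems to come from a single write request that timed out, not from several independent errors.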
I could not find any sensible message on the physical host, neither in /var/log/messages nor in /var/log/libvirt.
We do have an almost identical server (same hardware, same software) that does not have this problem.
How could I proceed to better diagnose the cause of the troubles?
Regards,