[CentOS-virt] Guests pausing suddenly

Thu Apr 26 09:29:17 UTC 2012
Peter Hopfgartner <peter.hopfgartner at r3-gis.com>

The problem improved somewhat when I upgraded all kernels, on host and 
guests: the "MTBF" went from 3-4 days to about 50 days. Still, the 
problem is not solved.
A possibly naive question: if the kernel in the guest sees an I/O error on 
sda, could this be a real error on the physical disk, even though there 
is nothing in the physical host's log files, or is it more likely a 
software problem?
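
As a sanity check on the quoted log below, the Write(10) CDB can be decoded
by hand. This is only a minimal Python sketch: the function name
decode_write10 is my own, and the 512-byte sector size and 4 KiB filesystem
block size are assumptions, not something the log states.

```python
def decode_write10(cdb_hex: str) -> dict:
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return {"lba": lba, "blocks": blocks}


info = decode_write10("2a 00 06 db 70 78 00 00 38 00")
print(info)  # {'lba': 115044472, 'blocks': 56}

# Assumed sizes: 512-byte sectors, 4 KiB filesystem blocks.
print(info["blocks"] * 512 // 4096)  # 7
```

The decoded address matches the "sector 115044472" in the end_request line,
and 56 sectors correspond to the seven consecutive dm-0 blocks reported as
lost. So the messages describe a single 28 KiB write that timed out. That
alone does not tell whether the timeout happened on the physical disk or
somewhere in the virtualization layer.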

As the next step, I'll try to update the physical server's firmware.

Any suggestions on this topic are welcome, even more than before.

Regards,

Peter

On 02/29/2012 08:53 AM, Peter Hopfgartner wrote:
> We have a CentOS 6.2 server with KVM. That server hosts 2 virtual
> machines, both running CentOS 6.2, too.
>
> Regularly, one or both of the virtual machines go into the "paused"
> state without apparent reason.
> On resume, I get messages like the following in /var/log/messages.
>
> Feb 28 21:50:45 achernar fcoemon: Failed to connect to lldpad
> Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] Unhandled error code
> Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
> Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 06 db 70 78 00 00 38 00
> Feb 29 08:23:56 achernar kernel: end_request: I/O error, dev sda, sector 115044472
> Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252047
> Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
> Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252048
> Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
> Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252049
> Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
> Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252050
> Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
> Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252051
> Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
> Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252052
> Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
> Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252053
> Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
> Feb 29 08:23:57 achernar fcoemon: error 111 Connection refused
>
>
> I could not find any relevant message on the physical host, neither in
> /var/log/messages nor in /var/log/libvirt.
>
> We do have an almost identical server, same hardware, same software
> which does not have this problem.
>
> How could I proceed to better diagnose the cause of the troubles?
>
> Regards,
>


-- 
Peter Hopfgartner
web  : http://www.r3-gis.com