On 04/26/2012 02:29 AM, Peter Hopfgartner wrote:
The problem got slightly better when I upgraded all kernels, on host and guest, so that the "MTBF" went from 3-4 days to approx. 50. Still, the problem is not solved yet. A possibly stupid question: if the kernel in the guest sees an I/O error on sda, could this be a real error on the physical disk, even if there are no notices in the physical host's log files, or is this more of a software problem?
As the next step, I'll try to update the physical server's firmware.
Any suggestion on this topic is welcome, even more than before.
This could be caused by failing areas on the underlying disk drive, particularly if you are using consumer-grade hard drives instead of enterprise drives. The most relevant difference here is that a consumer-grade drive can keep trying for up to a couple of minutes to read a bad sector, and might eventually succeed if the error isn't too egregious, while an enterprise drive will quickly report the sector as unreadable and move on.
I would install smartmontools on the physical server and check the SMART status of the drive after running a 'long' test.
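For what it's worth, here is a rough sketch of how that check could be scripted. With smartmontools, `smartctl -t long /dev/sda` starts the long self-test, `smartctl -l selftest /dev/sda` shows the result once it has finished, and `smartctl -A /dev/sda` prints the attribute table. The script below only parses that attribute table and flags nonzero sector counters. The device name `/dev/sda`, the helper names, and the choice of attributes are my assumptions; not every drive reports these attribute names, so treat an empty result as "nothing obvious" rather than "healthy".

```python
import subprocess

# Attributes that most directly point at failing sectors, as named in
# `smartctl -A` output on many ATA drives (assumption: your drive uses them).
SECTOR_ATTRIBUTES = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_attributes(text):
    """Map attribute name -> raw value string from `smartctl -A` table output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9]
    return attrs


def suspicious_attributes(attrs):
    """Return the sector-related attributes whose raw value is a nonzero integer."""
    found = {}
    for name in SECTOR_ATTRIBUTES:
        raw = attrs.get(name, "0")
        if raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


def check_disk(device="/dev/sda"):
    """Run `smartctl -A` on the device (needs root) and report suspicious attributes."""
    # smartctl uses its exit status as a bitmask, so a nonzero status is not
    # treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return suspicious_attributes(parse_attributes(result.stdout))
```

Running `print(check_disk("/dev/sda"))` as root on the physical host, after the long test has completed, would list any of the three counters that are above zero. A nonzero pending or reallocated sector count would fit the failing-disk explanation, even if the host's logs are clean.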