Re: [CentOS-virt] Guests pausing suddenly

27 Apr 2012


On 04/26/2012 03:32 PM, Benjamin Franz wrote:
...
On 04/26/2012 02:29 AM, Peter Hopfgartner wrote:
...
The problem got slightly better when I upgraded all kernels, on host and
guest, so that the "MTBF" went from 3-4 days to approx. 50. Still, the
problem is not solved yet.
A maybe stupid question: if the kernel in the guest sees an I/O error on
sda, could this be a real error on the physical disk, even if there are
no notices in the physical host's log files, or is this more of a
software problem?
As the next step, I'll try to update the physical server's firmware.
Any suggestion on this topic is welcome, even more than before.
This could be caused by failing areas on the underlying disk
drive, particularly if you are using consumer-grade hard drives instead
of enterprise drives. The most relevant difference here is that
consumer-grade drives can try for up to a couple of minutes to read a bad
sector and might eventually succeed if the error isn't too egregious, while
an enterprise drive will just quickly report the sector as unreadable and
move on.
Hello Benjamin,
thanks for your reply.
Isn't it strange that the log entries are only on the guest VMs, not on
the physical server? I'm not able to answer this myself, due to my
inexperience with this topic. Can I call Dell support and
tell them to hand me 2 new disks, since it is reasonably clear that
one of the disks in the server is flawed?
Anyway, the machine is a Dell PowerEdge R410 server with a PERC H200
hardware RAID adapter and two 600 GB SAS disks in RAID 1.
The "twin" of this machine, which we purchased together with this one, does
not show the same behaviour. However, it has a lighter load.
...
I would install smartmontools on the physical server and check the SMART
status of the drive after running a 'long' test.
After some googling I've found how to do this with this RAID controller:
[root@xxx ~]# smartctl -a -T permissive /dev/sg1
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Device: SEAGATE  ST3600057SS      Version: ES64
Serial number: xxxxxxx
Device type: disk
Transport protocol: SAS
Local Time is: Fri Apr 27 08:52:50 2012 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK
Current Drive Temperature:     37 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
   Blocks sent to initiator = 1806551142
   Blocks received from initiator = 1325078948
   Blocks read from cache and sent to initiator = 281977973
   Number of read and write commands whose size <= segment size = 82709392
   Number of read and write commands whose size > segment size = 183965
Vendor (Seagate/Hitachi) factory information
   number of hours powered up = 5526.70
   number of minutes until next internal SMART test = 47
Error counter log:
            Errors Corrected by           Total   Correction     Gigabytes    Total
                ECC          rereads/    errors   algorithm      processed    uncorrected
            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   77855443        0         0  77855443   77855443       5989.204           0
write:         0        0         0         0          0      66665.246           0
verify: 35799949        0         0  35799949   35799949       3727.548           0
Non-medium error count:        3
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
      Description                              number   (hours)
# 1  Background long   Completed                  16       3                 - [-   -    -]
# 2  Background long   Completed                  16       1                 - [-   -    -]
# 3  Background short  Completed                  16       0                 - [-   -    -]
Long (extended) Self Test duration: 6400 seconds [106.7 minutes]
[root@xxx ~]# smartctl -a -T permissive /dev/sg2
smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Device: SEAGATE  ST3600057SS      Version: ES64
Serial number: xxxxxxx
Device type: disk
Transport protocol: SAS
Local Time is: Fri Apr 27 08:57:10 2012 CEST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK
Current Drive Temperature:     36 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
   Blocks sent to initiator = 2858579160
   Blocks received from initiator = 163698761
   Blocks read from cache and sent to initiator = 3391810210
   Number of read and write commands whose size <= segment size = 97415598
   Number of read and write commands whose size > segment size = 183976
Vendor (Seagate/Hitachi) factory information
   number of hours powered up = 5526.82
   number of minutes until next internal SMART test = 40
Error counter log:
            Errors Corrected by           Total   Correction     Gigabytes    Total
                ECC          rereads/    errors   algorithm      processed    uncorrected
            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   265118506        1         0  265118507   265118507      50649.094           0
write:         0        0         0         0          0      66071.078           0
verify: 19656379        0         0  19656379   19656379       3586.762           0
Non-medium error count:       22
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
      Description                              number   (hours)
# 1  Background long   Completed                  16       3                 - [-   -    -]
# 2  Background long   Completed                  16       1                 - [-   -    -]
# 3  Background short  Completed                  16       0                 - [-   -    -]
Long (extended) Self Test duration: 6400 seconds [106.7 minutes]
How do I interpret these numbers? To me, they look quite good.
Thanks,
Peter
-- 
Peter Hopfgartner
web  : http://www.r3-gis.com
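
For anyone who wants to compare such reports across disks or over time, here is a minimal Python sketch that pulls the counters out of `smartctl -a` output for a SAS disk. It is only an illustration under stated assumptions: the field names, the function names `parse_smartctl_scsi` and `worrying_signs`, and the rule of thumb "any uncorrected error or any grown defect deserves a closer look" are my own and do not come from smartmontools or from Dell. It also assumes each table row sits on a single line, which is how smartctl prints it before a mail client wraps it. The sample values are the ones reported above for /dev/sg2.

```python
import re

# Hypothetical labels for the seven numeric fields smartctl prints per
# row of the SCSI "Error counter log". They are not smartmontools names.
FIELDS = (
    "ecc_fast",
    "ecc_delayed",
    "rereads_rewrites",
    "total_corrected",
    "algorithm_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)

ROW = re.compile(r"^(read|write|verify):\s+(.*)$")
NON_MEDIUM = re.compile(r"^Non-medium error count:\s+(\d+)")
GROWN = re.compile(r"^Elements in grown defect list:\s+(\d+)")


def parse_smartctl_scsi(text):
    """Extract error counters from `smartctl -a` text for a SAS disk."""
    report = {"rows": {}, "non_medium": None, "grown_defects": None}
    for line in text.splitlines():
        line = line.strip()
        match = ROW.match(line)
        if match:
            values = match.group(2).split()
            # Skip rows that were wrapped or truncated in transit.
            if len(values) == len(FIELDS):
                row = {}
                for name, raw in zip(FIELDS, values):
                    is_float = name == "gigabytes_processed"
                    row[name] = float(raw) if is_float else int(raw)
                report["rows"][match.group(1)] = row
            continue
        match = NON_MEDIUM.match(line)
        if match:
            report["non_medium"] = int(match.group(1))
            continue
        match = GROWN.match(line)
        if match:
            report["grown_defects"] = int(match.group(1))
    return report


def worrying_signs(report):
    """Assumed rule of thumb, not a vendor threshold: flag any
    uncorrected error and any entry in the grown defect list."""
    signs = []
    for op, row in report["rows"].items():
        if row["total_uncorrected"] > 0:
            signs.append(f"{op}: {row['total_uncorrected']} uncorrected errors")
    if report["grown_defects"]:
        signs.append(f"{report['grown_defects']} elements in grown defect list")
    return signs


# Values copied from the /dev/sg2 report in this message.
SAMPLE = """\
Elements in grown defect list: 0
read:   265118506        1         0  265118507   265118507      50649.094           0
write:         0        0         0         0          0      66071.078           0
verify: 19656379        0         0  19656379   19656379       3586.762           0
Non-medium error count:       22
"""

if __name__ == "__main__":
    parsed = parse_smartctl_scsi(SAMPLE)
    print(parsed["non_medium"], worrying_signs(parsed))
```

The sketch deliberately does not flag the large ECC "fast" counts or the non-medium error count, because what level of either is acceptable is something I would ask the drive vendor rather than guess.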
