[CentOS] Disk error

Tue Jan 8 00:31:04 UTC 2013
Emmett Culley <emmett at webengineer.com>

On 01/07/2013 03:43 PM, Mark LaPierre wrote:
> On 01/07/2013 06:24 PM, Brian Mathis wrote:
>> On Mon, Jan 7, 2013 at 5:58 PM, Emmett Culley<emmett at webengineer.com>  wrote:
>>> For some time I have been seeing disk errors in the syslog every seven days.  Until today it always happens Sunday morning at 8:13 AM, plus or minus a minute or two.  Yesterday it happened at 1:13 AM.  Here are the pertinent log entries for the latest occurrence:
>> [...]
>>> Jan  6 01:13:25 g2 kernel:         res 51/40:00:db:bf:d6/40:00:04:00:00/00 Emask 0x9 (media error)
>> [...]
>>> Jan  6 01:13:25 g2 kernel: sd 8:0:0:0: [sdg] Add. Sense: Unrecovered read error - auto reallocate failed
>> [...]
>>> There is nothing in /etc/cron.weekly, nor are there any root crontab entries.  Any suggestions for investigating this issue would be much appreciated.
>>>
>>> Emmett
>>>
>>
>> Based on this I'd say your disk is going bad, and has run out of spare sectors:
>>       Jan  6 01:13:25 g2 kernel: sd 8:0:0:0: [sdg] Add. Sense:
>>       Unrecovered read error - auto reallocate failed
>>
>> You can use smartctl to get some information from the SMART tables,
>> but I've never been able to get a conclusive test out of the testing
>> options.  It would be a good idea to run 'badblocks' against the drive
>> as well, as it will definitely tell you if there are bad sectors.
>>
>> Disks are so cheap it's usually not worth too much effort or delay
>> once you've found out that it's bad.
>>
>>
>> ❧ Brian Mathis
> How do you explain the regular timing of the errors?  Is there a
> process, maybe a backup or something, that runs at this time every
> Sunday morning Mr. Mathis?
>
>
I just looked at the backup process and noticed that an incremental backup started at 1:00 AM.  However, none of the other backups listed for this machine correlate in any way with the times at which the disk errors are reported.

As this is a host for multiple VMs it might be a good idea to look on each VM for cron jobs running at the time of the disk errors. I'll look there next.
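
To make that comparison with the cron schedules easier, the error timestamps can be tallied by weekday and minute. What follows is only a minimal sketch, not something I have run against the real logs: the function name error_slots is made up for illustration, the year has to be supplied by hand because syslog lines carry none, and matching on the string "media error" is an assumption based on the kernel line quoted above.

```python
import re
from collections import Counter
from datetime import datetime

# Fixed name tables keep the result independent of the system locale.
MONTHS = {name: i for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Timestamp prefix of a classic syslog line, e.g. "Jan  6 01:13:25 g2 kernel: ..."
STAMP = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s")


def error_slots(lines, year, needle="media error"):
    """Count lines containing `needle` per (weekday, "HH:MM") slot.

    `year` is an assumption supplied by the caller, since syslog omits it.
    """
    slots = Counter()
    for line in lines:
        if needle not in line:
            continue
        m = STAMP.match(line)
        if not m or m.group(1) not in MONTHS:
            continue  # not a recognisable syslog timestamp
        month, day, hh, mm, ss = m.groups()
        when = datetime(year, MONTHS[month], int(day), int(hh), int(mm), int(ss))
        slots[(DAYS[when.weekday()], f"{hh}:{mm}")] += 1
    return slots


if __name__ == "__main__":
    # The one kernel line quoted earlier in this thread.
    sample = [
        "Jan  6 01:13:25 g2 kernel:         res 51/40:00:db:bf:d6/40:00:04:00:00/00 Emask 0x9 (media error)",
    ]
    for (day, minute), count in sorted(error_slots(sample, 2013).items()):
        print(day, minute, count)
```

Fed the full syslog, a single dominant slot would point at a scheduled job, and that slot is the time to look for in each VM's crontab.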

The drive that the error reports concern is part of the mdadm array mounted as /boot, so I was able to unmount it, stop the RAID, and run badblocks via e2fsck.  That reports:

Checking for bad blocks (read-only test): done
/dev/sdg1: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sdg1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdg1: 67/128016 files (7.5% non-contiguous), 165468/511988 blocks
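
For anyone wanting to repeat the check, the sequence described above can be sketched as a dry run that only prints the commands. This is an illustration under assumptions, not a record of what was typed: /dev/md0 is a hypothetical array name (the real one is not given here), the helper name badblocks_check_plan is made up, and the -c option is inferred from the "read-only test" line in the output.

```python
import shlex


def badblocks_check_plan(mount_point, md_device, member_partition):
    """Return the commands, in order, as argv lists. Nothing is executed."""
    return [
        # The filesystem must not be mounted while it is checked.
        ["umount", mount_point],
        # Stop the array so the member partition is no longer in use.
        ["mdadm", "--stop", md_device],
        # -c makes e2fsck run badblocks for a read-only scan and record
        # any bad blocks it finds in the bad block inode.
        ["e2fsck", "-c", member_partition],
    ]


if __name__ == "__main__":
    # "/dev/md0" is a placeholder; substitute the real array device.
    for argv in badblocks_check_plan("/boot", "/dev/md0", "/dev/sdg1"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing rather than executing is deliberate, since stopping the wrong array or checking a mounted filesystem can do damage.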

So I'll wait to see if it happens again next Sunday.

Emmett

