[CentOS] Instability with later 4.x kernels?

Thu Jun 4 04:54:36 UTC 2009
Benjamin Smith <lists at benjamindsmith.com>

I have an Athlon box with about 10 HDDs plugged in, used primarily for 
disk-to-disk backups. Some drives are PATA, some are SATA, some are USB. A 
strange concoction, but it's been relatively stable for some 4-5 years, despite 
numerous upgrades and so on. It has been running CentOS 4 for years. 

Recently it has become unstable, and after 2 weeks of swapping hardware I 
found that booting an earlier kernel restores its stability! 

It takes a few days before anything "goes south", so debugging is very, very 
slow. When it does, I get random read errors: either SCSI errors or (a few 
times) hda read errors. 

Once the read errors begin, the system becomes very unresponsive, and often 
won't restart, even after I wait for hours, unless I hit the "kill switch". 

# uname -a 
Linux backuphost 2.6.9-67.0.22.EL #1 Wed Jul 23 17:17:45 EDT 2008 i686 athlon i386 GNU/Linux

The failures occur on all /dev/sd* devices, even those that are USB. Once, 
/dev/hdc had a similar problem after /dev/sdb had failed. I don't know whether 
the mapping below helps: 

/dev/hda - PATA, on motherboard, 20 GB
/dev/hdb - IDE CDROM
/dev/hdc - IDE, on motherboard, 500 GB
/dev/hdd - IDE, on motherboard, 300 GB
/dev/hde - IDE, on PCI card, 500 GB
/dev/sda - SATA, on a PCI card, 1 TB
/dev/sdb - SATA, on a PCI card, 1 TB
/dev/sdc - USB, on a USB 2.0 PCI card, 750 GB
/dev/sde - USB, on a USB 2.0 PCI card, 750 GB
/dev/sdf - USB, on a USB 2.0 PCI card, 1 TB


Here's what I see in /var/log/messages:

May 27 05:08:42 hume ntpd[4844]: kernel time sync enabled 0001
May 27 08:01:01 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 08:01:01 hume kernel: end_request: I/O error, dev sda, sector 12847
May 27 08:01:01 hume kernel: EXT3-fs error (device sda1): ext3_find_entry: reading directory #2 offset 0
May 27 08:01:01 hume kernel:
May 27 08:14:27 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 08:14:27 hume kernel: end_request: I/O error, dev sda, sector 12847
May 27 08:14:27 hume kernel: EXT3-fs error (device sda1): ext3_find_entry: reading directory #2 offset 0
May 27 08:14:27 hume kernel:
May 27 10:28:30 hume ntpd[4844]: synchronized to 63.240.161.99, stratum 2
May 27 11:48:07 hume sshd(pam_unix)[26873]: session opened for user root by (uid=0)
May 27 11:48:10 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:10 hume kernel: end_request: I/O error, dev sda, sector 12847
May 27 11:48:10 hume kernel: EXT3-fs error (device sda1): ext3_find_entry: reading directory #2 offset 0
May 27 11:48:10 hume kernel:
May 27 11:48:16 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:16 hume kernel: end_request: I/O error, dev sda, sector 12847
May 27 11:48:16 hume kernel: EXT3-fs error (device sda1): ext3_readdir: directory #2 contains a hole at offset 0
May 27 11:48:23 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:23 hume kernel: end_request: I/O error, dev sda, sector 12847
May 27 11:48:23 hume kernel: EXT3-fs error (device sda1): ext3_readdir: directory #2 contains a hole at offset 0
May 27 11:48:24 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:24 hume kernel: end_request: I/O error, dev sda, sector 12847
May 27 11:48:24 hume kernel: EXT3-fs error (device sda1): ext3_readdir: directory #2 contains a hole at offset 0
May 27 11:48:38 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:38 hume kernel: end_request: I/O error, dev sda, sector 0
May 27 11:48:38 hume kernel: Buffer I/O error on device sda, logical block 0
May 27 11:48:38 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:38 hume kernel: end_request: I/O error, dev sda, sector 8
May 27 11:48:38 hume kernel: Buffer I/O error on device sda, logical block 1
May 27 11:48:38 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:38 hume kernel: end_request: I/O error, dev sda, sector 16
May 27 11:48:38 hume kernel: Buffer I/O error on device sda, logical block 2
May 27 11:48:38 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:38 hume kernel: end_request: I/O error, dev sda, sector 24
May 27 11:48:38 hume kernel: Buffer I/O error on device sda, logical block 3
May 27 11:48:38 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:38 hume kernel: end_request: I/O error, dev sda, sector 32
May 27 11:48:38 hume kernel: Buffer I/O error on device sda, logical block 4
May 27 11:48:38 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:38 hume kernel: end_request: I/O error, dev sda, sector 40
May 27 11:48:38 hume kernel: Buffer I/O error on device sda, logical block 5
May 27 11:48:38 hume kernel: SCSI error : <0 0 0 0> return code = 0x40000
May 27 11:48:38 hume kernel: end_request: I/O error, dev sda, sector 48
May 27 11:48:38 hume kernel: Buffer I/O error on device sda, logical block 6
.. MANY MEGABYTES OF STUFF LIKE THIS .. 
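
With this much output, it may help to reduce the log to a summary. Below is a 
rough Python sketch, under stated assumptions, that splits each SCSI "return 
code" into its four bytes and tallies I/O errors per device. It assumes the 
usual Linux SCSI result layout (driver byte, host byte, message byte, status 
byte, from high to low). The host-byte names are meant to mirror the kernel's 
include/scsi/scsi.h and should be checked against the source of the running 
kernel. The names decode_scsi_result and summarize are hypothetical, chosen 
only for this illustration.

```python
import re
from collections import Counter

# Host-byte names, assumed to match include/scsi/scsi.h in the kernel source.
HOST_BYTE = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}

# Patterns for the two kernel message types seen in the excerpt.
SCSI_RE = re.compile(r"SCSI error : <[\d ]+> return code = (0x[0-9a-fA-F]+)")
IO_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def decode_scsi_result(code):
    """Split a 32-bit SCSI result word into its four component bytes."""
    return {
        "driver": (code >> 24) & 0xFF,
        "host": (code >> 16) & 0xFF,
        "msg": (code >> 8) & 0xFF,
        "status": code & 0xFF,
    }


def summarize(lines):
    """Count SCSI return codes and I/O errors per device in syslog lines."""
    codes = Counter()
    devices = Counter()
    for line in lines:
        match = SCSI_RE.search(line)
        if match:
            codes[int(match.group(1), 16)] += 1
            continue
        match = IO_RE.search(line)
        if match:
            devices[match.group(1)] += 1
    return codes, devices
```

Calling summarize(open("/var/log/messages")) returns two counters, one keyed 
by return code and one by device name. For the code in the excerpt, 
decode_scsi_result(0x40000) gives a host byte of 0x04 (DID_BAD_TARGET in the 
table above) with the driver, message, and status bytes all zero.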
