[CentOS] Software RAID1 with CentOS-6.2

Wed Feb 29 00:27:53 UTC 2012
Kahlil Hodgson <kahlil.hodgson at dealmax.com.au>

Hello,

Having a problem with software RAID that is driving me crazy.

Here are the details:

1. CentOS 6.2 x86_64 install from the minimal iso (via pxeboot).
2. Reasonably good PC hardware (i.e. not budget, but not server grade either)
with a pair of 1TB Western Digital SATA3 drives.
3. Drives are plugged into the SATA3 ports on the mainboard (both drives and
cables say they can do 6Gb/s).
4. During the install I set up software RAID1 for the two drives with two RAID
partitions:
    md0 - 500M for /boot 
    md1 - "the rest" for a physical volume 
5. Set up LVM on md1 with the standard slash, swap, and home layout (a rough
check of the resulting layout is sketched below).
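
For reference, here's roughly what I'd run to sanity-check that layout after the
install (device names as above; output paraphrased from memory):

    cat /proc/mdstat     # both arrays present, members showing [UU]
    mdadm -D /dev/md0    # ~500M RAID1 (sda1 + sdb1) for /boot
    mdadm -D /dev/md1    # RAID1 (sda2 + sdb2) used as the LVM physical volume
    pvs && vgs && lvs    # one PV on md1; LVs for slash, swap and home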

Install goes fine (actually really fast) and I reboot into CentOS 6.2.  Next I
run yum update, add a few minor packages, and perform some basic
configuration.

Now I start to get I/O errors printed on the console.  I run 'mdadm -D
/dev/md1' and see that the array is degraded and /dev/sdb2 has been marked as
faulty.
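
In case it matters, this is roughly how I'm checking the state when the errors
appear (I haven't kept the exact output, unfortunately):

    cat /proc/mdstat    # md1 shows up degraded, e.g. [U_]
    mdadm -D /dev/md1   # State : degraded, with /dev/sdb2 flagged as faulty
    dmesg | tail -n 50  # the I/O errors that hit the console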

Okay, fair enough, I've got at least one bad drive.  I boot the system from a
live USB and run the short and long SMART tests on both drives.  No problems
are reported, but I know that can be misleading, so I'm going to have to gather
some evidence before I try to return these drives.  I run badblocks in
destructive mode on both drives as follows:

    badblocks -w -b 4096 -c 98304 -s /dev/sda
    badblocks -w -b 4096 -c 98304 -s /dev/sdb

I come back the next day and see that no errors are reported.  Er, that's odd.
I check the SMART data in case the badblocks activity has triggered something.
Nope.  Maybe I screwed up the install somehow?
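
For the record, the SMART checks were along these lines (smartmontools, repeated
for /dev/sdb):

    smartctl -t short /dev/sda     # quick self-test (a couple of minutes)
    smartctl -t long /dev/sda      # extended self-test (several hours)
    smartctl -l selftest /dev/sda  # self-test results once they finish
    smartctl -a /dev/sda           # full attribute dump and error log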

So I start again and repeat the install process very carefully.  This time I
check the RAID arrays straight after the first boot.

    mdadm -D /dev/md0   -   all is fine.
    mdadm -D /dev/md1   -   the two drives are resyncing.
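
To watch the resync progress I'm just polling /proc/mdstat, something like:

    watch -n 30 cat /proc/mdstat   # shows percent complete, speed and ETA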

Okay, that is odd.  The RAID1 array was created at the start of the install
process, before any software was installed.  Surely it should be in sync
already?  I googled a bit and found a post where someone else had seen the same
thing happen.  The advice was to just wait until the drives sync so the 'blocks
match exactly', but I'm not really happy with that explanation.  At this rate
it's going to take a whole day to do a single minimal install, and I'm sure I
would have heard others complaining if that were the normal process.
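
If the initial resync is simply slow, I gather the kernel's rebuild speed limits
can be raised, something along these lines (the values here are just a guess on
my part):

    sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max   # current limits
    sysctl -w dev.raid.speed_limit_min=50000                   # KB/s per device
    sysctl -w dev.raid.speed_limit_max=200000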

Anyway, I leave the system to sync for the rest of the day.  When I get back to
it I see the same (well, similar) I/O errors on the console, and mdadm shows the
RAID array is degraded with /dev/sdb2 marked as faulty.  This time I notice
that the I/O errors all refer to /dev/sda.  I have to reboot because the
filesystem is now read-only.  When the system comes back up, it's trying to
resync the drives again.  Eh?
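
Next time it happens I plan to grab the kernel messages and the SMART error
counters straight away, roughly:

    dmesg | grep -iE 'ata[0-9]|sd[ab]|md1'    # libata / block layer errors
    smartctl -a /dev/sda | grep -iE 'error|pending|realloc|crc'
    smartctl -a /dev/sdb | grep -iE 'error|pending|realloc|crc'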

Any ideas what is going on here?  If it's bad drives, I really need some
confirmation independent of the software RAID failing; I thought SMART or
badblocks would give me that.  Perhaps it has nothing to do with the drives.
Could a problem with the mainboard or the memory cause this issue?  Is it a
SATA3 issue?  Should I try the drives on the 3Gb/s ports, since there's
probably little speed difference with non-SSD drives?
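
For what it's worth, I believe the negotiated link speed shows up in the kernel
log, and libata can apparently be told to cap it, so dropping to 3Gb/s might be
testable without even moving cables (the parameter syntax is from memory, so
treat it as a guess):

    dmesg | grep -i 'SATA link'    # e.g. "SATA link up 6.0 Gbps"
    # to cap the link speed, add to the kernel command line:
    #   libata.force=3.0Gbps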

Cheers,

Kal