[CentOS] Race condition with mdadm at boot [still mystifying]

Fri Mar 11 03:25:22 UTC 2011
Chuck Munro <chuckm at seafoam.net>

This is a bit long-winded, but I wanted to share some info ....

Regarding my earlier message about a possible race condition with mdadm, 
I have been doing all sorts of poking around with the boot process. 
Thanks to a tip from Steven Yellin at Stanford, I found where to add a 
delay in the rc.sysinit script, which invokes mdadm to assemble the arrays.

Unfortunately it didn't help, so it likely wasn't a race condition after 
all.

However, on close examination of dmesg, I found something very 
interesting.  There were missing 'bind<sd??>' statements for one or the 
other hot spare drive (or sometimes both).  These drives are connected 
to the last PHYs in each SATA controller ... in other words they are the 
last devices probed by the driver for a particular controller.  It would 
appear that the drivers are bailing out before managing to enumerate all 
of the partitions on the last drive in a group, and missing partitions 
occur quite randomly.
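
A check like the one described above can be scripted. This is only a minimal sketch, assuming the md driver logs each member it picks up as a line containing 'bind<NAME>' (the form quoted above). The function names and the member list in the usage note are hypothetical, not taken from the actual system.

```python
import re

# Matches the device name inside an md "bind<...>" kernel message,
# e.g. "md: bind<sdb1>". Any timestamp prefix on the line is ignored.
BIND_RE = re.compile(r"bind<(\w+)>")


def bound_members(dmesg_text):
    """Return the set of device names that appear in bind<...> lines."""
    return set(BIND_RE.findall(dmesg_text))


def missing_members(dmesg_text, expected):
    """Return the expected members that never appear in a bind<...> line."""
    return sorted(set(expected) - bound_members(dmesg_text))
```

To use it, pass in the saved output of dmesg along with the list of partitions the arrays should contain, for example missing_members(log_text, ["sda1", "sdb1", "sdh1"]) with placeholder names. Comparing the result across several boots would show whether the missing members are always on the last PHY of a controller.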

So it may or may not be a timing issue between the WD Caviar Black 
drives and both the LSI and Marvell SAS/SATA controller chips.

So, I replaced the two drives (SATA-300) with two faster drives 
(SATA-600) on the off chance they might respond fast enough before the 
drivers move on to other duties.  That didn't help either.

Each group of arrays uses a completely different driver (mptsas and 
sata_mv), but both exhibit the same problem, so I'm mystified as to 
where the real issue lies.  Anyone care to offer suggestions?

Chuck