On 02/02/2011 09:00 AM, Lamar Owen wrote:
On Wednesday, February 02, 2011 02:06:15 am Chuck Munro wrote:
The real key is to carefully label each SATA cable and its associated drive. Then the little mapping script can be used to identify the faulty drive, which mdadm reports only by its device name. It just occurred to me that whenever mdadm sends an email report, it can also run a script which groks out the path info and puts it in the email message. Problem solved :-)
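A rough sketch of that idea (the script path, mail command, and output format here are placeholders, not anything settled), using the PROGRAM line in /etc/mdadm.conf so mdadm --monitor calls a handler with the event, the md device, and the component device:

    #!/bin/sh
    # Hypothetical handler, wired in via /etc/mdadm.conf, e.g.:
    #   PROGRAM /usr/local/sbin/md-event-report
    # mdadm --monitor invokes it as: <event> <md device> [component device]
    EVENT=$1
    MD_DEV=$2
    COMPONENT=$3
    {
        echo "mdadm event: $EVENT on $MD_DEV (component: ${COMPONENT:-unknown})"
        if [ -n "$COMPONENT" ]; then
            echo "hardware path: $(udevadm info --query=path --name="$COMPONENT")"
            hdparm -I "$COMPONENT" | grep Serial.Number
        fi
    } | mail -s "mdadm event on $(hostname)" root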
Ok, perhaps I'm dense, but if this is not a hot-swap bay you're talking about, wouldn't it be easier to have the drive's serial number (or other identifier found on the label) pulled into the e-mail and compared with the label physically found on the drive, since you're going to have to open the case anyway? Using something like:
hdparm -I $DEVICE | grep Serial.Number
works here (the regexp Serial.Number matches the string "Serial Number" without requiring double quotes around the pattern). Use whatever $DEVICE you need, as long as it's on a controller compatible with hdparm.
I have seen cases with a different Linux distribution where the actual module load order was nondeterministic (modules loaded in parallel); while upstream and the CentOS rebuild try to make things more deterministic, wouldn't it be safer to get a really unique, externally visible identifier from the drive? If the drive has failed to the degree that it won't respond to the query, then query all the good drives in the array for their serial numbers, and use a process of elimination. This, IMO, is more robust than relying on the drive detect order to remain deterministic.
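For the process-of-elimination pass, something like this rough sketch would do (the drive letters are just an assumption about what the box has):

    # Query every SATA drive the kernel currently sees; a member that has
    # dropped off the bus simply won't answer, and the survivors' serial
    # numbers narrow down which physical drive is gone.
    for dev in /dev/sd[a-z]; do
        serial=$(hdparm -I "$dev" 2>/dev/null | grep Serial.Number)
        echo "$dev: ${serial:-no response, possibly the failed drive}"
    done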
If the drives are in a hot-swap or cold-swap bay, do some data access to the array and see which LEDs don't blink; that should correspond to the failed drive. If the bay has secondary LEDs, you might be able to blink those, too.
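Something as blunt as the following would do for the watch-the-LEDs test (the md device name is only a guess):

    # Push a large sequential read through the array and watch the bays;
    # the LED that stays dark belongs to the missing/failed drive.
    dd if=/dev/md0 of=/dev/null bs=1M count=4096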
Well no, you're not being dense. It's a case of making the best of what the physical hardware can do for me. In my case, the drives are segregated into several 3-drive bays which are bolted into the case individually, so comparing serial numbers would be a major pain, since I'd have to unbolt a bay and pull each drive one at a time to read its label.
The use of the new RHEL-6/CentOS-6 'udevadm' command nicely maps out the hardware path no matter the order the drives are detected/named, and since hardware paths are fixed, I just have to attach a little tag to each SATA cable with that path number on it. One thing I did was reboot the machine *many* times to make sure the controller cards were always enumerated by Linux in the same slot order.
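The mapping script itself can be tiny; a minimal sketch (device names assumed, output format made up) is just:

    # Print device name -> fixed hardware (sysfs) path, to match against
    # the tags on the SATA cables.
    for dev in /dev/sd[a-z]; do
        echo "$dev -> $(udevadm info --query=path --name="$dev")"
    done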
I also notice that the RHEL-6 DriveInfo GUI application shows which drive is giving trouble, but it only maps the controllers in a vague way with respect to the hardware path. (At least that's what I remember seeing a couple of days ago; I could be mistaken.)
On this particular machine I don't have the luxury of per-drive LED activity indicators, so whacking each drive with a big read won't point the way (but I have used that technique on other machines). I didn't have the funds to buy the hot-swap bays I would have preferred. I may retrofit later.
Your suggestions are well taken, but the hardware I have doesn't readily allow me to use them. Thanks for the ideas.
Chuck
On 2/2/11 5:57 PM, Chuck Munro wrote:
The use of the new RHEL-6/CentOS-6 'udevadm' command nicely maps out the hardware path no matter the order the drives are detected/named, and since hardware paths are fixed, I just have to attach a little tag to each SATA cable with that path number on it. One thing I did was reboot the machine *many* times to make sure the controller cards were always enumerated by Linux in the same slot order.
I think there are ways that drives can fail that would make them not be detected at all, and for an autodetected RAID member in a system that has been rebooted, not leave much evidence of where it was when it worked. If your slots are all full you may still be able to figure it out, but it might be a good idea to save a copy of the listing when you know everything is working.
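A sketch of what I mean (the file name and device list are just placeholders) could be as simple as:

    # Record device name, hardware path, and serial number for each drive
    # while everything is healthy.
    for dev in /dev/sd[a-z]; do
        printf '%s  %s  ' "$dev" "$(udevadm info --query=path --name="$dev")"
        hdparm -I "$dev" | grep Serial.Number
    done > /root/drive-map-known-good.txt

Keeping a copy of that file somewhere off the box wouldn't hurt either.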
On Wednesday, February 02, 2011 08:04:43 pm Les Mikesell wrote:
I think there are ways that drives can fail that would make them not be detected at all, and for an autodetected RAID member in a system that has been rebooted, not leave much evidence of where it was when it worked. If your slots are all full you may still be able to figure it out, but it might be a good idea to save a copy of the listing when you know everything is working.
I'll echo this advice.
I guess I'm spoiled by my EMC arrays, which light a yellow LED on the DAE and on the individual drive, as well as telling you which backend bus, which enclosure, and which drive in that enclosure. And the EMC-custom firmware is paranoid about errors.
But my personal box is a used SuperMicro dual Xeon I got at the depth of the recession in December 2009, and paid a song and half a dance for it. It had the six-bay hot-swap SCSI setup, and I replaced it with the six-bay hot-swap SATA version, put in a used (and cheap) 3Ware 9500S controller, and have a RAID5 of four 250GB drives for the boot and root volumes, and an MD RAID1 pair of 750GB drives for /home. The Supermicro motherboard didn't have SATA ports, but I got a 64-bit PCI-X dual internal SATA/dual eSATA low-profile board with the low-profile bracket to fit the 2U case. Total cost <$500.