Les Mikesell wrote:
On 1/30/11 1:37 PM, Chuck Munro wrote:
Hello list members,
My adventure into udev rules has taken an interesting turn. I did discover a stupid error in the way I was attempting to assign static disk device names on CentOS-5.5, so that's out of the way.
But in the process of exploring, I installed a trial copy of RHEL-6 on the new machine to see if anything had changed (since I intend this box to run CentOS-6 anyway).
Lots of differences, and it's obvious that Red Hat does things a bit differently here and there. My focus has been on figuring out how best to solve my udev challenge, and I found that tools like 'scsi_id' and the udev admin/test commands have changed. The udev rules themselves seem to be the same.
Do any of the names under /dev/disk/* work for your static identifiers? You should be able to use them directly instead of using udev to map them to something else, making it more obvious what you are doing. And are these names the same under RHEL6?
I was happy to see that device names (at least for SCSI disks) have not changed. The more I look into the whole problem, the more I realize that I've overstated the difficulty, now that I know how to map out the hardware path for any given /dev/sdX I might need to replace. I've never dug as deeply into this before, mostly because I never could find the spare time.
I'm happy with simply writing a little script which accepts a /dev/sdX device name argument and reformats the output of 'udevadm info --query=path --name=/dev/sdX' to extract the hardware path. It's a bit cleaner than the current RHEL-5/CentOS-5 'udevinfo' command.
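Something along these lines is what I have in mind; the script name and the amount of path trimming are placeholders, so treat it as a rough sketch:

    #!/bin/bash
    # diskpath.sh -- print the hardware path behind a given /dev/sdX (hypothetical wrapper)
    DEV=${1:?usage: diskpath.sh /dev/sdX}

    # udevadm returns the full sysfs path, something like
    #   /devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/block/sdc
    SYSPATH=$(udevadm info --query=path --name="$DEV")

    # Drop the trailing /block/sdX so only the controller/port chain remains
    echo "$DEV -> ${SYSPATH%/block/*}"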
Using the numeric path assumes knowledge of how the motherboard sockets are laid out and the order in which I/O controller channels are discovered, of course. It's then not difficult to trace a failed drive by attaching little tags to the SATA cables from the controller cards.
The real key is to carefully label each SATA cable and its associated drive. Then the little mapping script can be used to identify the faulty drive which mdadm reports by its device name. It just occurred to me that whenever mdadm sends an email report, it can also run a script which groks out the path info and puts it in the email message. Problem solved :-)
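For the mdadm hook I'm picturing something along these lines, wired in through a PROGRAM line in /etc/mdadm.conf (mdadm calls the program with the event, the md device and, when one is involved, the component device). The script name and mail recipient are just made-up examples:

    # in /etc/mdadm.conf:
    #   PROGRAM /usr/local/sbin/md-event.sh

    #!/bin/bash
    # md-event.sh -- hypothetical mdadm event handler
    EVENT=$1
    MDDEV=$2
    COMPONENT=$3    # only set for per-device events such as Fail

    {
        echo "mdadm event: $EVENT on $MDDEV"
        if [ -n "$COMPONENT" ]; then
            HWPATH=$(udevadm info --query=path --name="$COMPONENT")
            echo "component:     $COMPONENT"
            echo "hardware path: ${HWPATH%/block/*}"
        fi
    } | mail -s "mdadm event on $(hostname)" root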
So even though I figured out how to add 'alias' symlink names to each disk drive, I'm not going to bother with it. It was a very useful learning experience, though.
Chuck
On Wednesday, February 02, 2011 02:06:15 am Chuck Munro wrote:
The real key is to carefully label each SATA cable and its associated drive. Then the little mapping script can be used to identify the faulty drive which mdadm reports by its device name. It just occurred to me that whenever mdadm sends an email report, it can also run a script which groks out the path info and puts it in the email message. Problem solved :-)
Ok, perhaps I'm dense, but if this is not a hot-swap bay you're talking about, wouldn't it be easier to have the drive's serial number (or other identifier found on the label) pulled into the e-mail, and compare it with the label physically found on the drive, since you're going to have to open the case anyway? Using something like:
hdparm -I $DEVICE | grep Serial.Number
works here (the regexp Serial.Number matches the string "Serial Number" without needing to quote the embedded space). Use whatever $DEVICE you need to use, as long as it's on a controller compatible with hdparm usage.
I have seen cases with a different Linux distribution where the actual module load order was nondeterministic (modules loaded in parallel); while upstream and the CentOS rebuild try to make things more deterministic, wouldn't it be safer to get a really unique, externally visible identifier from the drive? If the drive has failed to the degree that it won't respond to the query, then query all the good drives in the array for their serial numbers, and use a process of elimination. This, IMO, is more robust than relying on the drive detect order to remain deterministic.
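Something like this (an untested sketch which just walks every /dev/sd? device the kernel can see) collects the serials in one pass for the process of elimination:

    # Grab the serial number of every SATA/SCSI disk present
    for disk in /dev/sd[a-z]; do
        printf '%s: ' "$disk"
        hdparm -I "$disk" 2>/dev/null | grep Serial.Number || echo "no response (suspect?)"
    done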
If in a hotswap or coldswap bay, do some data access to the array, and see which LEDs don't blink; that should correspond to the failed drive. If the bay has secondary LEDs, you might be able to blink those, too.
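A crude way to generate that traffic (the md device name below is only an example):

    # Generate reads across the whole array; the dead drive's LED stays dark
    dd if=/dev/md0 of=/dev/null bs=1M count=4096

    # Or read one member at a time to map each LED to a device name
    for disk in /dev/sd[a-z]; do
        echo "reading from $disk"
        dd if="$disk" of=/dev/null bs=1M count=256 2>/dev/null
    done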