On 2013-04-12, David Miller millerdc@fusion.gat.com wrote:
On Apr 11, 2013, at 5:25 PM, John R Pierce pierce@hogranch.com wrote:
yeah, until a disk fails on a 40 disk array and the chassis LEDs on the backplane don't light up to indicate which disk it is and your operations monkey pulls the wrong one and crashes the whole raid.
[snip]
You simply match up the Linux /dev/sdX designation with the drive's serial number using smartctl. When I first bring the array online I have a script that greps out the drives' serial numbers from smartctl and creates a neat text file with the mappings. When either smartd or md complains about a drive I remove the drive from the RAID using mdadm and then pull the drive based on the mapping file. Drive 0 in those SuperMicro SAS/SATA arrays is always the lowest drive letter, and the letters go up from there. If a drive is replaced I just update the text file accordingly. You can also print out the drive serial numbers and put them on the front of the removable drive cages. It is not as elegant as a blinking LED but it works just as well. I have been doing it like this for 6 plus years now with a few dozen SuperMicro arrays. I have never pulled a wrong drive.
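For illustration only, here is a minimal sketch of what such a mapping script might look like. It is not the script described above. It assumes that `smartctl -i DEVICE` prints a line beginning with "Serial Number:" (or "Serial number:" on some SAS drives), and the function names (`parse_serial`, `build_mapping`, `format_mapping`) are hypothetical.

```python
import re
import subprocess

# Assumption: `smartctl -i` prints a line such as "Serial Number:    WD-ABC123".
# Case is ignored because some SAS drives report "Serial number:" instead.
SERIAL_RE = re.compile(r"^Serial Number:[ \t]*(\S+)", re.MULTILINE | re.IGNORECASE)


def parse_serial(smartctl_output):
    """Return the serial number found in `smartctl -i` text, or None."""
    match = SERIAL_RE.search(smartctl_output)
    return match.group(1) if match else None


def read_smartctl_info(device):
    """Run `smartctl -i DEVICE` and return its stdout (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-i", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes for warnings too
    )
    return result.stdout


def build_mapping(devices, reader=read_smartctl_info):
    """Map each device path to its serial number (None if not found)."""
    return {dev: parse_serial(reader(dev)) for dev in devices}


def format_mapping(mapping):
    """Render the mapping as one 'device serial' pair per line.

    Sorting by (length, name) keeps /dev/sdz ahead of /dev/sdaa.
    """
    ordered = sorted(mapping, key=lambda dev: (len(dev), dev))
    lines = [f"{dev} {mapping[dev] or 'UNKNOWN'}" for dev in ordered]
    return "\n".join(lines) + "\n"
```

To use it, one might call `build_mapping(glob.glob("/dev/sd[a-z]"))` as root and write the result of `format_mapping(...)` to a text file. Re-running it just before pulling a drive, rather than trusting an older file, would also catch any device letters that have moved.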
I think that there is at least one potential problem, and possibly more, with your method.
1) It only takes once forgetting to update the mapping file to screw things up for yourself. Some people are the type who will never forget to do that. I'm (unfortunately) not. (Actually, I guess it takes twice, since if you have only one slot not up to date, you could use the serial numbers to map all but the one drive, and that's the suspect drive. I wouldn't want to trust that process.)
2) Drive assignments can be dynamic. If you pull the tray in port 0, which was sda (for example), you're not necessarily guaranteed that the replacement drive will be sda. It might be assigned the next available sdX. I have seen this in certain failure situations. (As an aside, how does the kernel handle more than 26 hard drive devices? sdaa? sdA?)
1a and 2a) Printing serial numbers and taping them to the tray is much less error-prone, but also more time-consuming. If you have a label printer, that certainly makes things easier.
3) If you have someone else pulling drives for you, they may not have access to the mapping file, and/or may not be willing or under contract to print a new tray label and replace it. It's way less error-prone to tell an "operations monkey" to pull the blinky drive than to hope you read the mapping file correctly, and relay the correct location to the monkey. (The ops monkey may not have login rights on your server, so you also can't rely on him being able to look at the mapping file himself.) If you're the only person who will ever pull drives, this isn't such a huge problem.
That's not to say that your methods can't work--obviously they can if you haven't had any mistakes in many years. But the combination of a BBU-backed write cache and an identify blink makes a dedicated hardware RAID controller a big win for me. (I do also use md RAID, even on hardware RAID controllers, where flexibility and portability are more important than performance.)
--keith