[CentOS] [OT] RAID 6 - opinions

Fri Apr 12 05:48:38 UTC 2013
Keith Keller <kkeller at wombat.san-francisco.ca.us>

On 2013-04-12, David Miller <millerdc at fusion.gat.com> wrote:
> On Apr 11, 2013, at 5:25 PM, John R Pierce <pierce at hogranch.com> wrote:
>> yeah, until a disk fails on a 40 disk array and the chassis LEDs on the 
>> backplane don't light up to indicate which disk it is and your 
>> operations monkey pulls the wrong one and crashes the whole RAID.


> You simply match up the Linux /dev/sdX designation with the drive's
> serial number using smartctl. When I first bring the array online I
> have a script that greps the drives' serial numbers out of smartctl
> and creates a neat text file with the mappings. When either smartd or
> md complains about a drive, I remove the drive from the RAID using
> mdadm and then pull the drive based on the mapping file. Drive 0 in
> those SuperMicro SAS/SATA arrays is always the lowest drive letter,
> and the letters go up from there. If a drive is replaced I just update
> the text file accordingly. You can also print out the drive serial
> numbers and put them on the front of the removable drive cages. It is
> not as elegant as a blinking LED but it works just as well. I have
> been doing it like this for 6 plus years now with a few dozen
> SuperMicro arrays. I have never pulled a wrong drive.
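
For reference, the kind of mapping script described above might look
roughly like the following.  This is only a sketch, not the poster's
actual script: it assumes `smartctl -i` prints a "Serial Number:" line
(ATA) or "Serial number:" line (SCSI/SAS), and the names parse_serial,
read_identity and build_mapping are made up for illustration.

```python
import re
import subprocess

# Matches "Serial Number:" (ATA) and "Serial number:" (SCSI/SAS) lines.
SERIAL_RE = re.compile(r"^Serial number:\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def parse_serial(smartctl_output):
    """Return the serial number found in `smartctl -i` output, or None."""
    match = SERIAL_RE.search(smartctl_output)
    return match.group(1) if match else None


def read_identity(dev):
    """Run `smartctl -i` on one device and return its text output."""
    result = subprocess.run(
        ["smartctl", "-i", dev],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def build_mapping(devices, read=read_identity):
    """Map each device path to its serial number (None if not found)."""
    return {dev: parse_serial(read(dev)) for dev in devices}
```

Run as root, build_mapping(["/dev/sda", "/dev/sdb"]) would give a dict
of device path to serial number that can be written out as the text
file.  The `read` argument is there only so the parsing can be checked
without real drives.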

I think that there is at least one potential problem, and possibly more,
with your method.

1) It only takes forgetting to update the mapping file once to screw
things up for yourself.  Some people are the type who will never forget
to do that.  I'm (unfortunately) not.  (Actually, I guess it takes
twice, since with only one slot out of date you could use the serial
numbers to account for every other drive, and the one left over is the
suspect drive.  I wouldn't want to trust that process.)

2) Drive assignments can be dynamic.  If you pull the tray in port 0,
which was sda (for example), you're not guaranteed that the
replacement drive will be sda.  It might be assigned the next available
sdX.  I have seen this in certain failure situations.  (As an aside, how
does the kernel handle more than 26 hard drive devices?  sdaa?  sdA?)

1a and 2a) Printing serial numbers and taping them to the tray is much
less error-prone, but also more time-consuming.  If you have a label
printer, that certainly makes things easier.

3) If you have someone else pulling drives for you, they may not have
access to the mapping file, and/or may not be willing or under contract
to print a new tray label and replace it.  It's way less error-prone to
tell an "operations monkey" to pull the blinky drive than to hope you
read the mapping file correctly and relay the correct location to the
monkey.  (The ops monkey may not have login rights on your server, so
you also can't rely on him being able to look at the mapping file
himself.)  If you're the only person who will ever pull drives, this
isn't such a huge problem.
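
On the aside in 2): the Linux sd driver continues past sdz with
two-letter suffixes (sdaa, sdab, ... sdzz) and then three letters
(sdaaa).  A minimal sketch of that naming scheme, where sd_name is a
made-up helper name rather than anything the kernel exposes, and the
index is simply the zero-based order in which disks were enumerated:

```python
def sd_name(index):
    """Return the sd device name for a zero-based disk index (0 -> 'sda')."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = ""
    n = index
    while True:
        # Take the lowest "digit" as a letter, then shift down by one
        # so that the sequence goes ..., sdz, sdaa, sdab, ...
        letters = chr(ord("a") + n % 26) + letters
        n = n // 26 - 1
        if n < 0:
            break
    return "sd" + letters
```

So a 40-disk chassis, if its disks happened to be enumerated
contiguously starting at sda, would run from sda through sdan.  That is
one more reason not to lean on the letters alone when deciding which
tray to pull.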

That's not to say that your methods can't work--obviously they can if
you haven't had any mistakes in many years.  But the combination of a
BBU-backed write cache and an identify blink makes a dedicated hardware
RAID controller a big win for me.  (I do also use md RAID, even on
hardware RAID controllers, where flexibility and portability are more
important than performance.)


kkeller at wombat.san-francisco.ca.us