[CentOS] Software RAID Level 1, smartd and changing dev numbers

Wed Feb 16 18:47:16 UTC 2011
James Smallacombe <james at sicom.com>

> At Wed, 16 Feb 2011 12:38:53 -0500 (EST) CentOS mailing list
> <centos at centos.org> wrote:
>
>>
>> > At Wed, 16 Feb 2011 12:00:27 -0500 (EST) CentOS mailing list
>> > <centos at centos.org> wrote:
>> >
>> >>
>> >> We have about 50 CentOS servers with software RAID level 1 (mirroring).
>> >> Each week, we swap out one of the drives (the one in the second of four
>> >> hot-swap bays, only the first two of which contain drives) on each server
>> >> and take them offsite for safekeeping.
>> >>
>> >> The problem is, the kernel seemingly randomly switches between /dev/sdb
>> >> and /dev/sdc for these devices.  This makes the process slower by
>> >> requiring more manual input where a script(s) could otherwise suffice.
>> >
>> > I'm assuming these are actually SATA disks with a controller that
>> > supports hot-swap.
>>
>> Correct.
>>
>> > What I think is happening is that the kernel retains some 'memory' of
>> > the pulled drive (say /dev/sdb) and when the fresh drive is installed, a
>> > new dev file is created (/dev/sdc).  Eventually, /dev/sdb is forgotten
>> > by the time of the next 'swap' and /dev/sdb is assigned to the next
>> > fresh disk.
>>
>> Interesting...one would think that this behavior would be consistent
>> across all servers then, but it isn't.  Most accept the same dev,
>> /dev/sdb, but some assign /dev/sdc.  Is there a way to just disable
>> /dev/sdc and force the kernel to use /dev/sdb every time?
>
> It could be something as simple as 'timing'.  Like how long it takes for
> the kernel to get around to re-cycling the device objects.  I would also
> look real closely at the *exact* order of tasks (mdadm -f ..., mdadm -r
> ..) and how much time there is between these tasks and how 'busy' the
> specific machine is.  It could be that the disk is being pulled too soon
> or not enough time is left between the 'fail' and the 'remove' -- that
> is, the kernel is still doing something with the disk (e.g. has some
> 'unfinished business') and is thus not releasing the device object. It
> is likely that the amount of time needed for things to 'settle' will
> vary based on things like system load and just what the system is doing
> (e.g. a database server will be different from a file server which will be
> different from a DNS server, etc.).  And it might also depend on the
> size of the disks and the type of controller (and the driver it uses).

Interesting...I will discuss with the tech who swaps the drives out.
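
For reference, a minimal sketch of how the 'fail' then 'remove' sequence
mentioned above could be scripted with a pause between the two steps.
Everything here is an assumption for illustration, not our actual script:
the function names resolve_member and run_swap_out are made up, the
10-second pause is a guess that would need tuning per server, and the idea
of looking the disk up through a stable udev symlink (for example one under
/dev/disk/by-path) instead of hard-coding /dev/sdb or /dev/sdc is something
I have not verified on these machines.

```python
import os
import subprocess
import time


def resolve_member(stable_link):
    """Follow a stable symlink to whatever kernel name the disk has right now.

    stable_link is hypothetical: for instance a per-bay link under
    /dev/disk/by-path, if udev on the server creates one.
    """
    return os.path.realpath(stable_link)


def run_swap_out(array, member, settle_seconds=10,
                 runner=subprocess.check_call, sleep=time.sleep):
    """Mark a mirror member faulty, wait, then remove it from the array.

    The pause gives the kernel time to finish with the disk before the
    remove; 10 seconds is an arbitrary starting point, not a measured value.
    runner and sleep are injectable so the sequence can be dry-run.
    """
    fail_cmd = ["mdadm", "--manage", array, "--fail", member]
    remove_cmd = ["mdadm", "--manage", array, "--remove", member]
    runner(fail_cmd)        # same as: mdadm <array> -f <member>
    sleep(settle_seconds)   # let things 'settle' before the remove
    runner(remove_cmd)      # same as: mdadm <array> -r <member>
    return [fail_cmd, remove_cmd]
```

A dry run would pass runner=print and sleep=lambda s: None to see the two
commands without touching the array.  Whether a fixed pause is enough to stop
the sdb/sdc flip is exactly what still needs testing.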

>> > Question: are you always swapping in a *new* disk each week or
>> > re-inserting the disk from the previous week?
>>
>> It's a rotation, so re-inserting from the previous week.
>
> Umm.  It has been stated elsewhere, but RAID is not really a substitute
> for proper backups.

I agree.  Proper archiving is also in place.  The drive rotation runs
alongside it, to allow for a faster recovery in the event of other hardware
failure.  It has been useful many times already.