[CentOS] Software RAID Level 1, smartd and changing dev numbers

Wed Feb 16 18:41:03 UTC 2011
Robert Heller <heller at deepsoft.com>

At Wed, 16 Feb 2011 12:38:53 -0500 (EST) CentOS mailing list <centos at centos.org> wrote:

> 
> > At Wed, 16 Feb 2011 12:00:27 -0500 (EST) CentOS mailing list
> > <centos at centos.org> wrote:
> >
> >>
> >> We have about 50 CentOS servers with software RAID level 1 (mirroring).
> >> Each week, we swap out one of the drives (the one in the second of four
> >> hot-swap bays, only the first two of which contain drives) on each
> >> server and take them offsite for safekeeping.
> >>
> >> The problem is, the kernel seemingly randomly switches between /dev/sdb
> >> and /dev/sdc for these devices.  This makes the process slower by
> >> requiring more manual input where scripts could otherwise suffice.
> >
> > I'm assuming these are actually SATA disks with a controller that
> > supports hot-swap.
> 
> Correct.
> 
> > What I think is happening is that the kernel retains some 'memory' of
> > the pulled drive (say /dev/sdb), so when the fresh drive is installed, a
> > new dev file is created (/dev/sdc).  By the time of the next swap,
> > /dev/sdb has been forgotten, so /dev/sdb is assigned to the next fresh
> > disk.
> 
> Interesting...one would think that this behavior would be consistent
> across all servers then, but it isn't.  Most accept the same dev,
> /dev/sdb, but some assign /dev/sdc.  Is there a way to just disable
> /dev/sdc and force the kernel to use /dev/sdb every time?

It could be something as simple as timing: how long the kernel takes to
get around to recycling the device objects.  I would also look very
closely at the *exact* order of tasks (mdadm -f ..., mdadm -r ...), how
much time there is between these tasks, and how busy the specific
machine is.  It could be that the disk is being pulled too soon, or that
not enough time is left between the 'fail' and the 'remove'.  That is,
the kernel is still doing something with the disk (e.g. it has some
'unfinished business') and is thus not releasing the device object.  The
amount of time needed for things to settle will likely vary with system
load and with just what the system is doing (e.g. a database server will
differ from a file server, which will differ from a DNS server, etc.).
It might also depend on the size of the disks and the type of controller
(and the driver it uses).
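
If the swap is scripted, one way to allow for that settling time is to
pause after the 'fail' and retry the 'remove' until mdadm accepts it.
What follows is only a rough sketch, not a tested procedure.  The
function name, the 5 second pause and the retry count are my own
assumptions for illustration, and it assumes mdadm exits non-zero when
the remove is refused.

```python
import subprocess
import time


def fail_then_remove(array, member, settle_seconds=5.0, retries=5,
                     run=subprocess.run, sleep=time.sleep):
    """Mark `member` faulty in `array`, then retry the remove until it works.

    Returns the number of remove attempts that were needed.
    `run` and `sleep` are injectable so the logic can be dry-run.
    """
    # Step 1: mark the member as failed (the 'mdadm -f' step).
    run(["mdadm", array, "--fail", member], check=True)

    # Step 2: wait for things to settle, then try the remove (the
    # 'mdadm -r' step).  If mdadm refuses, wait again and retry.
    for attempt in range(1, retries + 1):
        sleep(settle_seconds)
        result = run(["mdadm", array, "--remove", member], check=False)
        if result.returncode == 0:
            return attempt

    raise RuntimeError(
        f"{member} still not removed from {array} after {retries} attempts"
    )
```

It would be called as root with something like
fail_then_remove("/dev/md0", "/dev/sdb1"), where both names are
placeholders for the real array and member partition.  Only pull the
disk once it returns.  The pause and retry count would need tuning per
machine, for the reasons above.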

> 
> > Question: are you always swapping in a *new* disk each week or
> > re-inserting the disk from the previous week?
> 
> It's a rotation, so re-inserting from the previous week.

Umm.  It has been stated elsewhere, but RAID is not really a substitute
for proper backups.

> 
> >>
> >> It also confuses smartd, which AFAIK, needs the correct device names to
> >> report accurately.
> >>
> >> Ideally, we'd like to force the OS at some level to always see these
> >> devices as /dev/sda and /dev/sdb.  If not, is there at least some way to
> >> configure smartd to be "smart" and recognize which devices are in use?
> >
> > The cure might be that you need to do a reboot to properly rescan the
> > disks.
> 
> Ugh.  Thanks for your response.

-- 
Robert Heller             -- 978-544-6933 / heller at deepsoft.com
Deepwoods Software        -- http://www.deepsoft.com/
()  ascii ribbon campaign -- against html e-mail
/\  www.asciiribbon.org   -- against proprietary attachments