[CentOS] C7, mdadm issues

Wed Jan 30 15:33:44 UTC 2019
mark <m.roth at 5-cent.us>

Alessandro Baggi wrote:
> Il 30/01/19 14:02, mark ha scritto:
>> On 01/30/19 03:45, Alessandro Baggi wrote:
>>> Il 29/01/19 20:42, mark ha scritto:
>>>> Alessandro Baggi wrote:
>>>>> Il 29/01/19 18:47, mark ha scritto:
>>>>>> Alessandro Baggi wrote:
>>>>>>> Il 29/01/19 15:03, mark ha scritto:
>>>>>>>
>>>>>>>> I've no idea what happened, but the box I was working on
>>>>>>>> last week has a *second* bad drive. Actually, I'm starting
>>>>>>>> to wonder about that particular hot-swap bay.
>>>>>>>>
>>>>>>>> Anyway, mdadm --detail shows /dev/sdb1 as removed. I've added
>>>>>>>> /dev/sdi1...
>>>>>>>> but it shows both /dev/sdh1 and /dev/sdi1 as spares, and I have yet
>>>>>>>> to find a reliable way to make either one active.
>>>>>>>>
>>>>>>>> Actually, I would have expected Linux RAID to replace a
>>>>>>>> failed drive with a spare....
>>>>
>>>>>>> Can you report your RAID configuration (RAID level and number of
>>>>>>> devices) and the current status from /proc/mdstat?
>>>>>>>
>>>>>> Well, nope. I got to the point of rebooting the system (xfs had the
>>>>>> RAID volume and wouldn't let go; I also commented out the RAID
>>>>>> volume).
>>>>>>
>>>>>> It's RAID 5; /dev/sdb *also* appears to have died. If I do
>>>>>> mdadm --assemble --force -v /dev/md0 /dev/sd[cefgdh]1
>>>>>> mdadm: looking for devices for /dev/md0
>>>>>> mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 0.
>>>>>> mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot -1.
>>>>>> mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 2.
>>>>>> mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 3.
>>>>>> mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 4.
>>>>>> mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot -1.
>>>>>> mdadm: no uptodate device for slot 1 of /dev/md0
>>>>>> mdadm: added /dev/sde1 to /dev/md0 as 2
>>>>>> mdadm: added /dev/sdf1 to /dev/md0 as 3
>>>>>> mdadm: added /dev/sdg1 to /dev/md0 as 4
>>>>>> mdadm: no uptodate device for slot 5 of /dev/md0
>>>>>> mdadm: added /dev/sdd1 to /dev/md0 as -1
>>>>>> mdadm: added /dev/sdh1 to /dev/md0 as -1
>>>>>> mdadm: added /dev/sdc1 to /dev/md0 as 0
>>>>>> mdadm: /dev/md0 assembled from 4 drives and 2 spares - not enough
>>>>>> to start the array.
>>>>>>
>>>>>> --examine shows me /dev/sdd1 and /dev/sdh1, but says that both are
>>>>>> spares.
>>>>> Hi Mark,
>>>>> please post the result from
>>>>>
>>>>> cat /sys/block/md0/md/sync_action
>>>>
>>>> There is none. There is no /dev/md0. mdadm refuses, saying that
>>>> it's lost too many drives.
>>>>
>>>>        mark
>>>>
>>>>
>>>
>>> I suppose that your config is 5 drives and 1 spare, with 1 drive
>>> failed. It's strange that your spare was not used for a resync.
>>> Then you added a new drive, but the array does not start because it
>>> marks the new disk as a spare, so you have a RAID 5 with 4 devices
>>> and 2 spares.
>>>
>>> First, I hope that you have a backup of all your data; don't run
>>> any exotic commands before backing up. If you can't back up your
>>> data, that's a problem.
>>
>> This is at work. We have automated nightly backups, and I do offline
>> backups of the backups every two weeks.
>>>
>>> Have you tried removing the last added device, sdi1, restarting the
>>> RAID, and forcing a resync?
>>
>> The thing is, it had one? two? spares when /dev/sdb1 started dying, and
>>  it didn't use them.
>>>
>>> Have you tried removing these 2 devices and re-adding only the device
>>> that will be useful for the resync? Maybe you can set 5 devices for
>>> your RAID instead of 6; if it works (after the resync), you can add
>>> your spare device and grow your RAID set.
>>
>> I tried, and that's when I lost it (again); it refuses to
>> assemble/start the RAID: "not enough devices".
>>>
>>> Reading on Google, many users use --zero-superblock before re-adding
>>> the device.
>>
>> I can take one out and re-add it, but I think I'm going to have to
>> recreate the RAID again, and again restore from backup.
>>>
>>> Other users reassemble the RAID using --assume-clean, but I don't know
>>> what effect it will produce.
>
> I hope that someone gives you better help with this.
>
> Update here if you find the solution.
>
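
For anyone who finds this in the archives: the more conventional path
being suggested above would look roughly like this (device names as used
earlier in the thread; a sketch of the idea, not something I actually ran):

  # Stop the half-assembled array, wipe the stale metadata on the disk
  # that keeps coming up as a spare (sdh1 here, purely as an example),
  # force-assemble the known-good members, then re-add the wiped disk
  # so it rebuilds as a real member.
  mdadm --stop /dev/md0
  mdadm --zero-superblock /dev/sdh1
  mdadm --assemble --force /dev/md0 /dev/sdc1 /dev/sde1 /dev/sdf1 /dev/sdg1
  mdadm /dev/md0 --add /dev/sdh1

That assumes enough members are still in sync for the array to start
degraded, which didn't seem to be the case here.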

Not that I'm into American football, but I seem to have pulled off what I
understand is called a hail-mary: *without* zeroing the superblocks, I
did a create with all six good drives, excluding /dev/sdb1, and explicitly
told it one spare.
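
For the record, the command was roughly of this shape; treat the details
as illustrative, since the device order, chunk size and metadata version
all have to match the original array, or the data is gone for good:

  # Re-create the array in place over the existing members: five active
  # devices plus one spare. The old superblocks get overwritten, but the
  # data blocks survive as long as the geometry matches the original.
  mdadm --create /dev/md0 --level=5 --raid-devices=5 --spare-devices=1 \
        /dev/sdc1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdd1 /dev/sdh1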

And the array is there, complete with data, with *one* spare, five good
drives, and it's currently rebuilding the spare.
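
Watching the rebuild is just the usual:

  # recovery percentage and ETA
  cat /proc/mdstat
  # array-level view, and the current sync state ("recover"/"resync"
  # while it runs, "idle" when it's done)
  mdadm --detail /dev/md0
  cat /sys/block/md0/md/sync_action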

The last resort worked, though we'll see for how long.

        mark