[CentOS] C7, mdadm issues

Wed Jan 30 18:17:37 UTC 2019
Alessandro Baggi <alessandro.baggi at gmail.com>

On 30/01/19 18:49, mark wrote:
> Alessandro Baggi wrote:
>> On 30/01/19 16:33, mark wrote:
>>
>>> Alessandro Baggi wrote:
>>>
>>>> On 30/01/19 14:02, mark wrote:
>>>>
>>>>> On 01/30/19 03:45, Alessandro Baggi wrote:
>>>>>
>>>>>> On 29/01/19 20:42, mark wrote:
>>>>>>
>>>>>>> Alessandro Baggi wrote:
>>>>>>>
>>>>>>>> On 29/01/19 18:47, mark wrote:
>>>>>>>>
>>>>>>>>> Alessandro Baggi wrote:
>>>>>>>>>
>>>>>>>>>> On 29/01/19 15:03, mark wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I've no idea what happened, but the box I was working
>>>>>>>>>>> on last week has a *second* bad drive. Actually, I'm
>>>>>>>>>>> starting to wonder about that particular hot-swap bay.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, mdadm --detail shows /dev/sdb1 as removed. I've
>>>>>>>>>>> added /dev/sdi1...
>>>>>>>>>>> but I see both /dev/sdh1 and /dev/sdi1 as spares, and I have
>>>>>>>>>>> yet to find a reliable way to make either one active.
>>>>>>>>>>>
>>>>>>>>>>> Actually, I would have expected Linux RAID to
>>>>>>>>>>> replace a failed drive with a spare....
>>>>>>>
>>>>>>>>>> Can you report your RAID configuration (RAID level and number of
>>>>>>>>>> devices) and the current status from /proc/mdstat?
>>>>>>>>>>
>>>>>>>>>>
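For reference, the usual way to pull that together is something like the
following (md0 is just the assumed array name here):

   cat /proc/mdstat            # overall state plus any rebuild progress
   mdadm --detail /dev/md0     # level, member list, failed/spare counts
   mdadm --examine /dev/sdb1   # per-member superblock, useful when the array won't assemble
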
>>>>>>>>> Well, nope. I got to the point of rebooting the system (xfs
>>>>>>>>> had the RAID volume and wouldn't let go); I also commented
>>>>>>>>> out the RAID volume.
>>>>>>>>>
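(Presumably the usual sequence before touching the array: umount the xfs
filesystem, "umount -l" if something still holds it, then
"mdadm --stop /dev/md0"; and the commented-out line would be the
/etc/fstab entry.)
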
>>>>>>>>> It's RAID 5, /dev/sdb *also* appears to have died. If I do
>>>>>>>>> mdadm --assemble --force -v /dev/md0 /dev/sd[cefgdh]1
>>>>>>>>> mdadm: looking for devices for /dev/md0
>>>>>>>>> mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 0.
>>>>>>>>> mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot -1.
>>>>>>>>> mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 2.
>>>>>>>>> mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 3.
>>>>>>>>> mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 4.
>>>>>>>>> mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot -1.
>>>>>>>>> mdadm: no uptodate device for slot 1 of /dev/md0
>>>>>>>>> mdadm: added /dev/sde1 to /dev/md0 as 2
>>>>>>>>> mdadm: added /dev/sdf1 to /dev/md0 as 3
>>>>>>>>> mdadm: added /dev/sdg1 to /dev/md0 as 4
>>>>>>>>> mdadm: no uptodate device for slot 5 of /dev/md0
>>>>>>>>> mdadm: added /dev/sdd1 to /dev/md0 as -1
>>>>>>>>> mdadm: added /dev/sdh1 to /dev/md0 as -1
>>>>>>>>> mdadm: added /dev/sdc1 to /dev/md0 as 0
>>>>>>>>> mdadm: /dev/md0 assembled from 4 drives and 2 spares - not
>>>>>>>>> enough to start the array.
>>>>>>>>>
>>>>>>>>> --examine shows me /dev/sdd1 and /dev/sdh1, but both are
>>>>>>>>> listed as spares.
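A quick way to see which role and event count each member thinks it has
(device names taken from the output above; adjust as needed):

   mdadm --examine /dev/sd[cdefgh]1 | grep -E 'Events|Device Role|Array State'

Members whose event counts lag far behind are the ones --assemble --force
has to drag back in.
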
>>>>>>>> Hi Mark,
>>>>>>>> please post the result from
>>>>>>>>
>>>>>>>> cat /sys/block/md0/md/sync_action
>>>>>>>
>>>>>>> There is none. There is no /dev/md0. mdadm refuses, saying
>>>>>>> that it's lost too many drives.
>>>>>>>
>>>>>>>         mark
>>>>>>>
>>>>>>
>>>>>> I suppose your config is 5 drives plus 1 spare, with 1 drive
>>>>>> failed. It's strange that your spare was not used for the resync. Then
>>>>>> you added a new drive, but the array does not start because it marks
>>>>>> the new disk as a spare, leaving you with a RAID 5 of 4 devices and 2
>>>>>> spares.
>>>>>>
>>>>>> First, I hope you have a backup of all your data; don't
>>>>>> run any exotic commands before backing it up. If you can't
>>>>>> back up your data, that's a problem.
>>>>>
>>>>> This is at work. We have automated nightly backups, and I do
>>>>> offline backups of the backups every two weeks.
>>>>>>
>>>>>> Have you tried removing the last added device (sdi1), restarting
>>>>>> the RAID, and forcing a resync?
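Something along these lines, I mean (assuming sdi1 is the newest disk and
the remaining members are the ones from your assemble output; untested):

   mdadm --stop /dev/md0
   mdadm --assemble --force /dev/md0 /dev/sd[cdefgh]1   # leave sdi1 out
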
>>>>>
>>>>> The thing is, it had one? two? spares when /dev/sdb1 started dying,
>>>>> and it didn't use them.
>>>>>>
>>>>>> Have you tried removing these 2 devices and re-adding only the
>>>>>> device that will be useful for the resync? Maybe you can set 5
>>>>>> devices for your RAID instead of 6; if that works (after the resync),
>>>>>> you can add your spare device, growing your RAID set.
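The growing part would look roughly like this once a clean 5-device array
is running again (sdh1 just as the example disk to add):

   mdadm /dev/md0 --add /dev/sdh1            # comes in as a spare first
   mdadm --grow /dev/md0 --raid-devices=6    # reshape it into an active member
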
>>>>>
>>>>> I tried, and that's when I lost it (again); it refuses to
>>>>> assemble/start the RAID: "not enough devices".
>>>>>>
>>>>>> From what I've read on Google, many users run --zero-superblock
>>>>>> before re-adding the device.
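Roughly, on the one disk you are about to re-add (and only on that disk,
since it wipes the md metadata):

   mdadm --zero-superblock /dev/sdi1
   mdadm /dev/md0 --add /dev/sdi1      # comes back in as a fresh spare
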
>>>>>
>>>>> I can take one out, and re-add, but I think I'm going to have to
>>>>> recreate the RAID again, and again restore from backup.
>>>>>>
>>>>>> Other users reassemble the RAID using --assume-clean, but I don't
>>>>>> know what effect it produces.
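For the record, what those users do is a re-create that skips the initial
sync, something like:

   mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=6 <member disks in original order>

It only leaves the data intact if the level, chunk size, metadata version
and device order all match the original array, so it is very much a last
resort.
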
>>>>
>>>> I hope someone can give you better help with this.
>>>>
>>>>
>>>> Update here if you find a solution.
>>>>
>>>>
>>>
>>> Not that I'm into American football, but I seem to have pulled off
>>> what I understand is called a hail-mary: *without* zeroing the
>>> superblocks, I did a create with all six good drives, excluding
>>> /dev/sdb1, and explicitly told it one spare.
>>>
>>> And the array is there, complete with data, with *one* spare, five
>>> good drives, and it's currently rebuilding the spare.
>>>
>>> The last resort worked, though we'll see how long.
>>>
>> So you recreated the array without the faulty device?
>>
> Yep.
> mdadm --create --verbose /dev/md0 --level=5 --raid-devices=6 /dev/sd[cdefgh]1
> 
> It's currently at 2.2% recovered for the extra drive.
> 
>       mark
> 
> 

How many TB?
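
(While it rebuilds, "watch cat /proc/mdstat" or "mdadm --detail /dev/md0"
will show the recovery percentage.)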