I've no idea what happened, but the box I was working on last week has a *second* bad drive. Actually, I'm starting to wonder about that particular hot-swap bay.
Anyway, mdadm --detail shows /dev/sdb1 removed. I've added /dev/sdi1... but I see both /dev/sdh1 and /dev/sdi1 as spares, and have yet to find a reliable way to make either one active.
Actually, I would have expected Linux RAID to replace a failed drive with a spare....
Clues for the poor? I *really* don't want to freak out the user by taking it down, and building yet another array.
mark
On 29/01/19 15:03, mark wrote:
> Clues for the poor? I *really* don't want to freak out the user by taking it down, and building yet another array.
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos
Hi Mark, can you report your RAID configuration, such as the RAID level and RAID devices, and the current status from /proc/mdstat?
Thank you.
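The details requested here can usually be collected with read-only commands. A minimal sketch, assuming the array is /dev/md0 and the members are the partitions named in this thread; md_report is a hypothetical helper name, not part of mdadm:

```shell
# Hypothetical helper: print read-only state for one md array.
# Usage: md_report /dev/md0 /dev/sdc1 /dev/sdd1 ...
md_report() {
    md=${1:-/dev/md0}
    if [ "$#" -gt 0 ]; then shift; fi
    echo "== /proc/mdstat =="
    cat /proc/mdstat 2>/dev/null || echo "(no /proc/mdstat; md driver not loaded?)"
    if command -v mdadm >/dev/null 2>&1; then
        echo "== mdadm --detail $md =="
        mdadm --detail "$md" 2>&1 || true        # fails if the array is not running
        for dev in "$@"; do
            echo "== mdadm --examine $dev =="
            mdadm --examine "$dev" 2>&1 || true  # only reads the member's superblock
        done
    else
        echo "(mdadm not found in PATH)"
    fi
}
```

Run as root, for example as `md_report /dev/md0 /dev/sd[c-i]1`, and paste the whole output into the reply.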
Alessandro Baggi wrote:
> Hi Mark, can you report your RAID configuration, such as the RAID level and RAID devices, and the current status from /proc/mdstat?
Well, nope. I got to the point of rebooting the system (xfs had the RAID volume and wouldn't let go); I also commented out the RAID volume.
It's RAID 5, and /dev/sdb *also* appears to have died. If I do
mdadm --assemble --force -v /dev/md0 /dev/sd[cefgdh]1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot -1.
mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot -1.
mdadm: no uptodate device for slot 1 of /dev/md0
mdadm: added /dev/sde1 to /dev/md0 as 2
mdadm: added /dev/sdf1 to /dev/md0 as 3
mdadm: added /dev/sdg1 to /dev/md0 as 4
mdadm: no uptodate device for slot 5 of /dev/md0
mdadm: added /dev/sdd1 to /dev/md0 as -1
mdadm: added /dev/sdh1 to /dev/md0 as -1
mdadm: added /dev/sdc1 to /dev/md0 as 0
mdadm: /dev/md0 assembled from 4 drives and 2 spares - not enough to start the array.
--examine shows me /dev/sdd1 and /dev/sdh1, but both are spares.
mark
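The verbose output in this message reports no up-to-date device for slots 1 and 5, and the two remaining partitions carry slot -1, which is how spares are listed. RAID 5 can run with at most one member missing, which is why assembly stops. A small sketch that pulls the empty slots out of such output; missing_slots and raid5_can_start are hypothetical helper names, and the parsing assumes the message wording shown in this thread:

```shell
# Hypothetical filter: read `mdadm --assemble -v` output on stdin and
# print the slot numbers for which mdadm found no up-to-date device.
missing_slots() {
    awk '/no uptodate device for slot/ {
             for (i = 1; i <= NF; i++) if ($i == "slot") print $(i + 1)
         }'
}

# RAID 5 survives one missing member; two or more means it cannot start.
# Reads the same output on stdin and succeeds only if at most one slot is empty.
raid5_can_start() {
    n=$(missing_slots | awk 'END { print NR }')
    [ "$n" -le 1 ]
}
```

Usage would be something like `mdadm --assemble --force -v /dev/md0 /dev/sd[cefgdh]1 2>&1 | missing_slots`; fed the output quoted in this message, it prints 1 and 5.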
On 29/01/19 18:47, mark wrote:
> --examine shows me /dev/sdd1 and /dev/sdh1, but both are spares.
Hi Mark, please post the result from
cat /sys/block/md0/md/sync_action
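The sync_action file exists only while the array is assembled, so it helps to check for it before reading. A minimal sketch; md_sync_action is a hypothetical helper, and its second argument exists only so the function can be pointed at a test directory instead of /sys/block:

```shell
# Hypothetical helper: print the sync action of an md array, or say
# why it is unavailable and return non-zero.
md_sync_action() {
    name=${1:-md0}
    sysroot=${2:-/sys/block}
    f="$sysroot/$name/md/sync_action"
    if [ -r "$f" ]; then
        cat "$f"    # typical values: idle, resync, recover, check, repair
    else
        echo "$name is not assembled (no $f)"
        return 1
    fi
}
```

Calling `md_sync_action md0` on a box where assembly failed takes the second branch, since no md0 entry exists under /sys/block.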
Alessandro Baggi wrote:
> Hi Mark, please post the result from
> cat /sys/block/md0/md/sync_action
There is none. There is no /dev/md0. mdadm refuses, saying that it's lost too many drives.
mark
On 29/01/19 20:42, mark wrote:
> There is none. There is no /dev/md0. mdadm refuses, saying that it's lost too many drives.
I suppose that your config is 5 drives and 1 spare, with 1 drive failed. It's strange that your spare was not used for a resync. Then you added a new drive, but the array does not start because the new disk is marked as a spare, so you have a RAID 5 with 4 devices and 2 spares.
First, I hope that you have a backup of all your data; don't run any exotic commands before backing it up. If you can't back up your data, that's a problem.
Have you tried removing the last added device, sdi1, then restarting the RAID and forcing a resync?
Have you tried removing these 2 devices and re-adding only the device that will be useful for the resync? Maybe you can set 5 devices for your RAID instead of 6; if that works (after the resync), you can add your spare device by growing your RAID set.
Searching on Google, I see that many users run --zero-superblock before re-adding the device.
Other users reassemble the RAID using --assume-clean, but I don't know what effect that will produce.
I hope this helps.
On 01/30/19 03:45, Alessandro Baggi wrote:
> I suppose that your config is 5 drives and 1 spare, with 1 drive failed. It's strange that your spare was not used for a resync. Then you added a new drive, but the array does not start because the new disk is marked as a spare, so you have a RAID 5 with 4 devices and 2 spares.
> First, I hope that you have a backup of all your data; don't run any exotic commands before backing it up. If you can't back up your data, that's a problem.
This is at work. We have automated nightly backups, and I do offline backups of the backups every two weeks.
> Have you tried removing the last added device, sdi1, then restarting the RAID and forcing a resync?
The thing is, it had one? two? spares when /dev/sdb1 started dying, and it didn't use them.
> Have you tried removing these 2 devices and re-adding only the device that will be useful for the resync? Maybe you can set 5 devices for your RAID instead of 6; if that works (after the resync), you can add your spare device by growing your RAID set.
I tried, and that's when I lost it (again), and it refuses to assemble/start the RAID: "not enough devices".
> Searching on Google, I see that many users run --zero-superblock before re-adding the device.
I can take one out, and re-add, but I think I'm going to have to recreate the RAID again, and again restore from backup.
> Other users reassemble the RAID using --assume-clean, but I don't know what effect that will produce.
> I hope this helps.
Thanks.
mark
On 30/01/19 14:02, mark wrote:
> I can take one out, and re-add, but I think I'm going to have to recreate the RAID again, and again restore from backup.
I hope that someone can give you better help with this.
Please post an update here if you find a solution.
Alessandro Baggi wrote:
> I hope that someone can give you better help with this.
> Please post an update here if you find a solution.
Not that I'm into American football, but I seem to have pulled off what I understand is called a hail-mary: *without* zeroing the superblocks, I did a create with all six good drives, excluding /dev/sdb1, and explicitly told it one spare.
And the array is there, complete with data, with *one* spare, five good drives, and it's currently rebuilding the spare.
The last resort worked, though we'll see for how long.
mark
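Re-creating an array over existing members keeps the data only if the new array has the same geometry as the old one: level, number of devices, device order, chunk size, layout, metadata version and data offset. Before trying it, it is worth saving what `mdadm --examine` reports for each member. A sketch of a filter for the relevant lines; examine_geometry is a hypothetical name, and the field labels are the ones mdadm prints for version-1.x metadata as far as I know:

```shell
# Hypothetical filter: keep only the geometry-related lines from
# `mdadm --examine` output read on stdin.
examine_geometry() {
    grep -E '^ *(Version|Raid Level|Raid Devices|Chunk Size|Layout|Data Offset|Device Role) *:'
}
```

For example, `mdadm --examine /dev/sdc1 | examine_geometry > sdc1.geometry` for each member leaves a record that can be compared against the array after the create.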
On 30/01/19 16:33, mark wrote:
> Not that I'm into American football, but I seem to have pulled off what I understand is called a hail-mary: *without* zeroing the superblocks, I did a create with all six good drives, excluding /dev/sdb1, and explicitly told it one spare.
So you have recreated the array without the faulty device?
Alessandro Baggi wrote:
> So you have recreated the array without the faulty device?
Yep.
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=6 /dev/sd[cdefgh]1
It's currently at 2.2% recovered for the extra drive.
mark
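While a rebuild runs, /proc/mdstat carries a progress line containing text such as "recovery = 2.2%". A sketch of a filter that extracts the percentage; md_progress is a hypothetical name, and the pattern assumes the usual mdstat wording:

```shell
# Hypothetical filter: read /proc/mdstat-style text on stdin and print
# the percentage from a "recovery = N%" or "resync = N%" progress line.
md_progress() {
    sed -nE 's/.*(recovery|resync) *= *([0-9.]+)%.*/\2/p'
}
```

`md_progress < /proc/mdstat` prints nothing once the array is idle, which makes it easy to poll from a loop or a cron job.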
On 30/01/19 18:49, mark wrote:
I suppose that your config is 5 drives and 1 spare, with 1 drive failed. It's strange that your spare was not used for the resync. Then you added a new drive, but the array does not start because the new disk is marked as a spare, and you have a RAID5 with 4 devices and 2 spares.
First, I hope that you have a backup of all your data; don't run any exotic commands before backing it up. If you can't back up your data, that's a problem.
This is at work. We have automated nightly backups, and I do offline backups of the backups every two weeks.
Have you tried to remove the last added device, sdi1, restart the RAID, and force a resync?
The thing is, it had one? two? spares when /dev/sdb1 started dying, and it didn't use them.
Have you tried to remove these 2 devices and re-add only the device that will be useful for the resync? Maybe you can set 5 devices for your RAID and not 6; if it works (after the resync), you can add your spare device by growing your RAID set.
I tried, and that's when I lost it (again), and it refuses to assemble/start the RAID: "not enough devices".
Reading on Google, many users run --zero-superblock before re-adding the device.
I can take one out, and re-add, but I think I'm going to have to recreate the RAID again, and again restore from backup.
Other users reassemble the RAID using --assume-clean, but I don't know what effect it will produce.
I hope that someone can give you better help with this.
Update here if you get a solution.
Not that I'm into American football, but I seem to have pulled off what I understand is called a hail-mary: *without* zeroing the superblocks, I did a create with all six good drives, excluding /dev/sdb1, and explicitly told it one spare.
And the array is there, complete with data, with *one* spare, five good drives, and it's currently rebuilding the spare.
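A note of caution for anyone tempted to repeat this: mdadm --create writes fresh superblocks, so the data should only survive if the level, device order, chunk size, layout and data offset all match the old array. Here is a minimal sketch for saving what the old superblocks say before trying it, assuming the mdadm --examine output is enough of a record. save_superblocks and the MDADM override are hypothetical names for illustration, not something described in this thread.

```shell
# Save the md superblock details of each member device into a directory.
# MDADM is a hypothetical override so the function can be dry-run
# (for example MDADM=echo); by default the real mdadm is called.
save_superblocks() {
    outdir=$1; shift
    mkdir -p "$outdir" || return 1
    for dev in "$@"; do
        # /dev/sdc1 -> <outdir>/sdc1.examine
        "${MDADM:-mdadm}" --examine "$dev" \
            > "$outdir/$(basename "$dev").examine" || return 1
    done
}
```

For the array discussed here that might look like: save_superblocks /root/md0-superblocks /dev/sd[cdefgh]1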
How many TB?
On 01/30/19 03:45, Alessandro Baggi wrote:
On 29/01/19 20:42, mark wrote:
Alessandro Baggi wrote:
On 29/01/19 18:47, mark wrote:
Alessandro Baggi wrote:
On 29/01/19 15:03, mark wrote:
I've no idea what happened, but the box I was working on last week has a *second* bad drive. Actually, I'm starting to wonder about that particular hot-swap bay.
Anyway, mdadm --detail shows /dev/sdb1 removed. I've added /dev/sdi1... but see both /dev/sdh1 and /dev/sdi1 as spare, and have yet to find a reliable way to make either one active.
Actually, I would have expected the Linux RAID to replace a failed one with a spare....
Can you report your RAID configuration, such as the RAID level and RAID devices, and the current status from /proc/mdstat?
Well, nope. I got to the point of rebooting the system (xfs had the RAID volume, and wouldn't let go); I also commented out the RAID volume.
It's RAID 5, /dev/sdb *also* appears to have died. If I do
mdadm --assemble --force -v /dev/md0 /dev/sd[cefgdh]1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot -1.
mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot -1.
mdadm: no uptodate device for slot 1 of /dev/md0
mdadm: added /dev/sde1 to /dev/md0 as 2
mdadm: added /dev/sdf1 to /dev/md0 as 3
mdadm: added /dev/sdg1 to /dev/md0 as 4
mdadm: no uptodate device for slot 5 of /dev/md0
mdadm: added /dev/sdd1 to /dev/md0 as -1
mdadm: added /dev/sdh1 to /dev/md0 as -1
mdadm: added /dev/sdc1 to /dev/md0 as 0
mdadm: /dev/md0 assembled from 4 drives and 2 spares - not enough to start the array.
--examine shows me /dev/sdd1 and /dev/sdh1, but both as spares.
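The assemble output reports two slots (1 and 5) with no up-to-date device, and a RAID5 can run with at most one member missing, which would explain why mdadm gives up. A small shell sketch of that check follows, assuming the log is fed on stdin one message per line, as mdadm prints it. missing_slots and raid5_can_start are illustrative names only.

```shell
# Print the slot numbers for which mdadm found no up-to-date member.
# Reads "mdadm --assemble -v" output on stdin, one message per line.
missing_slots() {
    sed -n 's/.*no uptodate device for slot \([0-9][0-9]*\) of.*/\1/p'
}

# Succeed if a RAID5 with this many missing members can still start.
# RAID5 holds one disk's worth of parity, so it tolerates one missing member.
raid5_can_start() {
    [ "$1" -le 1 ]
}
```

With the log above saved in a file (assemble.log is a made-up name), "missing_slots < assemble.log" lists the empty slots, and passing their count to raid5_can_start tells whether a degraded start is even possible.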
Hi Mark, please post the result from
cat /sys/block/md0/md/sync_action
There is none. There is no /dev/md0. mdadm refuses, saying that it's lost too many drives.
mark
For many years now I've only been doing RAID1, because it's just safer than RAID5 and easier than RAID6 if the number of disks is low.
I also don't have much experience with spare handling, as I don't use spares in my scenarios.
However, in general, I think the problem today is this: we have very large disks these days. Defects on a disk are often not found for a long time. Even with raid-check, I think it doesn't find errors which only happen while writing, not while reading.
So now, if one disk fails, things are still okay. Then, when a spare is in place or the defective disk has been replaced, the resync starts. Now, if there is any error on one of the old disks while the resync happens, boom, the array fails and is left in a bad state.
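For reference, raid-check on CentOS is, as far as I know, a cron-driven script that asks md to run a "check" pass through the sync_action file in sysfs. Below is a minimal sketch of doing the same by hand and reading the mismatch counter afterwards. The function names and the optional second argument (an alternative sysfs root, only there so the functions can be tried against a scratch directory) are assumptions for illustration.

```shell
# Ask md to start a read-only consistency check of the given array.
# $1 = array name (e.g. md0), $2 = sysfs root (defaults to /sys/block).
md_start_check() {
    md=$1; root=${2:-/sys/block}
    echo check > "$root/$md/md/sync_action"
}

# Print the mismatch count recorded by the most recent check.
md_mismatches() {
    md=$1; root=${2:-/sys/block}
    cat "$root/$md/md/mismatch_cnt"
}
```

On a live system this needs root: md_start_check md0, then md_mismatches md0 once /proc/mdstat shows the check has finished.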
I once had to recover a broken RAID5 from some Linux-based NAS, and what I did was:
- Dump the complete RAID partition from every disk to a file, ignoring the read errors on one of the disks.
- Build the RAID5 like this:
mdadm --create --assume-clean --level=5 --raid-devices=4 --spare-devices=0 \
      --metadata=1.0 --layout=left-symmetric --chunk=64 --bitmap=none \
      /dev/md10 /dev/loop0 missing /dev/loop2 /dev/loop3
- Recover 99.9% of the data from /dev/md10.
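The dump-and-loop steps could be sketched roughly as below. This is not necessarily how it was done in the case described; it is a minimal version assuming plain dd and losetup, with made-up function names. With conv=noerror,sync, a block that cannot be read is replaced by zeros so that offsets in the image stay aligned; note that a short final block is padded to the block size as well. A dedicated tool such as GNU ddrescue is usually the more robust choice for a failing disk.

```shell
# Copy a member partition (or any file) to an image file, continuing past
# read errors. $1 = source device or file, $2 = image file to write.
image_member() {
    src=$1; img=$2
    dd if="$src" of="$img" bs=64k conv=noerror,sync 2>/dev/null
}

# Attach an image to the first free loop device and print the device name
# (needs root). The printed /dev/loopN can then be given to mdadm.
attach_image() {
    losetup --find --show "$1"
}
```

For example: image_member /dev/sdb3 /dumps/disk1.img, then attach_image on a working copy of that image (both paths are placeholders).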
One more hint for those interested: even with RAID1, I don't use the whole disk as one big RAID1. Instead, I slice it into equally sized parts - not physically :-) - and create multiple smaller RAID1 arrays on it. If a disk is 8TB, I create 8 partitions of 1TB and then create 8 RAID1 arrays on them. Then I add all 8 arrays to the same VG. Now, if there is a small error in, say, disk 3, only a 1TB slice of the whole 8TB is degraded. In large arrays you can even keep some spare slices on a spare disk to temporarily move broken slices. You get the idea, right?
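To make the slicing idea concrete, here is a small sketch that only prints the commands such a layout would need, so they can be read before anything is run. The function name, the VG name and the assumption that partitions are named by appending the slice number to the disk name (true for /dev/sdX, not for NVMe devices) are all illustrative.

```shell
# Print (do not run) the commands for N RAID1 slices across two disks,
# all gathered into one LVM volume group.
# $1, $2 = the two disks, $3 = number of slices, $4 = volume group name.
plan_sliced_raid1() {
    disk_a=$1; disk_b=$2; slices=$3; vg=$4
    pvs=""
    i=1
    while [ "$i" -le "$slices" ]; do
        echo "mdadm --create /dev/md$i --level=1 --raid-devices=2 $disk_a$i $disk_b$i"
        pvs="$pvs /dev/md$i"
        i=$((i + 1))
    done
    echo "pvcreate$pvs"
    echo "vgcreate $vg$pvs"
}
```

For the 8TB example that would be: plan_sliced_raid1 /dev/sda /dev/sdb 8 vg_data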
Hope that helps, Simon
On 30/01/19 16:49, Simon Matter wrote:
For many years now I've only been doing RAID1, because it's just safer than RAID5 and easier than RAID6 if the number of disks is low.
Like you, I always run RAID1, but for the last year I have run a RAID5 with 3TB WD Red disks for my personal backup server, and have never had an error so far.
What about RAID10 vs RAID5 or RAID6? You lose half the capacity, but it performs like RAID5 and is as reliable as RAID1.
Have you tried other types of RAID, like RAID50 or RAID60?
About the resync process: are all RAID levels disk killers during this procedure, or is only RAID5 (and similar) a disk killer?
Why not recover directly from backup? That would save time. And in your mdadm --create command, why did you use /dev/loopN devices?
About the sliced RAID1 configuration: if you have 2 disks and create 8 RAID1 arrays on these two disks, won't you lose performance? As you said, if you get a bad error in a single partition you save the other data, but if one disk fails totally you have the same problem; in addition, you need to recreate 8 partitions and resync 8 RAID1 arrays. This could require more recovery time and invite more human error.
Alessandro Baggi wrote:
On 30/01/19 16:49, Simon Matter wrote:
On 01/30/19 03:45, Alessandro Baggi wrote:
<MVNCH>
I also don't have much experience with spare handling, as I don't use spares in my scenarios.
However, in general, I think the problem today is this: we have very large disks these days. Defects on a disk are often not found for a long time. Even with raid-check, I think it doesn't find errors which only happen while writing, not while reading.
So now, if one disk fails, things are still okay. Then, when a spare is in place or the defective disk has been replaced, the resync starts. Now, if there is any error on one of the old disks while the resync happens, boom, the array fails and is left in a bad state.
<snip>
One more hint for those interested: even with RAID1, I don't use the whole disk as one big RAID1. Instead, I slice it into equally sized parts - not physically :-) - and create multiple smaller RAID1 arrays on it. If a disk is 8TB, I create 8 partitions of 1TB and then create 8 RAID1 arrays on them. Then I add all 8 arrays to the same VG. Now, if there is a small error in, say, disk 3, only a 1TB slice of the whole 8TB is degraded. In large arrays you can even keep some spare slices on a spare disk to temporarily move broken slices. You get the idea, right?
About this type of configuration: if you have 2 disks and create 8 RAID1 arrays on these two disks, won't you lose performance? As you said, if you get a bad error in a single partition you save the other data, but if one disk fails totally you have the same problem; in addition, you need to recreate 8 partitions and resync 8 RAID1 arrays. This could require more recovery time and invite more human error.
Not anything I can do. We have users with terabytes of data. We *need* large RAIDS. RAID 1 for root, sure, but nothing else. This specific RAID was unusual, for this user. Normally, for the last five or six or so years, we do RAID 6.
Should I mention the RAID 6 we have that's 153TB, and 27% full?
mark
On 30/01/19 16:49, Simon Matter wrote:
What about RAID10 vs RAID5 or RAID6? You lose half the capacity, but it performs like RAID5 and is as reliable as RAID1.
I did RAID10 in the past but don't do it now. If you do large linear reads/writes, RAID10 may perform better; if you have lots of independent, random reads/writes, RAID1 may perform better. It really depends a lot on how the disks are used.
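On the capacity side of that question, the usual rule of thumb is easy to put into a few lines of shell. This is a simplified sketch that counts usable capacity in whole disks and assumes equally sized disks, plus an even disk count and two copies for RAID10. raid_usable_disks is an illustrative name.

```shell
# Print how many disks' worth of space is usable for a given RAID level.
# $1 = RAID level (1, 5, 6 or 10), $2 = number of equally sized disks.
raid_usable_disks() {
    level=$1; n=$2
    case $level in
        1)  echo 1 ;;               # every disk holds the same data
        5)  echo $((n - 1)) ;;      # one disk's worth of parity
        6)  echo $((n - 2)) ;;      # two disks' worth of parity
        10) echo $((n / 2)) ;;      # two copies of everything
        *)  echo "unsupported level: $level" >&2; return 1 ;;
    esac
}
```

With six disks, for example, "raid_usable_disks 10 6" and "raid_usable_disks 5 6" show the space given up by choosing RAID10 over RAID5.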
Have you tried other types of RAID, like RAID50 or RAID60?
Yes, I did in the past, but it adds even more complexity than I like.
About the resync process: are all RAID levels disk killers during this procedure, or is only RAID5 (and similar) a disk killer?
I wouldn't call it a disk killer; it's more that it detects disk errors rather than producing them.
Why not recover directly from backup? That would save time. And in your mdadm --create command, why did you use /dev/loopN devices?
In that case, the owner of the NAS was a photographer who had all his past work on the NAS with no real backup :-(
What I did in that case was to dump all data from all disks of the array to files. Then I made copies of the original dump files to work with. I didn't want to touch the disks more than needed.
About the sliced RAID1 configuration: if you have 2 disks and create 8 RAID1 arrays on these two disks, won't you lose performance?
Performance is the same, with maybe 0.1% overhead.
As you said, if you get a bad error in a single partition you save the other data, but if one disk fails totally you have the same problem.
That's true, but in almost three decades of work with hard disks, complete disk failures were rarely seen.
In addition, you need to recreate 8 partitions and resync 8 RAID1 arrays. This could require more recovery time and invite more human error.
That's true about human errors. But in this case, I usually create small scripts to do it, and I really look at those scripts very carefully before I run them :-)
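A script of that kind might look like the sketch below. It is only a guess at the general shape, not the scripts mentioned here. It prints the commands for review instead of running them, and assumes the replacement disk gets the same partition layout as the surviving one and that slice i belongs to /dev/mdi. plan_readd_slices is a made-up name.

```shell
# Print (do not run) the commands to bring a replacement disk back into
# N sliced RAID1 arrays. $1 = surviving disk, $2 = new disk, $3 = slices.
plan_readd_slices() {
    good=$1; new=$2; slices=$3
    # Copy the partition table from the surviving disk to the new one.
    echo "sfdisk -d $good | sfdisk $new"
    i=1
    while [ "$i" -le "$slices" ]; do
        # Add slice i of the new disk to its array; md then resyncs it.
        echo "mdadm --manage /dev/md$i --add $new$i"
        i=$((i + 1))
    done
}
```

The output can be read through carefully and, once it looks right, piped to sh.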
Regards, Simon
On 31/01/19 07:34, Simon Matter wrote:
Hi Simon, thank you for your reply.
Best regards, Alessandro.