hi all!
back in Aug several of you assisted me in solving a problem where one of my drives had dropped out of (or been kicked out of) the raid1 array.
something vaguely similar appears to have happened just a few mins ago, upon rebooting after a small update. I received four emails like this, one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for /dev/md126:
Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
X-Spambayes-Classification: unsure; 0.24
Status: RO
Content-Length: 564
Lines: 23
This is an automatically generated mail message from mdadm running on fcshome.stoneham.ma.us
A DegradedArray event had been detected on md device /dev/md125.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md0 : active raid1 sda1[0] 104320 blocks [2/1] [U_]
md126 : active raid1 sdb1[1] 104320 blocks [2/1] [_U]
md125 : active raid1 sdb2[1] 312464128 blocks [2/1] [_U]
md1 : active raid1 sda2[0] 312464128 blocks [2/1] [U_]
unused devices: <none>
firstly, what the heck are md125 and md126? previously there were only md0 and md1....
secondly, I'm not sure what it's trying to tell me. it says there was a "degradedarray event" but at the bottom it says there are no unused devices.
there are also some messages in /var/log/messages from the time of the boot earlier today, but they do NOT say anything about "kicking out" any of the md member devices (as they did in the event back in August):
Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
Oct 19 18:29:41 fcshome kernel: md: autorun ...
Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: adding sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different superblock to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: created md125
Oct 19 18:29:41 fcshome kernel: md: bind<sdb2>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: adding sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different superblock to sdb1
Oct 19 18:29:41 fcshome kernel: md: created md126
Oct 19 18:29:41 fcshome kernel: md: bind<sdb1>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
Oct 19 18:29:41 fcshome kernel: md: adding sda2 ...
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
Oct 19 18:29:41 fcshome kernel: md: created md1
Oct 19 18:29:41 fcshome kernel: md: bind<sda2>
Oct 19 18:29:41 fcshome kernel: md: running: <sda2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
Oct 19 18:29:41 fcshome kernel: md: adding sda1 ...
Oct 19 18:29:41 fcshome kernel: md: created md0
Oct 19 18:29:41 fcshome kernel: md: bind<sda1>
Oct 19 18:29:41 fcshome kernel: md: running: <sda1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.
and here's /etc/mdadm.conf:
# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR fredex
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=4eb13e45:b5228982:f03cd503:f935bd69
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12
which doesn't say anything about md125 or md126,... might they be some kind of detritus or fragments left over from whatever kind of failure caused the array to become degraded?
do ya suppose a boot from power-off might somehow give it a whack upside the head so it'll reassemble itself according to mdadm.conf?
I'm not sure which devices need to be failed and re-added to make it clean again (which is all I had to do when I had the aforementioned earlier problem.)
Thanks in advance for any advice!
Fred
fred smith wrote:
hi all!
back in Aug several of you assisted me in solving a problem where one of my drives had dropped out of (or been kicked out of) the raid1 array.
something vaguely similar appears to have happened just a few mins ago, upon rebooting after a small update. I received four emails like this, one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for /dev/md126:
Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
X-Spambayes-Classification: unsure; 0.24
Status: RO
Content-Length: 564
Lines: 23
This is an automatically generated mail message from mdadm running on fcshome.stoneham.ma.us
A DegradedArray event had been detected on md device /dev/md125.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md0 : active raid1 sda1[0] 104320 blocks [2/1] [U_]
md126 : active raid1 sdb1[1] 104320 blocks [2/1] [_U]
md125 : active raid1 sdb2[1] 312464128 blocks [2/1] [_U]
md1 : active raid1 sda2[0] 312464128 blocks [2/1] [U_]
unused devices: <none>
firstly, what the heck are md125 and md126? previously there were only md0 and md1....
secondly, I'm not sure what it's trying to tell me. it says there was a "degradedarray event" but at the bottom it says there are no unused devices.
there are also some messages in /var/log/messages from the time of the boot earlier today, but they do NOT say anything about "kicking out" any of the md member devices (as they did in the event back in August):
Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
Oct 19 18:29:41 fcshome kernel: md: autorun ...
Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: adding sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different superblock to sdb2
This appears to be the cause
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: created md125
this was auto-created - I've not experienced this myself, and I run half a dozen of these arrays on different machines.
Oct 19 18:29:41 fcshome kernel: md: bind<sdb2>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
now it has started it separately as its own array
Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: adding sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different superblock to sdb1
and now for the second one
Oct 19 18:29:41 fcshome kernel: md: created md126
Oct 19 18:29:41 fcshome kernel: md: bind<sdb1>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
Oct 19 18:29:41 fcshome kernel: md: adding sda2 ...
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
Oct 19 18:29:41 fcshome kernel: md: created md1
Oct 19 18:29:41 fcshome kernel: md: bind<sda2>
Oct 19 18:29:41 fcshome kernel: md: running: <sda2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
Oct 19 18:29:41 fcshome kernel: md: adding sda1 ...
Oct 19 18:29:41 fcshome kernel: md: created md0
Oct 19 18:29:41 fcshome kernel: md: bind<sda1>
Oct 19 18:29:41 fcshome kernel: md: running: <sda1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.
and here's /etc/mdadm.conf:
# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR fredex
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=4eb13e45:b5228982:f03cd503:f935bd69
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12
which doesn't say anything about md125 or md126,... might they be some kind of detritus or fragments left over from whatever kind of failure caused the array to become degraded?
now you need to decide (by looking at each device; you may need to mount it first) which is the correct master. remove the other one and add it back to the original array - it will then rebuild. If these are SATA drives, also check the cables - I have one machine where they work loose and cause failures.
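A minimal sketch of that comparison, assuming the sda/sdb partition names from the mdstat output above (the exact --examine field labels vary between mdadm versions):

mdadm --examine /dev/sda2 | grep -E 'UUID|Events|Update Time'
mdadm --examine /dev/sdb2 | grep -E 'UUID|Events|Update Time'

The member with the higher event count (or newer update time) normally holds the current data; the other half is the one to remove and re-add.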
do ya suppose a boot from power-off might somehow give it a whack upside the head so it'll reassemble itself according to mdadm.conf?
doubt it - see the above dmesg.
I'm not sure which devices need to be failed and re-added to make it clean again (which is all I had to do when I had the aforementioned earlier problem.)
Thanks in advance for any advice!
Fred
On Tue, Oct 19, 2010 at 7:59 PM, fred smith fredex@fcshome.stoneham.ma.us wrote:
back in Aug several of you assisted me in solving a problem where one of my drives had dropped out of (or been kicked out of) the raid1 array.
something vaguely similar appears to have happened just a few mins ago, upon rebooting after a small update. I received four emails like this, one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for /dev/md126:
Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
X-Spambayes-Classification: unsure; 0.24
Status: RO
Content-Length: 564
Lines: 23
This is an automatically generated mail message from mdadm running on fcshome.stoneham.ma.us
A DegradedArray event had been detected on md device /dev/md125.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md0 : active raid1 sda1[0] 104320 blocks [2/1] [U_]
md126 : active raid1 sdb1[1] 104320 blocks [2/1] [_U]
md125 : active raid1 sdb2[1] 312464128 blocks [2/1] [_U]
md1 : active raid1 sda2[0] 312464128 blocks [2/1] [U_]
unused devices: <none>
firstly, what the heck are md125 and md126? previously there were only md0 and md1....
secondly, I'm not sure what it's trying to tell me. it says there was a "degradedarray event" but at the bottom it says there are no unused devices.
there are also some messages in /var/log/messages from the time of the boot earlier today, but they do NOT say anything about "kicking out" any of the md member devices (as they did in the event back in August):
Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
Oct 19 18:29:41 fcshome kernel: md: autorun ...
Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: adding sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different superblock to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: created md125
Oct 19 18:29:41 fcshome kernel: md: bind<sdb2>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: adding sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different superblock to sdb1
Oct 19 18:29:41 fcshome kernel: md: created md126
Oct 19 18:29:41 fcshome kernel: md: bind<sdb1>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
Oct 19 18:29:41 fcshome kernel: md: adding sda2 ...
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
Oct 19 18:29:41 fcshome kernel: md: created md1
Oct 19 18:29:41 fcshome kernel: md: bind<sda2>
Oct 19 18:29:41 fcshome kernel: md: running: <sda2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
Oct 19 18:29:41 fcshome kernel: md: adding sda1 ...
Oct 19 18:29:41 fcshome kernel: md: created md0
Oct 19 18:29:41 fcshome kernel: md: bind<sda1>
Oct 19 18:29:41 fcshome kernel: md: running: <sda1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.
and here's /etc/mdadm.conf:
# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR fredex
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=4eb13e45:b5228982:f03cd503:f935bd69
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12
which doesn't say anything about md125 or md126,... might they be some kind of detritus or fragments left over from whatever kind of failure caused the array to become degraded?
The superblocks in sdb1 and sdb2 are different from the superblocks in sda1 and sda2, so mdadm assembled sdb1 and sdb2 into different arrays. I'd have expected them to be md126 and md127, not md125 and md126, but that's normal.
Your problem is that all four arrays are degraded.
Which ones are mounted? Assuming that you're running off the drives with the most recent changes and updates, you'll have to stop the two unused arrays, zero the superblocks, and add them to the running arrays.
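For example, if it turns out the sda halves are current and the sdb halves are stale (verify that first!), a hedged sketch of that sequence, using the device names from the mdstat above, might be:

mdadm --stop /dev/md125               # stop the stray array holding sdb2
mdadm --stop /dev/md126               # stop the stray array holding sdb1
mdadm --zero-superblock /dev/sdb1     # destroys the old md metadata - be sure first
mdadm --zero-superblock /dev/sdb2
mdadm /dev/md0 --add /dev/sdb1        # re-add to the surviving arrays
mdadm /dev/md1 --add /dev/sdb2

A full resync of each mirror follows the adds; /proc/mdstat shows the progress.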
fred smith wrote:
hi all!
back in Aug several of you assisted me in solving a problem where one of my drives had dropped out of (or been kicked out of) the raid1 array.
something vaguely similar appears to have happened just a few mins ago, upon rebooting after a small update. I received four emails like this, one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for /dev/md126:
Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
X-Spambayes-Classification: unsure; 0.24
Status: RO
Content-Length: 564
Lines: 23
This is an automatically generated mail message from mdadm running on fcshome.stoneham.ma.us
A DegradedArray event had been detected on md device /dev/md125.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md0 : active raid1 sda1[0] 104320 blocks [2/1] [U_]
md126 : active raid1 sdb1[1] 104320 blocks [2/1] [_U]
md125 : active raid1 sdb2[1] 312464128 blocks [2/1] [_U]
md1 : active raid1 sda2[0] 312464128 blocks [2/1] [U_]
unused devices: <none>
firstly, what the heck are md125 and md126? previously there were only md0 and md1....
secondly, I'm not sure what it's trying to tell me. it says there was a "degradedarray event" but at the bottom it says there are no unused devices.
there are also some messages in /var/log/messages from the time of the boot earlier today, but they do NOT say anything about "kicking out" any of the md member devices (as they did in the event back in August):
Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
Oct 19 18:29:41 fcshome kernel: md: autorun ...
Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: adding sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different superblock to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: created md125
Oct 19 18:29:41 fcshome kernel: md: bind<sdb2>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: adding sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different superblock to sdb1
Oct 19 18:29:41 fcshome kernel: md: created md126
Oct 19 18:29:41 fcshome kernel: md: bind<sdb1>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
Oct 19 18:29:41 fcshome kernel: md: adding sda2 ...
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
Oct 19 18:29:41 fcshome kernel: md: created md1
Oct 19 18:29:41 fcshome kernel: md: bind<sda2>
Oct 19 18:29:41 fcshome kernel: md: running: <sda2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
Oct 19 18:29:41 fcshome kernel: md: adding sda1 ...
Oct 19 18:29:41 fcshome kernel: md: created md0
Oct 19 18:29:41 fcshome kernel: md: bind<sda1>
Oct 19 18:29:41 fcshome kernel: md: running: <sda1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.
and here's /etc/mdadm.conf:
# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR fredex
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=4eb13e45:b5228982:f03cd503:f935bd69
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12
which doesn't say anything about md125 or md126,... might they be some kind of detritus or fragments left over from whatever kind of failure caused the array to become degraded?
do ya suppose a boot from power-off might somehow give it a whack upside the head so it'll reassemble itself according to mdadm.conf?
I'm not sure which devices need to be failed and re-added to make it clean again (which is all I had to do when I had the aforementioned earlier problem.)
Thanks in advance for any advice!
Fred
I've seen this kind of thing happen when the autodetection stuff misbehaves. I'm not sure why it does this or how to prevent it. Anyway, to recover, I would use something like:
mdadm --stop /dev/md125
mdadm --stop /dev/md126
If for some reason the above commands fail, check and make sure it has not automounted the file systems from md125 and md126. Hopefully this won't happen.
Then use:
mdadm /dev/md0 -a /dev/sdXX
to add back the drive which belongs in md0, and similarly for md1. In general, it won't let you add the wrong drive, but if you want to check, use:
mdadm --examine /dev/sda1 | grep UUID
and so forth for all your drives, and find the ones with the same UUID.
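A quick way to run that check over all four partitions at once (just a sketch, using the sda/sdb names from the mdstat above):

for part in /dev/sda1 /dev/sda2 /dev/sdb1 /dev/sdb2; do
    echo "== $part =="
    mdadm --examine "$part" | grep -i uuid
done

Partitions reporting the same UUID belong in the same array.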
When I create my RAID arrays, I always use the option --bitmap=internal. With this option set, a bitmap is used to keep track of which pages on the drive are out of date, so you only resync the pages which need updating instead of recopying the whole drive when this happens. In the past I once added a bitmap to an existing raid1 array using something like this (this may not be the exact command, but I know it can be done):
mdadm /dev/mdN --bitmap=internal
Adding the bitmap is very worthwhile and saves time and risk of data loss by not having to recopy the whole partition.
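For what it's worth, with a reasonably current mdadm the documented way to add a bitmap to an existing array is via --grow; a sketch, assuming the array (here /dev/md1) is clean:

mdadm --grow /dev/md1 --bitmap=internal    # add a write-intent bitmap in place
mdadm --detail /dev/md1 | grep -i bitmap   # confirm it took effect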
Nataraj
Nataraj wrote:
I've seen this kind of thing happen when the autodetection stuff misbehaves. I'm not sure why it does this or how to prevent it. Anyway, to recover, I would use something like:
mdadm --stop /dev/md125
mdadm --stop /dev/md126
If for some reason the above commands fail, check and make sure it has not automounted the file systems from md125 and md126. Hopefully this won't happen.
Then use:
mdadm /dev/md0 -a /dev/sdXX
to add back the drive which belongs in md0, and similarly for md1. In general, it won't let you add the wrong drive, but if you want to check, use:
mdadm --examine /dev/sda1 | grep UUID
and so forth for all your drives, and find the ones with the same UUID.
When I create my RAID arrays, I always use the option --bitmap=internal. With this option set, a bitmap is used to keep track of which pages on the drive are out of date, so you only resync the pages which need updating instead of recopying the whole drive when this happens. In the past I once added a bitmap to an existing raid1 array using something like this (this may not be the exact command, but I know it can be done):
mdadm /dev/mdN --bitmap=internal
Adding the bitmap is very worthwhile and saves time and risk of data loss by not having to recopy the whole partition.
Nataraj
mdadm /dev/mdN --assemble --force could also be useful, though I would be careful here. To use this, you would have to stop all of the arrays and then reassemble. You could also specify the specific drives. If you don't have a backup, you might want to back up the single drives that are properly mounted from md0 and md1. Data loss is always a possibility with these types of manipulations, though I have successfully recovered from things like this without losing any data. In fact I pull drives out of a raid array and add new drives in daily to sync them and send the second drive off site as a backup.
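A sketch of what that forced reassembly might look like for one of the pairs, assuming md1 is built from sda2 and sdb2 as in the mdstat above (every array using those members must be stopped first):

mdadm --stop /dev/md1
mdadm --stop /dev/md125                  # the stray array holding sdb2
mdadm --assemble --force /dev/md1 /dev/sda2 /dev/sdb2
cat /proc/mdstat                         # watch the resync progress

Double-check which partitions you pass in; --force will use whatever you give it.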
Nataraj
On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:
fred smith wrote:
hi all!
back in Aug several of you assisted me in solving a problem where one of my drives had dropped out of (or been kicked out of) the raid1 array.
something vaguely similar appears to have happened just a few mins ago, upon rebooting after a small update. I received four emails like this, one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for /dev/md126:
Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
X-Spambayes-Classification: unsure; 0.24
Status: RO
Content-Length: 564
Lines: 23
This is an automatically generated mail message from mdadm running on fcshome.stoneham.ma.us
A DegradedArray event had been detected on md device /dev/md125.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md0 : active raid1 sda1[0] 104320 blocks [2/1] [U_]
md126 : active raid1 sdb1[1] 104320 blocks [2/1] [_U]
md125 : active raid1 sdb2[1] 312464128 blocks [2/1] [_U]
md1 : active raid1 sda2[0] 312464128 blocks [2/1] [U_]
unused devices: <none>
firstly, what the heck are md125 and md126? previously there were only md0 and md1....
secondly, I'm not sure what it's trying to tell me. it says there was a "degradedarray event" but at the bottom it says there are no unused devices.
there are also some messages in /var/log/messages from the time of the boot earlier today, but they do NOT say anything about "kicking out" any of the md member devices (as they did in the event back in August):
Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
Oct 19 18:29:41 fcshome kernel: md: autorun ...
Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: adding sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different superblock to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: created md125
Oct 19 18:29:41 fcshome kernel: md: bind<sdb2>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: adding sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different superblock to sdb1
Oct 19 18:29:41 fcshome kernel: md: created md126
Oct 19 18:29:41 fcshome kernel: md: bind<sdb1>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
Oct 19 18:29:41 fcshome kernel: md: adding sda2 ...
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
Oct 19 18:29:41 fcshome kernel: md: created md1
Oct 19 18:29:41 fcshome kernel: md: bind<sda2>
Oct 19 18:29:41 fcshome kernel: md: running: <sda2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
Oct 19 18:29:41 fcshome kernel: md: adding sda1 ...
Oct 19 18:29:41 fcshome kernel: md: created md0
Oct 19 18:29:41 fcshome kernel: md: bind<sda1>
Oct 19 18:29:41 fcshome kernel: md: running: <sda1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.
and here's /etc/mdadm.conf:
# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR fredex
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=4eb13e45:b5228982:f03cd503:f935bd69
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12
which doesn't say anything about md125 or md126,... might they be some kind of detritus or fragments left over from whatever kind of failure caused the array to become degraded?
do ya suppose a boot from power-off might somehow give it a whack upside the head so it'll reassemble itself according to mdadm.conf?
I'm not sure which devices need to be failed and re-added to make it clean again (which is all I had to do when I had the aforementioned earlier problem.)
Thanks in advance for any advice!
Fred
I've seen this kind of thing happen when the autodetection stuff misbehaves. I'm not sure why it does this or how to prevent it. Anyway, to recover, I would use something like:
mdadm --stop /dev/md125
mdadm --stop /dev/md126
If for some reason the above commands fail, check and make sure it has not automounted the file systems from md125 and md126. Hopefully this won't happen.
Then use:
mdadm /dev/md0 -a /dev/sdXX
to add back the drive which belongs in md0, and similarly for md1. In general, it won't let you add the wrong drive, but if you want to check, use:
mdadm --examine /dev/sda1 | grep UUID
and so forth for all your drives, and find the ones with the same UUID.
Well, I've already tried to use --fail and --remove on md125 and md126 but I'm told the members are still active.
mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md125 --fail /dev/sdb2 --remove /dev/sdb2
mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md126
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
with the intention of then re-adding them to md0 and md1.
so I tried:
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
and got a similar message.
at which point I knew I was in over my head.
When I create my RAID arrays, I always use the option --bitmap=internal. With this option set, a bitmap is used to keep track of which pages on the drive are out of date, so you only resync the pages which need updating instead of recopying the whole drive when this happens. In the past I once added a bitmap to an existing raid1 array using something like this (this may not be the exact command, but I know it can be done):
mdadm /dev/mdN --bitmap=internal
Adding the bitmap is very worthwhile and saves time and risk of data loss by not having to recopy the whole partition.
Nataraj
fred smith wrote:
On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:
I've seen this kind of thing happen when the autodetection stuff misbehaves. I'm not sure why it does this or how to prevent it. Anyway, to recover, I would use something like:
mdadm --stop /dev/md125
mdadm --stop /dev/md126
If for some reason the above commands fail, check and make sure it has not automounted the file systems from md125 and md126. Hopefully this won't happen.
Then use:
mdadm /dev/md0 -a /dev/sdXX
to add back the drive which belongs in md0, and similarly for md1. In general, it won't let you add the wrong drive, but if you want to check, use:
mdadm --examine /dev/sda1 | grep UUID
and so forth for all your drives, and find the ones with the same UUID.
Well, I've already tried to use --fail and --remove on md125 and md126 but I'm told the members are still active.
mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md125 --fail /dev/sdb2 --remove /dev/sdb2
You want to use --stop for md125 and md126. Those are the raid devices that are not correct. Once they are stopped, you can take the drives from them and return them to md0 and md1 where they belong.
You will need to add the correct drive that was originally paired in each raid set, but as I mentioned, it won't let you add the wrong drives, so just try adding sdb1 to md0, and if that doesn't work, add it to md1. You can't fail out drives from arrays that only have one drive.
Nataraj
mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md126
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
with the intention of then re-adding them to md0 and md1.
so I tried:
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
and got a similar message.
at which point I knew I was in over my head.
When I create my RAID arrays, I always use the option --bitmap=internal. With this option set, a bitmap is used to keep track of which pages on the drive are out of date, so you only resync the pages which need updating instead of recopying the whole drive when this happens. In the past I once added a bitmap to an existing raid1 array using something like this (this may not be the exact command, but I know it can be done):
mdadm /dev/mdN --bitmap=internal
Adding the bitmap is very worthwhile and saves time and risk of data loss by not having to recopy the whole partition.
Nataraj
On Thu, Oct 21, 2010 at 08:59:13AM -0700, Nataraj wrote:
fred smith wrote:
On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:
I've seen this kind of thing happen when the autodetection stuff misbehaves. I'm not sure why it does this or how to prevent it. Anyway, to recover, I would use something like:
mdadm --stop /dev/md125
mdadm --stop /dev/md126
If for some reason the above commands fail, check and make sure it has not automounted the file systems from md125 and md126. Hopefully this won't happen.
Then use:
mdadm /dev/md0 -a /dev/sdXX
to add back the drive which belongs in md0, and similarly for md1. In general, it won't let you add the wrong drive, but if you want to check, use:
mdadm --examine /dev/sda1 | grep UUID
and so forth for all your drives, and find the ones with the same UUID.
Well, I've already tried to use --fail and --remove on md125 and md126 but I'm told the members are still active.
mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md125 --fail /dev/sdb2 --remove /dev/sdb2
You want to use --stop for md125 and md126. Those are the raid devices that are not correct. Once they are stopped, you can take the drives from them and return them to md0 and md1 where they belong.
You will need to add the correct drive that was originally paired in each raid set, but as I mentioned, it won't let you add the wrong drives, so just try adding sdb1 to md0, and if that doesn't work, add it to md1. You can't fail out drives from arrays that only have one drive.
Thanks for the additional information.
I'll try backing up everything this weekend then will take a stab at it.
someone said earlier that the differing raid superblocks were probably the cause of the misassignment in the first place, but I have no clue how the superblocks could have become messed up. Can any of you comment on that? Will I need to hack at that issue, too, before I can succeed?
thanks again!
Nataraj
mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md126
mdadm: hot remove failed for /dev/sdb1: Device or resource busy
with the intention of then re-adding them to md0 and md1.
so I tried:
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
and got a similar message.
at which point I knew I was in over my head.
When I create my RAID arrays, I always use the option --bitmap=internal. With this option set, a bitmap is used to keep track of which pages on the drive are out of date, so you only resync the pages which need updating instead of recopying the whole drive when this happens. In the past I once added a bitmap to an existing raid1 array using something like this (this may not be the exact command, but I know it can be done):
mdadm /dev/mdN --bitmap=internal
Adding the bitmap is very worthwhile and saves time and risk of data loss by not having to recopy the whole partition.
Nataraj
fred smith wrote:
Thanks for the additional information.
I'll try backing up everything this weekend then will take a stab at it.
someone said earlier that the differing raid superblocks were probably the cause of the misassignment in the first place, but I have no clue how the superblocks could have become messed up. Can any of you comment on that? Will I need to hack at that issue, too, before I can succeed?
thanks again!
Nataraj
I would first try adding the drives back in with:
mdadm /dev/mdN -a /dev/sdXn
Again, this is after having stopped the bogus md arrays.
If that doesn't work, I would try assemble with a --force option, which might be a little more dangerous than the hot add, but probably not much. I can say that when I have a drive fall out of an array I am always able to add it back with the first command (-a). As I mentioned, I do have bitmaps on all my arrays, but you can't change that until you rebuild the raidset.
I believe these commands will take care of everything. You shouldn't have to do any diddling of the superblocks at a low level, and if the problem is that bad, you might be best off backing up and recreating the whole array, or engaging the services of someone who knows how to muck with the data structures on the disk. I've never had to use anything other than mdadm to manage my raid arrays, and I've never lost data with Linux software raid in the 10 or more years that I've been using it. I've found it to be quite robust. Backing up is just a precaution that is a good idea for anyone to take if they care about their data.
If these problems recur on a regular basis, you could have a bad drive, a power supply problem, or a cabling problem. Assuming your drives are attached to a SATA, SCSI or SAS controller, you can use smartctl to check the drives and see if they are getting errors or other faults. smartctl will not work with USB- or FireWire-attached drives.
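A quick health pass over both drives might look like this (a sketch; the attribute names are common SMART fields, not guaranteed for every drive model):

smartctl -H /dev/sda      # overall health verdict
smartctl -H /dev/sdb
smartctl -a /dev/sda      # full report; watch Reallocated_Sector_Ct and UDMA_CRC_Error_Count
smartctl -a /dev/sdb

A climbing UDMA_CRC_Error_Count in particular usually points at cabling rather than the drive itself.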
Nataraj
Nataraj wrote:
fred smith wrote:
Thanks for the additional information.
I'll try backing up everything this weekend then will take a stab at it.
someone said earlier that the differing raid superblocks were probably the cause of the misassignment in the first place, but I have no clue how the superblocks could have become messed up. Can any of you comment on that? Will I need to hack at that issue, too, before I can succeed?
thanks again!
Nataraj
I would first try adding the drives back in with:
mdadm /dev/mdN -a /dev/sdXn
Again, this is after having stopped the bogus md arrays.
If that doesn't work, I would try assemble with a --force option, which might be a little more dangerous than the hot add, but probably not much. I can say that when I have a drive fall out of an array I am always able to add it back with the first command (-a). As I mentioned, I do have bitmaps on all my arrays, but you can't change that until you rebuild the raidset.
Note that if you need to use assemble --force, you must stop the array first and know exactly which drives you want to assemble the array with.
I believe these commands will take care of everything. You shouldn't have to do any diddling of the superblocks at a low level, and if the problem is that bad, you might be best off backing up and recreating the whole array, or engaging the services of someone who knows how to muck with the data structures on the disk. I've never had to use anything other than mdadm to manage my raid arrays, and I've never lost data with Linux software raid in the 10 or more years that I've been using it. I've found it to be quite robust. Backing up is just a precaution that is a good idea for anyone to take if they care about their data.
If these problems recur on a regular basis, you could have a bad drive, a power supply problem, or a cabling problem. Assuming your drives are attached to a SATA, SCSI or SAS controller, you can use smartctl to check the drives and see if they are getting errors or other faults. smartctl will not work with USB- or FireWire-attached drives.
Nataraj
Nataraj wrote:
Nataraj wrote:
fred smith wrote:
Thanks for the additional information.
I'll try backing up everything this weekend then will take a stab at it.
someone said earlier that the differing raid superblocks were probably the cause of the misassignment in the first place, but I have no clue how the superblocks could have become messed up. Can any of you comment on that? Will I need to hack at that issue, too, before I can succeed?
thanks again!
Nataraj
I would first try adding the drives back in with:
mdadm /dev/mdN -a /dev/sdXn
Again, this is after having stopped the bogus md arrays.
If that doesn't work, I would try assemble with a --force option, which might be a little more dangerous than the hot add, but probably not much. I can say that when I have a drive fall out of an array I am always able to add it back with the first command (-a). As I mentioned, I do have bitmaps on all my arrays, but you can't change that until you rebuild the raidset.
Note that if you need to use assemble --force, you must stop the array first and know exactly which drives you want to assemble the array with.
It's possible that my drives go back so easily because of the bitmap. You can probably also use --force with the -a option (hot add). If you use --force, I would make sure that you are specifying the right drives/partitions, since --force will probably cause whatever partition you give it to be used in the array regardless of whether it was in the same array before. So if you use --force, I would check the UUIDs of the partitions first and make sure they are the same, since --force would allow you to insert one of your md1 partitions into your md0 array.
Nataraj
Nataraj
On Thu, Oct 21, 2010 at 11:03:27AM -0700, Nataraj wrote:
fred smith wrote:
Thanks for the additional information.
I'll try backing up everything this weekend then will take a stab at it.
someone said earlier that the differing raid superblocks were probably the cause of the misassignment in the first place, but I have no clue how the superblocks could have become messed up. Can any of you comment on that? Will I need to hack at that issue, too, before I can succeed?
thanks again!
Nataraj
I would first try adding the drives back in with:
mdadm /dev/mdN -a /dev/sdXn
Again, this is after having stopped the bogus md arrays.
Nataraj, that worked fine, didn't need to --force it. Now I'm back to having two devices in md0 and two in md1, and they're the RIGHT two! :) Put the box in single-user mode to do the work, then after the array finished resyncing, rebooted and it came up with the right two md devices.
I appreciate your tutoring me on this, you've been most helpful.
Thanks a bunch!
Oh, can you refer me to any good documentation on how to admin a software raid system? One aimed for people, like me, who are computer literate, but have never trained as a sysadmin, and who don't know much about RAID...
thanks again!
Fred
If that doesn't work, I would try assemble with a --force option, which might be a little more dangerous than the hot add, but probably not much. I can say that when I have a drive fall out of an array I am always able to add it back with the first command (-a). As I mentioned, I do have bitmaps on all my arrays, but you can't change that until you rebuild the raidset.
I believe these commands will take care of everything. You shouldn't have to do any diddling of the superblocks at a low level, and if the problem is that bad, you might be best off backing up and recreating the whole array, or engaging the services of someone who knows how to muck with the data structures on the disk. I've never had to use anything other than mdadm to manage my raid arrays, and I've never lost data with Linux software raid in the 10 or more years that I've been using it. I've found it to be quite robust. Backing up is just a precaution that is a good idea for anyone to take if they care about their data.
If these problems recur on a regular basis, you could have a bad drive, a power supply problem, or a cabling problem. Assuming your drives are attached to a SATA, SCSI or SAS controller, you can use smartctl to check the drives and see if they are getting errors or other faults. smartctl will not work with USB- or FireWire-attached drives.
Nataraj
fred smith wrote:
On Thu, Oct 21, 2010 at 11:03:27AM -0700, Nataraj wrote:
fred smith wrote:
Thanks for the additional information.
I'll try backing up everything this weekend then will take a stab at it.
someone said earlier that the differing raid superblocks were probably the cause of the misassignment in the first place, but I have no clue how the superblocks could have become messed up. Can any of you comment on that? Will I need to hack at that issue, too, before I can succeed?
thanks again!
Nataraj
I would first try adding the drives back in with:
mdadm /dev/mdN -a /dev/sdXn
Again, this is after having stopped the bogus md arrays.
Nataraj, that worked fine, didn't need to --force it. Now I'm back to having two devices in md0 and two in md1, and they're the RIGHT two! :) Put the box in single-user mode to do the work, then after the array finished resyncing, rebooted and it came up with the right two md devices.
I appreciate your tutoring me on this, you've been most helpful.
Thanks a bunch!
Oh, can you refer me to any good documentation on how to admin a software raid system? One aimed for people, like me, who are computer literate, but have never trained as a sysadmin, and who don't know much about RAID...
thanks again!
Fred
Hi Fred,
You might try this one, since it seems to be one of the more up to date: https://raid.wiki.kernel.org/index.php/Linux_Raid
Also the mdadm man page and running "mdadm --help".
Oh, and there's this, however, some of the pages just happen to be in Chinese... http://wiki.centos.org/Search?action=fullsearch&titlesearch=1&value=... http://wiki.centos.org/Search?action=fullsearch&titlesearch=1&value=raid
Nataraj
Nataraj
On Sat, Oct 23, 2010 at 10:05:30PM -0700, Nataraj wrote:
fred smith wrote:
On Thu, Oct 21, 2010 at 11:03:27AM -0700, Nataraj wrote:
fred smith wrote:
Thanks for the additional information.
I'll try backing up everything this weekend then will take a stab at it.
someone said earlier that the differing raid superblocks were probably the cause of the misassignment in the first place, but I have no clue how the superblocks could have become messed up. Can any of you comment on that? Will I need to hack at that issue, too, before I can succeed?
thanks again!
Nataraj
I would first try adding the drives back in with:
mdadm /dev/mdN -a /dev/sdXn
Again, this is after having stopped the bogus md arrays.
Nataraj, that worked fine, didn't need to --force it. Now I'm back to having two devices in md0 and two in md1, and they're the RIGHT two! :) Put the box in single-user mode to do the work, then after the array finished resyncing, rebooted and it came up with the right two md devices.
I appreciate your tutoring me on this, you've been most helpful.
Thanks a bunch!
Oh, can you refer me to any good documentation on how to admin a software raid system? One aimed for people, like me, who are computer literate, but have never trained as a sysadmin, and who don't know much about RAID...
thanks again!
Fred
Hi Fred,
You might try this one, since it seems to be one of the more up to date: https://raid.wiki.kernel.org/index.php/Linux_Raid
Also the mdadm man page and running "mdadm --help".
Oh, and there's this, however, some of the pages just happen to be in Chinese... http://wiki.centos.org/Search?action=fullsearch&titlesearch=1&value=... http://wiki.centos.org/Search?action=fullsearch&titlesearch=1&value=raid
Once again, thanks! the raid.wiki... one looks good. I'm afraid that a Chinese-language resource isn't going to help me much... but perhaps someone else will see this and find it helpful.
on 10-21-2010 9:13 AM fred smith spake the following:
On Thu, Oct 21, 2010 at 08:59:13AM -0700, Nataraj wrote:
fred smith wrote:
On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:
I've seen this kind of thing happen when the autodetection stuff misbehaves. I'm not sure why it does this or how to prevent it. Anyway, to recover, I would use something like:
mdadm --stop /dev/md125
mdadm --stop /dev/md126
If for some reason the above commands fail, check and make sure it has not automounted the file systems from md125 and md126. Hopefully this won't happen.
Then use:
mdadm /dev/md0 -a /dev/sdXX
to add back the drive which belongs in md0, and similarly for md1. In general, it won't let you add the wrong drive, but if you want to check, use:
mdadm --examine /dev/sda1 | grep UUID
and so forth for all your drives, and find the ones with the same UUID.
Well, I've already tried to use --fail and --remove on md125 and md126 but I'm told the members are still active.
mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md125 --fail /dev/sdb2 --remove /dev/sdb2
You want to use --stop for md125 and md126. Those are the raid devices that are not correct. Once they are stopped, you can take the drives from them and return them to md0 and md1 where they belong.
You will need to add the correct drive that was originally paired in each raid set, but as I mentioned, it won't let you add the wrong drives, so just try adding sdb1 to md0, and if that doesn't work, add it to md1. You can't fail out drives from arrays that only have one drive.
Thanks for the additional information.
I'll try backing up everything this weekend then will take a stab at it.
someone said earlier that the differing raid superblocks were probably the cause of the misassignment in the first place, but I have no clue how the superblocks could have become messed up. Can any of you comment on that? Will I need to hack at that issue, too, before I can succeed?
thanks again!
If the system lost power or otherwise went off before all superblock data was flushed, that could have corrupted the data. I would assume that the oddball devices were the corrupt ones, but unless you have something to compare to, it is hard to be sure.
On Wed, Oct 20, 2010 at 4:34 AM, Nataraj incoming-centos@rjl.com wrote:
When I create my RAID arrays, I always use the option --bitmap=internal. With this option set, a bitmap is used to keep track of which pages on the drive are out of date, so you only resync the pages which need updating instead of recopying the whole drive when this happens. In the past I once added a bitmap to an existing raid1 array using something like this (this may not be the exact command, but I know it can be done):
mdadm /dev/mdN --bitmap=internal
How do you add --bitmap=internal to an existing, running RAID set? I have tried with the command above but got the following error:
[root@intranet ~]# mdadm /dev/md2 --bitmap=internal
mdadm: -b cannot have any extra immediately after it, sorry.
[root@intranet ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hda1[0] 104320 blocks [2/1] [U_]
md2 : active raid1 sda1[0] sdb1[1] 244195904 blocks [2/2] [UU]
md1 : active raid1 hda2[0] 244091520 blocks [2/1] [U_]
Rudi Ahlers wrote:
On Wed, Oct 20, 2010 at 4:34 AM, Nataraj incoming-centos@rjl.com wrote:
When I create my RAID arrays, I always use the option --bitmap=internal. With this option set, a bitmap is used to keep track of which pages on the drive are out of date, so you only resync the pages which need updating instead of recopying the whole drive when this happens. In the past I once added a bitmap to an existing raid1 array using something like this (this may not be the exact command, but I know it can be done):
mdadm /dev/mdN --bitmap=internal
How do you add --bitmap=internal to an existing, running RAID set? I have tried with the command above but got the following error:
try:
mdadm /dev/md2 -Gb internal
Also, it pays to have everything clean first, and to check that you have a persistent superblock, i.e.:
mdadm -D /dev/md2
HTH
[root@intranet ~]# mdadm /dev/md2 --bitmap=internal
mdadm: -b cannot have any extra immediately after it, sorry.
[root@intranet ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hda1[0] 104320 blocks [2/1] [U_]
md2 : active raid1 sda1[0] sdb1[1] 244195904 blocks [2/2] [UU]
md1 : active raid1 hda2[0] 244091520 blocks [2/1] [U_]