Hey,
I installed 2 new data servers with a big (12TB) RAID6 mdraid.
I formatted the whole arrays with bad-block checks. One server is moderately used (NFS on one md), while the other is not.
One week later, after the raid-check from cron, I get on both servers
a few block_mismatch... 1976162368 on the used one and a tiny bit less
on the other... That seems a tiny little bit high...
I do the whole repair+recheck and it is back to zero. For brand new arrays, I am not very happy about the way mdraid
seems to function (with the manual/croned checks, repairs, checks,
checks, repairs, checks...) I will have to see next week how high the mismatches reach...
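For reference, the counter in question is exposed per array in sysfs as mismatch_cnt. Below is a minimal Python sketch for reading it. The helper names read_mismatch_counts and sectors_to_gib are made up for illustration, and the GiB conversion assumes the counter is expressed in 512-byte sectors, which is how the kernel's md documentation describes it.

```python
import glob
import os


def read_mismatch_counts(sys_block="/sys/block"):
    """Return {array_name: mismatch_cnt} for every md array under sys_block."""
    counts = {}
    pattern = os.path.join(sys_block, "md*", "md", "mismatch_cnt")
    for path in sorted(glob.glob(pattern)):
        # path looks like <sys_block>/md0/md/mismatch_cnt
        name = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as handle:
            counts[name] = int(handle.read().strip())
    return counts


def sectors_to_gib(sectors, sector_size=512):
    """Convert a sector count to GiB (assumption: 512-byte sectors)."""
    return sectors * sector_size / 2**30


if __name__ == "__main__":
    for name, count in read_mismatch_counts().items():
        print(f"{name}: mismatch_cnt={count} (~{sectors_to_gib(count):.2f} GiB)")
```

Under that sector assumption, the 1976162368 reported above would correspond to roughly 942 GiB of a 12TB array.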
Since my desktop is a RAID1 mdraid on 2 disks, I decided to have a look for
fun... Apart from some low count mismatches, I did not have many problems... Did the whole check+repair+check on 3 mds and had a look at mdstat...
md0 : active raid1 sdb1[1] sda1[0] 200704 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] 2048192 blocks [2/1] [_U]
md2 : active raid1 sdb3[1] sda3[0] 2048192 blocks [2/2] [UU]
md3 : active raid1 sdb6[1] sda6[0] 6144704 blocks [2/2] [UU]
md4 : active raid1 sdb8[1] sda8[0] 2048192 blocks [2/2] [UU]
md5 : active raid1 sda7[0] 4096448 blocks [2/1] [U_]
md6 : active raid1 sdb5[2](F) sda5[0] 131074176 blocks [2/1] [U_]
md7 : active raid1 sdb9[1] sda9[0] 340722432 blocks [2/2] [UU]
It seems like I have some healthy volumes, some "failed partitions",
and even some "missing partitions"... on both disks...
And mdraid does not seem alarmed about it...
1. Is there something to activate to get some kind of mdraid warnings? In /var/log/messages I cannot find any alarming message.
2. How to recover? Reboot?
Or should I just mdadm --add the missing ones
and --remove then --add the failed ones?
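If re-adding does turn out to be the right course, the two cases could be written out as below. This is only a sketch: readd_commands is a hypothetical helper that builds the mdadm invocations without running them, and /dev/sda2 as the member missing from md1 is a guess based on the naming of the other arrays, not something mdstat shows.

```python
def readd_commands(array, member, failed=False):
    """Build (but do not run) the mdadm commands to put member back into array."""
    commands = []
    if failed:
        # A member flagged (F) is removed from the array before being added again.
        commands.append(["mdadm", array, "--remove", member])
    commands.append(["mdadm", array, "--add", member])
    return commands


# Failed member, as shown for md6 in the listing above.
for command in readd_commands("/dev/md6", "/dev/sdb5", failed=True):
    print(" ".join(command))

# Missing member; /dev/sda2 is an assumption, check the partition table first.
for command in readd_commands("/dev/md1", "/dev/sda2"):
    print(" ".join(command))
```

Printing the commands instead of executing them leaves room to check the disk's health (SMART data, kernel log) before anything is resynced onto it.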
Thx, JD
On 09.Oct.2013, at 16:55, John Doe wrote:
Hey,
I installed 2 new data servers with a big (12TB) RAID6 mdraid.
...
Since my desktop is a RAID1 mdraid on 2 disks, I decided to have a look for
fun... Apart from some low count mismatches, I did not have many problems... Did the whole check+repair+check on 3 mds and had a look at mdstat...
I think there should not be any count mismatches with raid 6, but... md raid 1 is another beast. Such count mismatches can happen fairly easily.
A page in virtual memory is modified, and eventually it is sent to both disks. One disk is a little bit slower, and there you have your potential mismatch. As I understand it, the raid check does not care about virtual memory but acts on physical disk sectors. If the raid check reads a block at the very moment when one disk has written it but the other disk has *not yet*, then... well, you get it?
I do not know exactly about md's raid 6, but I always thought that this false-positive mismatch count thing was raid 1 specific. Because of all this, I would also tend to turn off the weekly raid check cronjob for md raid 1.
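The timing argument above can be illustrated with a toy model. This is not how md is implemented. It only assumes that the two mirror legs take their copy of a page at slightly different moments, and that something may modify the page in between. All names (mirrored_write, count_mismatches) are invented for the example.

```python
def mirrored_write(buffer, mutate_between_copies=None):
    """Copy buffer to two mirror legs, one after the other.

    mutate_between_copies is an optional callback that alters the buffer
    after the first leg has taken its copy, standing in for an application
    that keeps writing to the page while the I/O is still in flight.
    """
    disk_a = bytes(buffer)  # first leg takes its copy of the page
    if mutate_between_copies is not None:
        mutate_between_copies(buffer)
    disk_b = bytes(buffer)  # second leg takes its copy a moment later
    return disk_a, disk_b


def count_mismatches(disk_a, disk_b):
    """Count positions where the two legs differ, as a check pass would."""
    return sum(1 for a, b in zip(disk_a, disk_b) if a != b)


def scribble(page):
    # Hypothetical late modification of the first byte.
    page[0] = ord("B")


quiet = mirrored_write(bytearray(b"AAAA"))
racy = mirrored_write(bytearray(b"AAAA"), scribble)
print(count_mismatches(*quiet), count_mismatches(*racy))
```

In this model the mismatch is harmless whenever the page gets written again later, which is the sense in which such counts on raid 1 are called false positives.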
And mdraid seems not alarmed about it...
- Is there something to activate to get some kind of mdraid warnings? In /var/log/messages I cannot find any alarming message.
$ chkconfig --list mdmonitor
mdmonitor 0:off 1:off 2:on 3:on 4:on 5:on 6:off
Configure it with a working email address. And there *are* entries in /var/log/messages. Could it be that this happened a long time ago, you did not notice and the log files rotated out?
From: Markus Falb wnefal@gmail.com
I do not know exactly about md's raid 6, but I always thought that this false-positive mismatch count thing was raid 1 specific. Because of all this, I would also tend to turn off the weekly raid check cronjob for md raid 1.
Ok, so basically we should just ignore these mismatches...
And mdraid seems not alarmed about it...
- Is there something to activate to get some kind of mdraid warnings?
In /var/log/messages I cannot find any alarming message.
$ chkconfig --list mdmonitor
Already running...
Configure it with a working email address. And there *are* entries in /var/log/messages. Could it be that this happened a long time ago, you did not notice and the log files rotated out?
There are entries... but nothing alarming... For example:
kernel: md: syncing RAID array md5
kernel: md: md5: sync done.
While mdstat says that this raid device has only 1 partition left...
md5 : active raid1 sda7[0] 4096448 blocks [2/1] [U_]
I wonder what it synced it with...
Same for the raid devices that have a failed partition...
kernel: md: syncing RAID array md6
kernel: md: md6: sync done.
Maybe I missed some messages a few weeks/months/years ago... but I would expect the warnings to come back! Anyway, I made a monitoring script that does the checks.
Thx, JD
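JD's monitoring script is not shown in the thread. As a rough idea of what such a check could look like, here is a Python sketch that parses mdstat-style text and flags arrays with fewer active members than expected or with a member marked (F). The regular expressions, the ArrayStatus class and the function names are all assumptions made for this example, and the pattern only covers the simple "active raidN" lines seen in this thread.

```python
import re
from dataclasses import dataclass, field

# One array entry: name, state, level, members, then "[expected/active] [UU]".
ARRAY_RE = re.compile(
    r"(?P<name>md\d+) : (?P<state>\S+) (?P<level>\S+)"
    r"(?P<members>(?: \S+\[\d+\](?:\(\w\))*)+)\s+"
    r"\d+ blocks[^\[\n]*\[(?P<expected>\d+)/(?P<active>\d+)\] \[[U_]+\]"
)
MEMBER_RE = re.compile(r"(\S+)\[\d+\]((?:\(\w\))*)")


@dataclass
class ArrayStatus:
    name: str
    level: str
    expected: int
    active: int
    members: list = field(default_factory=list)
    failed: list = field(default_factory=list)

    @property
    def degraded(self):
        # Degraded if a slot is empty or any member carries the (F) flag.
        return self.active < self.expected or bool(self.failed)


def parse_mdstat(text):
    """Return one ArrayStatus per array entry found in text."""
    arrays = []
    for match in ARRAY_RE.finditer(text):
        status = ArrayStatus(
            match["name"], match["level"],
            int(match["expected"]), int(match["active"]),
        )
        for device, flags in MEMBER_RE.findall(match["members"]):
            status.members.append(device)
            if "(F)" in flags:
                status.failed.append(device)
        arrays.append(status)
    return arrays


def degraded_arrays(text):
    return [array for array in parse_mdstat(text) if array.degraded]


# Three entries taken from the listing earlier in the thread.
SAMPLE = """\
md1 : active raid1 sdb2[1]
      2048192 blocks [2/1] [_U]
md6 : active raid1 sdb5[2](F) sda5[0]
      131074176 blocks [2/1] [U_]
md7 : active raid1 sdb9[1] sda9[0]
      340722432 blocks [2/2] [UU]
"""

for array in degraded_arrays(SAMPLE):
    print(f"{array.name}: {array.active}/{array.expected} active, failed={array.failed}")
```

On the sample it reports md1 and md6 and leaves md7 alone. A real script would read /proc/mdstat instead of SAMPLE and send mail or exit non-zero when the list is not empty.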