This has never happened to me before, and I'm somewhat at a loss. I got an email from the weekly cron job...
/etc/cron.weekly/99-raid-check:
WARNING: mismatch_cnt is not 0 on /dev/md10
WARNING: mismatch_cnt is not 0 on /dev/md11
OK, md10 and md11 are each RAID1 arrays made from 2 x 72GB SCSI drives, on a Dell 2850 or similar dual single-core 3GHz server.
These two md devices are in turn the PVs of a striped LVM volume group. Roughly, the layout was put together along these lines (a sketch of how such a setup is typically created; the exact commands and stripe options used originally are an assumption):
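# Sketch: two RAID1 mirrors, each a PV in vg1, with LVs striped across both.
# (Assumed commands -- the original creation steps aren't shown in this thread.)
mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1

pvcreate /dev/md10 /dev/md11
vgcreate vg1 /dev/md10 /dev/md11

# The striping is per-LV: -i 2 spreads each logical volume across both PVs.
lvcreate -n lv1 -L 97.66G -i 2 -I 64 vg1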
dmesg shows....
md: syncing RAID array md10
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 143374656 blocks.
md: syncing RAID array md11
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 143374656 blocks.
md: md10: sync done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sdc1
 disk 1, wo:0, o:1, dev:sdd1
md: md11: sync done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sde1
 disk 1, wo:0, o:1, dev:sdf1
I'm not sure what that's telling me. The last thing prior to this in dmesg was from when I added a swap LV to this VG last week.
and mdadm --detail shows...
# mdadm --detail /dev/md10
/dev/md10:
        Version : 0.90
  Creation Time : Wed Oct  8 12:54:48 2008
     Raid Level : raid1
     Array Size : 143374656 (136.73 GiB 146.82 GB)
  Used Dev Size : 143374656 (136.73 GiB 146.82 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 10
    Persistence : Superblock is persistent

    Update Time : Sun Feb 28 04:53:29 2010
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : b6da4dc5:c7372d6e:63f32b9c:49fa95f9
         Events : 0.84

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1

# mdadm --detail /dev/md11
/dev/md11:
        Version : 0.90
  Creation Time : Wed Oct  8 12:54:57 2008
     Raid Level : raid1
     Array Size : 143374656 (136.73 GiB 146.82 GB)
  Used Dev Size : 143374656 (136.73 GiB 146.82 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 11
    Persistence : Superblock is persistent

    Update Time : Sun Feb 28 11:49:45 2010
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : be475cd9:b98ee3ff:d18e668c:a5a6e06b
         Events : 0.62

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       1       8       81        1      active sync   /dev/sdf1
I don't see anything wrong here?
LVM shows no problems that I can detect either...
# vgdisplay vg1
  Volume group "vgdisplay" not found
  LV             VG   Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  glassfish      vg1  -wi-ao 10.00G
  lv1            vg1  -wi-ao 97.66G
  oradata        vg1  -wi-ao 30.00G
  pgdata         vg1  -wi-ao 25.00G
  pgdata_lss_idx vg1  -wi-ao 20.00G
  pgdata_lss_tab vg1  -wi-ao 20.00G
  swapper        vg1  -wi-ao  3.00G
  vmware         vg1  -wi-ao 50.00G
# pvdisplay /dev/md10 /dev/md11
  --- Physical volume ---
  PV Name               /dev/md10
  VG Name               vg1
  PV Size               136.73 GB / not usable 2.31 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              35003
  Free PE               1998
  Allocated PE          33005
  PV UUID               oAgJY7-Tmf7-ac35-KoUH-15uz-Q5Ae-bmFCys

  --- Physical volume ---
  PV Name               /dev/md11
  VG Name               vg1
  PV Size               136.73 GB / not usable 2.31 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              35003
  Free PE               2560
  Allocated PE          32443
  PV UUID               A4Qb3P-j5Lr-8ZEv-FjbC-Iczm-QkC8-bqP0zv
2010/2/28 John R Pierce pierce@hogranch.com:
WARNING: mismatch_cnt is not 0 on /dev/md10
WARNING: mismatch_cnt is not 0 on /dev/md11
Maybe this helps: http://www.arrfab.net/blog/?p=199
-- Eero
On 28.02.2010 22:03, John R Pierce wrote:
WARNING: mismatch_cnt is not 0 on
Have a look at http://www.arrfab.net/blog/?p=199 It says:
A `echo repair >/sys/block/md0/md/sync_action` followed by a `echo check >/sys/block/md0/md/sync_action` seems to have corrected it. Now `cat /sys/block/md0/md/mismatch_cnt` returns 0 …
Regards,
Peter
On 01/03/10 10:16, Peter Hinse wrote:
Have a look at http://www.arrfab.net/blog/?p=199
Hi,
This is happening specifically because of the way swap works. So the issue will re-appear but it isn't actually anything to worry about. I'd suggest that you remove the particular drive from the list being scanned.
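If you only want to exclude those arrays rather than touch the weekly job itself, the raid-check script reads its settings from /etc/sysconfig/raid-check; something like the following should do it (a sketch only -- apart from ENABLED, the variable names here are assumptions, so check the comments in /etc/cron.weekly/99-raid-check on your box for the exact names):

# /etc/sysconfig/raid-check -- sketch; verify variable names against the
# 99-raid-check script before relying on them.
ENABLED=yes             # keep the weekly scan running for other arrays
CHECK=check             # "check" only reports mismatches, "repair" rewrites them
SKIP_DEVS="md10 md11"   # arrays to leave out of the scan (assumed variable name)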
Peter Hinse wrote:
Have a look at http://www.arrfab.net/blog/?p=199 It says:
A `echo repair >/sys/block/md0/md/sync_action` followed by a `echo check >/sys/block/md0/md/sync_action` seems to have corrected it. Now `cat /sys/block/md0/md/mismatch_cnt` returns 0 …
Thanks. I was trying to figure out how to do a scan from the mdadm commands (ugh!).
# cat /sys/block/md10/md/mismatch_cnt
8448
# cat /sys/block/md11/md/mismatch_cnt
7296
Fugly. Since the mirrors aren't checksummed, can I assume this means there are likely some data mess-ups here?
Anyway, the repair is running on both md10 and md11; I'll check back with my final results...
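For the archives, what I'm running amounts to this (a sketch based on the sysfs interface from the blog post above):

# Rewrite mismatched blocks (md copies from the first in-sync member),
# then re-run a check so mismatch_cnt reflects the current state.
for md in md10 md11; do
    echo repair > /sys/block/$md/md/sync_action
done

# Wait for both repair passes to finish; progress also shows in /proc/mdstat.
for md in md10 md11; do
    while [ "$(cat /sys/block/$md/md/sync_action)" != "idle" ]; do
        sleep 60
    done
done

# Fresh check, then read the counters again -- they should come back as 0.
for md in md10 md11; do
    echo check > /sys/block/$md/md/sync_action
done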
On 01/03/10 10:23, John R Pierce wrote:
Since the mirrors aren't checksummed, can I assume this means there are likely some data mess-ups here?
Hi
It has to do with aborted writes in SWAP. Your data should be fine.
Clint Dilks wrote:
It has to do with aborted writes in SWAP. Your data should be fine
So swap on LVM on MD mirrors is a bad idea?
Frankly, I usually avoid LVM, but I figured I'd set up this system with it and see how it goes. It's just a dev box, but we're about to put some Oracle stuff on it (for development, but still).
On 01/03/10 10:31, John R Pierce wrote:
So swap on LVM on MD mirrors is a bad idea?
SWAP inside LVM is fine in my experience. Personally I consider this a benign error and generally ignore it unless the mismatch count is very high.
Clint Dilks wrote:
SWAP inside LVM is fine in my experience. Personally I consider this a benign error and generally ignore it unless the mismatch count is very high
And how do I know all these mirror data mismatches are swap? Doesn't each mismatch mean the mirrors disagree, which means one of them is wrong? Which one? Since they aren't timestamped or checksummed (like VxVM and ZFS do), I am playing 'data maybe'. As someone who administers database servers, I have a real problem with that.
BTW, this is CentOS 5.4 + latest updates, x86_64. It's primarily running Postgres and our in-house Java middleware apps, and was going to be an Oracle grid operations server.
John R Pierce wrote:
And how do I know all these mirror data mismatches are swap? Doesn't each mismatch mean the mirrors disagree, which means one of them is wrong? Which one?
Turn off the swap and see if the problem goes away. Or move the swap somewhere other than the software RAID and, again, see if the problem goes away. It isn't that hard to localize.
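Roughly (a sketch; "swapper" is the swap LV shown in the earlier LVM output):

# Take swap off the md-backed LV, reset the arrays with a repair pass,
# then re-check and see whether the mismatches come back.
swapoff /dev/vg1/swapper

for md in md10 md11; do
    echo repair > /sys/block/$md/md/sync_action   # wait for this to finish...
done
for md in md10 md11; do
    echo check > /sys/block/$md/md/sync_action    # ...then this too
done
cat /sys/block/md10/md/mismatch_cnt /sys/block/md11/md/mismatch_cnt

# If the counts stay at 0 with swap off, swap was the source; re-enable it with:
swapon /dev/vg1/swapper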
On 01/03/10 11:37, John R Pierce wrote:
And how do I know all these mirror data mismatches are swap? Doesn't each mismatch mean the mirrors disagree, which means one of them is wrong? Which one?
Even if this isn't swap, the issue has to do with aborted writes, as I understand it. Situations occur where a write is requested and written to the first drive, but then aborted. There are two ways this can be handled: drive 2 doesn't have the data on it yet, so it doesn't matter, but on drive 1 you can either delete the unwanted block and then mark it as free, or you can just skip the delete. The second option is what software RAID does. So the differences being detected all relate to blocks that are marked as free for re-use.
On Sun, Feb 28, 2010 at 02:37:13PM -0800, John R Pierce wrote:
And how do I know all these mirror data mismatches are swap? Doesn't each mismatch mean the mirrors disagree, which means one of them is wrong? Which one? Since they aren't timestamped or checksummed (like
This thread is very timely. I updated my C5.3 to 5.4 last week (not sure why it took me so long) and this morning noticed my RAID5 was resyncing. 5 x 1TB disks. The resync took...

Feb 28 04:22:02 mercury kernel: md: syncing RAID array md3
Feb 28 16:27:06 mercury kernel: md: md3: sync done.
Performance was bad during this time. Not terrible from an interactive point of view, but a job that normally runs from 4am to 10am didn't finish until 3pm.
I like the concept of checking that the disks are good, but it really sounds like there are practical problems (false positives, performance degradation).
So I think /etc/sysconfig/raid-check is going to read ENABLED=no
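An alternative to killing it outright would be to throttle the background check so it can't eat all the I/O; a sketch (the md speed-limit sysctls are standard kernel tunables, but the right numbers for a given box are only a guess):

# Cap the per-device rate md will use for checks/resyncs (KB/sec).
# Defaults are min 1000 / max 200000, as seen in the dmesg output earlier.
echo 10000 > /proc/sys/dev/raid/speed_limit_max

# Or persistently via /etc/sysctl.conf:
#   dev.raid.speed_limit_max = 10000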
On Feb 28, 2010, at 7:15 PM, Stephen Harris lists@spuddy.org wrote:
I like the concept of checking that the disks are good, but it really sounds like there are practical problems (false positives, performance degradation). So I think /etc/sysconfig/raid-check is going to read ENABLED=no
It would be nice if the mismatch_cnt could be compared to a count of aborted writes and only resync if they differ, but mismatch_cnt persists while the count of aborted writes is only maintained since the last reboot.
Ideally the md raid code needs to make the writes completely atomic, so they either complete on all members or none and not allow an abort task to preempt a write in progress.
-Ross
On 2/28/2010 6:15 PM, Stephen Harris wrote:
I like the concept of checking that the disks are good, but it really sounds like there are practical problems (false positives, performance degradation). So I think /etc/sysconfig/raid-check is going to read ENABLED=no
I agree that it is a fairly surprising behavior change for an 'enterprise' system, where the point is mostly to not have surprising behavior changes. On the other hand, it is probably a good thing to do if you can make the scheduling fit in.
On Sun, 2010-02-28 at 14:37 -0800, John R Pierce wrote:
As someone who administers database servers, I have a real problem with that.
----
Then, any reason not to run the PERC in the 2850? None of my DB machines run swap, period. My thoughts on Linux swap: if I were to use it, it can't keep up the sync pace; been there.
I realize I say no swap but with MS DesktopEngine and SQL-CEServer I use swap.
John
JohnS wrote:
Then, any reason not to run the PERC in the 2850? None of my DB machines run swap, period. My thoughts on Linux swap: if I were to use it, it can't keep up the sync pace; been there.
These 2850s have only JBOD SCSI. PERC was an option; I didn't buy these, I inherited them.
I realize I say no swap but with MS DesktopEngine and SQL-CEServer I use swap.
The Oracle installer complains if you don't have swap >= main memory. I realize I can choose to ignore that.
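(The check it trips over is basically this; a sketch, not the installer's actual logic:)

# Compare total swap to total RAM -- the condition the installer warns about.
mem_kb=$(awk '/^MemTotal:/  {print $2}' /proc/meminfo)
swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
if [ "$swap_kb" -lt "$mem_kb" ]; then
    echo "swap (${swap_kb} kB) < RAM (${mem_kb} kB) -- installer will complain"
fi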
On Mon, 2010-03-01 at 13:02 -0800, John R Pierce wrote:
The Oracle installer complains if you don't have swap >= main memory. I realize I can choose to ignore that.
---

And when I see Oracle, I complain also. :-) God, have you priced the Oracle EHR App Stack?
John
On 01/03/10 10:27, Clint Dilks wrote:
It has to do with aborted writes in SWAP. Your data should be fine.
See http://forum.nginx.org/read.php?24,16699 for more info