Saturday I did an upgrade from 5.3 (original install) to 5.4. Saturday night, /etc/cron.weekly reported the following:
/etc/cron.weekly/99-raid-check:
WARNING: mismatch_cnt is not 0 on /dev/md0
md0 holds /boot and resides, mirrored, on sda1 and sdb1. md1 holds an LVM volume containing the remaining filesystems, including swap.
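(For reference, the value the cron script is complaining about can be read straight out of sysfs; with the layout above that would be something like:)

# cat /proc/mdstat
# cat /sys/block/md0/md/mismatch_cnt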
The underlying hardware is just a few months old, has passed the usual memtest stuff, and has been running 5.3 well for a few months.
I'm *guessing* that due to the timing, this is related to the upgrade. I have to admit that I forgot myself and instead of doing the glibc updates as recommended, I only did:
yum clean all
yum update yum
rpm -e --nodeps perl-5.8.8-18.el5_3.1.i386   (see today's perl thread)
yum update perl.x86_64
yum update
shutdown -r now
I've taken a backup of /boot (via dump) after the upgrade, but have not yet re-enabled normal backups.
My hunch is that something in the upgrade process touched sda1 but not sdb1, and that removing sdb1 from the mirror and reattaching it for a resync would be sufficient; however, I was looking for comments on this from anyone with experience or an opinion on the matter. Googling the issue doesn't seem to turn up any recent related results.
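(Roughly, the remove-and-reattach I have in mind would look like the following, with sdb1 as the member to resync; a sketch only, not yet run on this box:)

# mdadm /dev/md0 --fail /dev/sdb1
# mdadm /dev/md0 --remove /dev/sdb1
# mdadm /dev/md0 --add /dev/sdb1
# cat /proc/mdstat    (watch the resync run to completion)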
Also, could the upgrade have touched the boot block on sda1 but not sdb1, and thus triggered this problem?
Devin
Devin Reade wrote:
Saturday I did an upgrade from 5.3 (original install) to 5.4. Saturday night, /etc/cron.weekly reported the following:
/etc/cron.weekly/99-raid-check: WARNING: mismatch_cnt is not 0 on /dev/md0
[...]
What exactly is the mismatch_cnt value? If it's not too much, it is most likely coming from your swap partition.
Run a check; if that doesn't fail, I wouldn't worry about it.
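(If it helps, kicking off the check by hand and reading the result back looks roughly like this, assuming the array is md0:)

# echo check > /sys/block/md0/md/sync_action
# cat /proc/mdstat    (wait for the check to finish)
# cat /sys/block/md0/md/mismatch_cnt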
Glenn
RedShift redshift@pandora.be wrote:
What exactly is the mismatch_cnt value? If it's not too much, it is most likely coming from your swap partition.
128. md0 is /boot only; swap is on md1 which didn't have a problem
Devin
On Sun, 2009-10-25 at 12:33 -0600, Devin Reade wrote:
Saturday I did an upgrade from 5.3 (original install) to 5.4. Saturday night, /etc/cron.weekly reported the following:
/etc/cron.weekly/99-raid-check: WARNING: mismatch_cnt is not 0 on /dev/md0
I had this happen on a box that I upgraded Friday. I went ahead and tested each partition in the affected mirror with badblocks (found no errors), and after multiple resyncs there was no change. After similar experiences with Google, I did run across a note saying that this went away after a reboot. I broke down and applied the Micro$lop solution (reboot), and the error has gone away.
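(For anyone wanting to repeat that, the read-only scan would be along the lines of the following; substitute your own member devices:)

# badblocks -sv /dev/sda1
# badblocks -sv /dev/sdb1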
Like you, I'm interested in a better understanding of this issue, so if anyone else has more info, I'm all ears. ;>
On Sun, 2009-10-25 at 14:52 -0400, Ron Loftin wrote:
On Sun, 2009-10-25 at 12:33 -0600, Devin Reade wrote:
Saturday I did an upgrade from 5.3 (original install) to 5.4. Saturday night, /etc/cron.weekly reported the following:
/etc/cron.weekly/99-raid-check: WARNING: mismatch_cnt is not 0 on /dev/md0
[...]
Like you, I'm interested in a better understanding of this issue, so if anyone else has more info, I'm all ears. ;>
mismatch_cnt (/sys/block/md*/md/mismatch_cnt) is the number of unsynchronized blocks in the raid.
The repair is to rebuild the raid:
# echo repair >/sys/block/md<#>/md/sync_action
...which does not reset the count, but if you force a check after the rebuild is complete:
# echo check >/sys/block/md<#>/md/sync_action
...then the count should return to zero.
Or at least that worked for me on three systems today.
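(One detail worth making explicit: the 'check' should only be written once the repair pass has finished. A rough way to tell, using md0 as the example:)

# cat /sys/block/md0/md/sync_action    (reports "repair" while running, "idle" when done)
# cat /proc/mdstat                     (shows rebuild progress)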
Steve
S.Tindall tindall.satwth@brandxmail.com wrote:
mismatch_cnt (/sys/block/md*/md/mismatch_cnt) is the number of unsynchronized blocks in the raid.
Understood.
I did the repair/check on sync_action and it got rid of the problem. (Thanks)
What I _don't_ understand is why they were unsynchronized to begin with (`cat /proc/mdstat` showed the array to be clean). Nor do I understand how the 'repair' action actually works, or why I should believe that it's using the correct data when it resyncs. Although I've looked around, I've not seen anything that describes how repair works and (specifically for raid1) how it can tell which slice has the good data and which has the bad data.
"Fixing" things without understanding what is going on under the covers (at least conceptually) does not give me a warm fuzzy feeling :/
Devin
On 10/25/2009 07:33 PM, Devin Reade wrote: ...
WARNING: mismatch_cnt is not 0 on /dev/md0
I have two machines with software RAID 1 running CentOS; they both gave this message this weekend.
Mogens
The /etc/cron.weekly/99-raid-check script is new in 5.4. Read through the mdadm list and you will find that small mismatch counts on RAID 1 are normal. I don't remember the exact reason, but it has to do with aborted writes where the queue has already committed the write to one drive but not the other. Since the data is in an unused area of the filesystem and mdadm can't tell when the aborted write happened, it is just left alone. This is why it is common on swap partitions.
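(Easy enough to confirm by listing the count for every array and seeing which one actually carries it; something like:)

# for f in /sys/block/md*/md/mismatch_cnt; do echo "$f: $(cat $f)"; done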
Ryan