Allo esteemed Centos-ers,
I noticed something funny with an mdadm mirror-based RAID the other day.
So I had a system disk set to mirror via mdadm.
One of the disks went south at a remote office, and since there was no one available to swap out the disk, I thought I'd leave it for later.
Well, work being what it is, later became a year.
During one of the server's many reboots, much of the data on the system went missing. The logs even showed a gap from the day the disk went bad to the day this reboot occurred.
So picture a gap in the logs of about a year or so.
I first suspected foul play, but soon discovered that the previously bad disk in the mirror had become functional again, and its data basically put the server back to where it was a year or so ago.
Not a big deal, as I do two types of backups every day, so recovery was fine.
Has anyone seen this before, and what could I have done to prevent this?
Thanks in advance,
- aurf
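A degraded mdadm mirror is visible long before a reboot resurrects the stale half. A minimal check, assuming a single array named /dev/md0 (the device name is only an illustration):

  cat /proc/mdstat           # a degraded RAID-1 shows [U_] instead of [UU]
  mdadm --detail /dev/md0    # reports "State : clean, degraded" and which member dropped out

Either command, run from cron or a monitoring system, would have flagged the array as degraded for the whole year.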
On 12/19/2011 09:29 PM, aurfalien@gmail.com wrote:
I had a similar thing happen to me just a few weeks ago. In my case the RAID broke and data was being written only to the master disk, which was the one acting up. So a reboot one month after the RAID breakage brought back one-month-old data, since the other disk came online as the RAID master.
I fixed the problem and will keep a better watch for the warning signs. I am also going to reinstall, going from 5.6 (now 5.7) to 6.2.
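Watching for bad signs can be automated: mdadm has a monitor mode that mails an alert when an array degrades. A minimal sketch for CentOS 5/6, assuming the stock mdmonitor init script and a working mail setup (the address below is only a placeholder):

  # /etc/mdadm.conf: where the monitor sends alerts
  MAILADDR admin@example.com

  # enable and start the monitor daemon
  chkconfig mdmonitor on
  service mdmonitor start

  # generate a test alert for every array to confirm mail actually arrives
  mdadm --monitor --scan --oneshot --test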
On Mon, Dec 19, 2011 at 2:45 PM, Ljubomir Ljubojevic office@plnet.rs wrote:
I had a similar thing happen to me just a few weeks ago. In my case the RAID broke and data was being written only to the master disk, which was the one acting up. So a reboot one month after the RAID breakage brought back one-month-old data, since the other disk came online as the RAID master.
That's not supposed to happen. I've had disks get kicked out, and they either resync automatically with the one holding the newest data as the master, or they don't resync until you do it manually. Is this a recent regression, or have I just been lucky?
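One way to see which half of a mirror is stale, before a resync or a reboot picks the wrong member, is to compare the event counters in the md superblocks. A minimal sketch, assuming the members are /dev/sda1 and /dev/sdb1 (illustrative names):

  mdadm --examine /dev/sda1 | grep -E 'Update Time|Events'
  mdadm --examine /dev/sdb1 | grep -E 'Update Time|Events'
  # the member with the lower event count and older update time is the stale one

If a member has been out of the array for a long time, one option is to wipe its superblock with mdadm --zero-superblock before adding it back, so it can only return via a full resync from the good disk.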
On Mon, 19 Dec 2011, aurfalien@gmail.com wrote:
I'm interested to know if you used mdadm to fail and remove the bad disk from the array when it first started acting up.
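For reference, failing and removing a flaky member is two mdadm calls. A minimal sketch, assuming /dev/md0 with /dev/sdb1 as the bad member (illustrative names):

  mdadm /dev/md0 --fail /dev/sdb1      # mark the member as faulty
  mdadm /dev/md0 --remove /dev/sdb1    # drop it from the array
  cat /proc/mdstat                     # the mirror now runs degraded on one disk

The array then runs degraded until a replacement partition is added back with mdadm /dev/md0 --add.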
On Dec 19, 2011, at 1:38 PM, Paul Heinlein wrote:
I'm interested to know if you used mdadm to fail and remove the bad disk from the array when it first started acting up.
No, I should have but left it alone.
I know, my bad.
- aurf
On Mon, 19 Dec 2011, aurfalien@gmail.com wrote:
I'm interested to know if you used mdadm to fail and remove the bad disk from the array when it first started acting up.
No, I should have but left it alone.
I know, my bad.
I was merely interested.
Recently I had a RAID-1 device get marked as bad, but I couldn't see any SMART errors. So I failed, removed, and then re-added the device. It worked for about a week, then it failed again, but this time the SMART errors were obvious.
I'd ordered a new drive at the first failure, so I was ready when it failed the second time.
I guess the point is that I've seen "bad" drives go "good" again, at least for short periods of time.
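The fail/remove/re-add cycle described above, together with a SMART check, looks roughly like this. A sketch only, with /dev/md0, /dev/sdb and /dev/sdb1 standing in for the real devices:

  smartctl -H /dev/sdb              # quick health verdict
  smartctl -a /dev/sdb              # full attributes; watch reallocated and pending sector counts
  mdadm /dev/md0 --fail /dev/sdb1
  mdadm /dev/md0 --remove /dev/sdb1
  mdadm /dev/md0 --add /dev/sdb1    # re-add; the kernel resyncs it from the good member
  cat /proc/mdstat                  # shows resync progress

If SMART looks clean but the drive keeps dropping out, a long self-test (smartctl -t long /dev/sdb) will sometimes surface the problem.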
On 12/19/2011 10:58 PM, Paul Heinlein wrote:
I guess the point is that I've seen "bad" drives go "good" again, at least for short periods of time.
Les, I will reply to your mail here.
This would better explain my troubles. I've had to manually re-add drives a few times, but I had little experience, figured these things just happen, and never gave it much thought.
I am still not 100% sure what actually happened, but my guess is that since the boot partition was active only on one drive, and that drive was the one creating problems, I took the path of least resistance and just patched it up. I am going to use the C6 ability to boot from a full RAID partition, and maybe even add an IDE DOM module for the boot partition.
I am now good and watching like a hawk.
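A related mitigation for the boot-partition-on-one-drive problem: with GRUB legacy on CentOS 5/6 the boot loader can be installed on both halves of the mirror, so the machine can boot from whichever disk survives. A minimal sketch, assuming /boot is a RAID-1 of /dev/sda1 and /dev/sdb1 and GRUB is already on /dev/sda (all names illustrative):

  grub
  grub> device (hd0) /dev/sdb    # temporarily map the second disk as hd0
  grub> root (hd0,0)             # the /boot partition on that disk
  grub> setup (hd0)              # write GRUB to its MBR
  grub> quit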