Hi all,
This morning I received this notification from mdadm:

This is an automatically generated mail message from mdadm running on server-mail.mydomain.kom

A Fail event had been detected on md device /dev/md1.

Faithfully yours, etc.
In /proc/mdstat I see this:

Personalities : [raid1]
md1 : active raid1 sdb2[2](F) sda2[0]
      77842880 blocks [2/1] [U_]
md0 : active raid1 sdb1[1] sda1[0]
      305088 blocks [2/2] [UU]
unused devices: <none>
Please help me. What should I do? Thank you very much,
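For anyone reading this in the archives: in the /proc/mdstat output above, (F) marks the faulty member and [U_] means the first half of the mirror is up while the second is down. As a sketch (run against a pasted sample rather than a live /proc/mdstat, and assuming the 2006-era output format shown in the post), the faulty members can be pulled out with grep:

```shell
# Sample /proc/mdstat content copied from the post above (not read live).
mdstat='md1 : active raid1 sdb2[2](F) sda2[0]
      77842880 blocks [2/1] [U_]
md0 : active raid1 sdb1[1] sda1[0]
      305088 blocks [2/2] [UU]'

# Members flagged (F)aulty, e.g. "sdb2[2](F)" -> "sdb2".
failed=$(printf '%s\n' "$mdstat" | grep -o '[a-z0-9]*\[[0-9]*\](F)' | cut -d'[' -f1)
echo "failed members: $failed"
```

On a live system the same pipeline would read /proc/mdstat directly instead of the pasted sample.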
On 14/03/06, Fajar Priyanto fajarpri@cbn.net.id wrote:
Hi all,
This morning I received this notification from mdadm:

This is an automatically generated mail message from mdadm running on server-mail.mydomain.kom

A Fail event had been detected on md device /dev/md1.

Faithfully yours, etc.

In /proc/mdstat I see this:

Personalities : [raid1]
md1 : active raid1 sdb2[2](F) sda2[0]
      77842880 blocks [2/1] [U_]
md0 : active raid1 sdb1[1] sda1[0]
      305088 blocks [2/2] [UU]
IMHO, the md1 component sdb2 of the RAID is failing. Try a rebuild using mdadm: first remove the RAID component sdb2 from md1, and then add it back again. Sometimes the RAID parity fails due to improper shutdowns, and when you rebuild it is restored properly. If it does not, it is likely your disk is developing errors. Be aware! Carefully read man mdadm for details before you do anything. -- Sudev Barar Learning Linux
On Tuesday 14 March 2006 11:32 am, Sudev Barar wrote:
IMHO, the md1 component sdb2 of the RAID is failing. Try a rebuild using mdadm: first remove the RAID component sdb2 from md1, and then add it back again. Sometimes the RAID parity fails due to improper shutdowns, and when you rebuild it is restored properly. If it does not, it is likely your disk is developing errors. Be aware! Carefully read man mdadm for details before you do anything.
I apologize. I guess the panic hit me :):)
Fortunately, after I removed the device:

mdadm /dev/md1 -r /dev/sdb2

and then re-added it:

mdadm /dev/md1 -a /dev/sdb2

it's rebuilding again. Pheew.. my first failed RAID event. Thanks.
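While the re-added member resyncs, /proc/mdstat shows a recovery line with a percentage. A minimal sketch of extracting that number, assuming the recovery-line shape below (the sample line is an assumption, pasted rather than read from a live rebuild):

```shell
# Assumed shape of the md recovery line during a RAID1 rebuild.
line='[==>..................]  recovery = 12.6% (9836928/77842880) finish=18.3min speed=61733K/sec'

# Pull out just the percentage figure.
pct=$(printf '%s\n' "$line" | sed -n 's/.*recovery = \([0-9.]*\)%.*/\1/p')
echo "rebuild at ${pct}%"
```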
On Tue, 14 Mar 2006, Fajar Priyanto wrote:
On Tuesday 14 March 2006 11:32 am, Sudev Barar wrote:
IMHO, the md1 component sdb2 of the RAID is failing. Try a rebuild using mdadm: first remove the RAID component sdb2 from md1, and then add it back again. Sometimes the RAID parity fails due to improper shutdowns, and when you rebuild it is restored properly. If it does not, it is likely your disk is developing errors. Be aware! Carefully read man mdadm for details before you do anything.
I apologize. I guess the panic hit me :):)
Fortunately, after I removed the device:

mdadm /dev/md1 -r /dev/sdb2

and then re-added it:

mdadm /dev/md1 -a /dev/sdb2

it's rebuilding again. Pheew.. my first failed RAID event.
If this is a normal SCSI disk (i.e. not SATA), I would use smartctl to check whether this disk has errors.
smartctl -a /dev/sdb
And if you weren't already running smartd, now is a good time to check the smartd configuration and verify it has all disks configured as well :)
[dag@lxrh002 dag]# cat /etc/smartd.conf
/dev/hda -H -m root@localhost.localdomain
/dev/hdc -H -m root@localhost.localdomain
and then enable it:
chkconfig smartd on
service smartd start
This will make sure you won't get any additional sudden surprises.
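For the SATA disks discussed later in this thread, smartd.conf also accepts a -d type directive per drive. A sketch of what the config might look like on such a box (the sda/sdb device names and the root mail address are assumptions, not taken from the original system):

```
# /etc/smartd.conf -- check overall health, mail root on trouble
/dev/sda -d ata -H -m root@localhost
/dev/sdb -d ata -H -m root@localhost
```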
Kind regards, -- dag wieers, dag@wieers.com, http://dag.wieers.com/ -- [all I want is a warm bed and a kind word and unlimited power]
On Tuesday 14 March 2006 02:47 pm, Dag Wieers wrote:
On Tue, 14 Mar 2006, Fajar Priyanto wrote:

If this is a normal SCSI disk (i.e. not SATA), I would use smartctl to check whether this disk has errors.
They are SATA disks :( So, when I run smartd, it says that it currently doesn't support SATA.
On Tue, 14 Mar 2006, Fajar Priyanto wrote:
On Tuesday 14 March 2006 02:47 pm, Dag Wieers wrote:
On Tue, 14 Mar 2006, Fajar Priyanto wrote:

If this is a normal SCSI disk (i.e. not SATA), I would use smartctl to check whether this disk has errors.
They are SATA disks :( So, when I run smartd, it says that it currently doesn't support SATA.
When you can bring the machine down, you might want to boot a recent kernel (>= 2.6.15) and test the disk with smartctl. (Knoppix probably will not work, but a recent live CD that has 2.6.15 and smartctl will.)
FC4 with the 2.6.15 kernel worked. You'll probably have to wait until EL5 for smartctl libata support in CentOS.
Kind regards, -- dag wieers, dag@wieers.com, http://dag.wieers.com/ -- [all I want is a warm bed and a kind word and unlimited power]
On Tue, 14 Mar 2006, Dag Wieers wrote:
On Tue, 14 Mar 2006, Fajar Priyanto wrote:
On Tuesday 14 March 2006 02:47 pm, Dag Wieers wrote:
On Tue, 14 Mar 2006, Fajar Priyanto wrote:

If this is a normal SCSI disk (i.e. not SATA), I would use smartctl to check whether this disk has errors.
They are SATA disks :( So, when I run smartd, it says that it currently doesn't support SATA.
When you can bring the machine down, you might want to boot a recent kernel (>= 2.6.15) and test the disk with smartctl. (Knoppix probably will not work, but a recent live CD that has 2.6.15 and smartctl will.)
FC4 with the 2.6.15 kernel worked. You'll probably have to wait until EL5 for smartctl libata support in CentOS.
As has emerged from the '[CentOS] SMART for SATA devices ?' thread, the just-released 2.6.9-34 kernel that ships with EL4 U3 supports SMART over libata (Red Hat backported this from 2.6.15), using something like:
smartctl -a -d ata /dev/sda
I'm mentioning it here for future Google reference :)
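To act on smartctl output from a script rather than eyeballing the whole report, the overall-health verdict can be extracted. A sketch against a pasted sample (the report excerpt is an assumption; real output varies by drive and smartmontools version):

```shell
# Assumed excerpt of `smartctl -a -d ata /dev/sda` output.
report='SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0'

# Keep only the verdict word from the overall-health line.
health=$(printf '%s\n' "$report" | sed -n 's/.*test result: //p')
echo "health: $health"
```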
Kind regards, -- dag wieers, dag@wieers.com, http://dag.wieers.com/ -- [all I want is a warm bed and a kind word and unlimited power]
I'd like to know of a way to deliberately (though non-hardware-destructively :) ) cause such failures for educational purposes. Can anybody suggest one?
2006/3/14, Fajar Priyanto fajarpri@cbn.net.id:
back again. Sometimes the RAID parity fails due to improper shutdowns and when you rebuild it is restored properly. If it does not it is
It's rebuilding again. Pheew.. my first failed RAID event.
Eduardo Grosclaude Universidad Nacional del Comahue Neuquen, Argentina
On Tuesday 14 March 2006 02:52 pm, Eduardo Grosclaude wrote:
I'd like to know of a way to deliberately (though non-hardware-destructively
:) ) cause such failures for educational purposes. Can anybody suggest one?
2006/3/14, Fajar Priyanto fajarpri@cbn.net.id:
back again. Sometimes the RAID parity fails due to improper shutdowns and when you rebuild it is restored properly. If it does not it is
It's rebuilding again. Pheew.. my first failed RAID event.
Well, I'm not sure what caused it. The server hasn't been rebooted. The kernel only said that at 5am this morning it encountered a read error on sdb2.
Quoting Fajar Priyanto fajarpri@cbn.net.id:
Well, I'm not sure what caused it. The server hasn't been rebooted. The kernel only said that at 5am this morning it encountered a read error on sdb2.
Most likely you have some bad sectors on it. As I wrote earlier, this is one of the first signs that something is wrong with the disk. Start planning the purchase of a new disk. You can run diagnostics on the disk (usually downloadable from the disk manufacturer's support web site) to check its health and reallocate any bad sectors found. However, in some cases, once you get the first bad sectors they just keep multiplying exponentially (if they are caused by loose particles flying around inside the disk as it spins).
If the disk is still under warranty, I'd contact the manufacturer immediately (like right now, this second). I once made the mistake of "fixing" the problem a couple of months before the warranty expired. It seemed to work OK, no more errors or such. The disk failed completely a couple of months later, just after the warranty expired. Of course, the manufacturer (Fujitsu, to name it) declined my warranty request (I'm not blaming them here; it was my stupid mistake not to contact them as soon as I saw the first error message logged by the kernel).
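The "bad sectors multiplying" pattern described above shows up in smartctl's attribute table as a growing Reallocated_Sector_Ct raw value. A sketch of pulling out that count, assuming the attribute-line layout below (the line itself is a made-up sample, not from the poster's disk):

```shell
# Assumed smartctl attribute line; the last field is the raw count of
# sectors the drive has already remapped.
attr='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12'

realloc=$(printf '%s\n' "$attr" | awk '{print $NF}')
echo "reallocated sectors: $realloc"
```

If that number keeps rising between checks, the advice above applies: replace the disk rather than keep "fixing" it.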
On Tue, 2006-03-14 at 01:52, Eduardo Grosclaude wrote:
I'd like to know of a way to deliberately (though non-hardware-destructively :) ) cause such failures for educational purposes. Can anybody suggest one?
mdadm /dev/mdn --fail /dev/hdnn

where the n's specify the array, drive, and partition. You have to do that before you can --remove it.
Quoting Eduardo Grosclaude eduardo.grosclaude@gmail.com:
I'd like to know of a way to deliberately (though non-hardware-destructively :) ) cause such failures for educational purposes. Can anybody suggest one?
Replace device names as needed:
# mdadm /dev/md0 -f /dev/sda1
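For a classroom version that never touches real disks, the whole fail/remove/re-add cycle can be rehearsed on a throwaway RAID1 built from loop devices. A sketch, assuming root privileges and that /dev/loop0, /dev/loop1 and /dev/md9 are unused on your machine (all assumptions); it is wrapped in a function so nothing runs until you call it:

```shell
# Educational fail/remove/re-add drill on disposable loop devices.
# Assumptions: run as root; loop0, loop1 and md9 are free; mdadm installed.
raid_fail_drill() {
    dd if=/dev/zero of=/tmp/d0.img bs=1M count=64 &&
    dd if=/dev/zero of=/tmp/d1.img bs=1M count=64 &&
    losetup /dev/loop0 /tmp/d0.img &&
    losetup /dev/loop1 /tmp/d1.img &&
    # build a 2-disk RAID1, then fail one member (triggers the mdadm Fail mail)
    mdadm --create /dev/md9 --level=1 --raid-devices=2 /dev/loop0 /dev/loop1 &&
    mdadm /dev/md9 --fail /dev/loop1 &&
    mdadm /dev/md9 --remove /dev/loop1 &&
    mdadm /dev/md9 --add /dev/loop1 &&
    cat /proc/mdstat &&
    # tear everything down again
    mdadm --stop /dev/md9 &&
    losetup -d /dev/loop0 && losetup -d /dev/loop1 &&
    rm -f /tmp/d0.img /tmp/d1.img
}
```

Calling raid_fail_drill on a scratch machine reproduces the Fail event mail and the (F)/[U_] mdstat output from the start of the thread without risking real data.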
Quoting Fajar Priyanto fajarpri@cbn.net.id:
I apologize. I guess the panic hit me :):)
Fortunately, after I removed the device:

mdadm /dev/md1 -r /dev/sdb2

and then re-added it:

mdadm /dev/md1 -a /dev/sdb2

it's rebuilding again. Pheew.. my first failed RAID event.
This can happen sometimes when the machine crashes. However, I would still recheck the log files to see if there were any warnings/errors on /dev/sdb (these are usually the very first signs that a disk might fail soon). Keep an eye on that drive, and recheck the logs (/var/log/messages) from time to time. If you see any errors reported on it, replace it immediately.
On Tuesday 14 March 2006 10:10 pm, Aleksandar Milivojevic wrote:
This can happen sometimes when the machine crashes. However, I would still recheck the log files to see if there were any warnings/errors on /dev/sdb (these are usually the very first signs that a disk might fail soon). Keep an eye on that drive, and recheck the logs (/var/log/messages) from time to time. If you see any errors reported on it, replace it immediately.
Yes, I will monitor the disk very closely. Thank you very much Aleksandar.
If you don't know how to hotadd/remove partitions, you really shouldn't be running a RAID array.
Get boned up quickly! It's not that hard! http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html
-Ben
On Monday 13 March 2006 19:23, Fajar Priyanto wrote:
Hi all,
This morning I received this notification from mdadm:

This is an automatically generated mail message from mdadm running on server-mail.mydomain.kom

A Fail event had been detected on md device /dev/md1.

Faithfully yours, etc.

In /proc/mdstat I see this:

Personalities : [raid1]
md1 : active raid1 sdb2[2](F) sda2[0]
      77842880 blocks [2/1] [U_]
md0 : active raid1 sdb1[1] sda1[0]
      305088 blocks [2/2] [UU]
unused devices: <none>
Please help me. What should I do? Thank you very much,
--
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial http://linux2.arinet.org
10:23:49 up 1:54, 2.6.15-1.1830_FC4 GNU/Linux
Let's use OpenOffice. http://www.openoffice.org
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos