Hi all,
This morning I received this notification from mdadm:

This is an automatically generated mail message from mdadm running on server-mail.mydomain.kom

A Fail event had been detected on md device /dev/md1.

Faithfully yours, etc.
In /proc/mdstat I see this:

Personalities : [raid1]
md1 : active raid1 sdb2[2](F) sda2[0]
      77842880 blocks [2/1] [U_]
md0 : active raid1 sdb1[1] sda1[0]
      305088 blocks [2/2] [UU]
unused devices: <none>
Please help me. What should I do? Thank you very much,
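For anyone reading this in the archives: in the /proc/mdstat output above, (F) marks the faulty member and [U_] means the first half of the mirror is up while the second is down. As a sketch (run against a pasted sample rather than a live /proc/mdstat, and assuming the 2006-era output format shown in the post), the faulty members can be pulled out with grep:

```shell
# Sample /proc/mdstat content copied from the post above (not read live).
mdstat='md1 : active raid1 sdb2[2](F) sda2[0]
      77842880 blocks [2/1] [U_]
md0 : active raid1 sdb1[1] sda1[0]
      305088 blocks [2/2] [UU]'

# Members flagged (F)aulty, e.g. "sdb2[2](F)" -> "sdb2".
failed=$(printf '%s\n' "$mdstat" | grep -o '[a-z0-9]*\[[0-9]*\](F)' | cut -d'[' -f1)
echo "failed members: $failed"
```

On a live system the same pipeline would read /proc/mdstat directly instead of the pasted sample.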
On 14/03/06, Fajar Priyanto fajarpri@cbn.net.id wrote:
Hi all,
This morning I received this notification from mdadm:

This is an automatically generated mail message from mdadm running on server-mail.mydomain.kom

A Fail event had been detected on md device /dev/md1.

Faithfully yours, etc.

In /proc/mdstat I see this:

Personalities : [raid1]
md1 : active raid1 sdb2[2](F) sda2[0]
      77842880 blocks [2/1] [U_]
md0 : active raid1 sdb1[1] sda1[0]
      305088 blocks [2/2] [UU]
IMHO, the md1 component sdb2 of the RAID is failing. Try a rebuild using mdadm: first remove the RAID component sdb2 from md1, and then add it back again. Sometimes the RAID parity fails due to improper shutdowns, and when you rebuild it is restored properly. If it does not, it is likely your disk is developing errors. Be aware! Carefully read man mdadm for details before you do anything. -- Sudev Barar Learning Linux
On Tuesday 14 March 2006 11:32 am, Sudev Barar wrote:
IMHO, the md1 component sdb2 of the RAID is failing. Try a rebuild using mdadm: first remove the RAID component sdb2 from md1, and then add it back again. Sometimes the RAID parity fails due to improper shutdowns, and when you rebuild it is restored properly. If it does not, it is likely your disk is developing errors. Be aware! Carefully read man mdadm for details before you do anything.
I apologize. I guess the panic hit me :):)
Fortunately, after I removed the device:

mdadm /dev/md1 -r /dev/sdb2

and then re-added it:

mdadm /dev/md1 -a /dev/sdb2

it's rebuilding again. Pheew.. my first failed RAID event. Thanks.
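While the re-added member resyncs, /proc/mdstat shows a recovery line with a percentage. A minimal sketch of extracting that number, assuming the recovery-line shape below (the sample line is an assumption, pasted rather than read from a live rebuild):

```shell
# Assumed shape of the md recovery line during a RAID1 rebuild.
line='[==>..................]  recovery = 12.6% (9836928/77842880) finish=18.3min speed=61733K/sec'

# Pull out just the percentage figure.
pct=$(printf '%s\n' "$line" | sed -n 's/.*recovery = \([0-9.]*\)%.*/\1/p')
echo "rebuild at ${pct}%"
```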
On Tue, 14 Mar 2006, Fajar Priyanto wrote:
On Tuesday 14 March 2006 11:32 am, Sudev Barar wrote:
IMHO, the md1 component sdb2 of the RAID is failing. Try a rebuild using mdadm: first remove the RAID component sdb2 from md1, and then add it back again. Sometimes the RAID parity fails due to improper shutdowns, and when you rebuild it is restored properly. If it does not, it is likely your disk is developing errors. Be aware! Carefully read man mdadm for details before you do anything.
I apologize. I guess the panic hit me :):)
Fortunately, after I removed the device:

mdadm /dev/md1 -r /dev/sdb2

and then re-added it:

mdadm /dev/md1 -a /dev/sdb2

it's rebuilding again. Pheew.. my first failed RAID event.
If this is a normal SCSI disk (i.e. not SATA), I would use smartctl to check whether this disk has errors.
smartctl -a /dev/sdb
And if you weren't already running smartd, now is a good time to check the smartd configuration and verify it has all disks configured as well :)
[dag@lxrh002 dag]# cat /etc/smartd.conf
/dev/hda -H -m root@localhost.localdomain
/dev/hdc -H -m root@localhost.localdomain
and then enable it:
chkconfig smartd on
service smartd start
This will make sure you won't get any additional sudden surprises.
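For the SATA disks discussed later in this thread, smartd.conf also accepts a -d type directive per drive. A sketch of what the config might look like on such a box (the sda/sdb device names and the root mail address are assumptions, not taken from the original system):

```
# /etc/smartd.conf -- check overall health, mail root on trouble
/dev/sda -d ata -H -m root@localhost
/dev/sdb -d ata -H -m root@localhost
```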
Kind regards, -- dag wieers, dag@wieers.com, http://dag.wieers.com/ -- [all I want is a warm bed and a kind word and unlimited power]
On Tuesday 14 March 2006 02:47 pm, Dag Wieers wrote:
On Tue, 14 Mar 2006, Fajar Priyanto wrote:

If this is a normal SCSI disk (i.e. not SATA), I would use smartctl to check whether this disk has errors.
They are SATA disks :( So, when I run smartd, it says that it currently doesn't support SATA.
On Tue, 14 Mar 2006, Fajar Priyanto wrote:
On Tuesday 14 March 2006 02:47 pm, Dag Wieers wrote:
On Tue, 14 Mar 2006, Fajar Priyanto wrote:

If this is a normal SCSI disk (i.e. not SATA), I would use smartctl to check whether this disk has errors.
They are SATA disks :( So, when I run smartd, it says that it currently doesn't support SATA.
When you can bring the machine down, you might want to boot a recent kernel (>= 2.6.15) and test the disk with smartctl. (Knoppix probably will not work, but a recent live CD that has 2.6.15 and smartctl will.)
FC4 with the 2.6.15 kernel worked. You'll probably have to wait until EL5 for smartctl libata support in CentOS.
Kind regards, -- dag wieers, dag@wieers.com, http://dag.wieers.com/ -- [all I want is a warm bed and a kind word and unlimited power]
On Tue, 14 Mar 2006, Dag Wieers wrote:
On Tue, 14 Mar 2006, Fajar Priyanto wrote:
On Tuesday 14 March 2006 02:47 pm, Dag Wieers wrote:
On Tue, 14 Mar 2006, Fajar Priyanto wrote:

If this is a normal SCSI disk (i.e. not SATA), I would use smartctl to check whether this disk has errors.
They are SATA disks :( So, when I run smartd, it says that it currently doesn't support SATA.
When you can bring the machine down, you might want to boot a recent kernel (>= 2.6.15) and test the disk with smartctl. (Knoppix probably will not work, but a recent live CD that has 2.6.15 and smartctl will.)
FC4 with the 2.6.15 kernel worked. You'll probably have to wait until EL5 for smartctl libata support in CentOS.
As has emerged from the '[CentOS] SMART for SATA devices ?' thread, the just-released 2.6.9-34 kernel that ships with EL4 U3 supports SMART over libata (Red Hat backported this from 2.6.15), using something like:
smartctl -a -d ata /dev/sda
I'm mentioning it here for future Google reference :)
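To act on smartctl output from a script rather than eyeballing the whole report, the overall-health verdict can be extracted. A sketch against a pasted sample (the report excerpt is an assumption; real output varies by drive and smartmontools version):

```shell
# Assumed excerpt of `smartctl -a -d ata /dev/sda` output.
report='SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0'

# Keep only the verdict word from the overall-health line.
health=$(printf '%s\n' "$report" | sed -n 's/.*test result: //p')
echo "health: $health"
```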
Kind regards, -- dag wieers, dag@wieers.com, http://dag.wieers.com/ -- [all I want is a warm bed and a kind word and unlimited power]
I'd like to know of a way to deliberately (though non-hardware-destructively :) ) cause such failures for educational purposes. Can anybody suggest one?
2006/3/14, Fajar Priyanto fajarpri@cbn.net.id:
back again. Sometimes the RAID parity fails due to improper shutdowns and when you rebuild it is restored properly. If it does not it is
It's rebuilding again. Pheew.. my first failed RAID event.
Eduardo Grosclaude Universidad Nacional del Comahue Neuquen, Argentina
On Tuesday 14 March 2006 02:52 pm, Eduardo Grosclaude wrote:
I'd like to know of a way to deliberately (though non-hardware-destructively
:) ) cause such failures for educational purposes. Can anybody suggest one?
2006/3/14, Fajar Priyanto fajarpri@cbn.net.id:
back again. Sometimes the RAID parity fails due to improper shutdowns and when you rebuild it is restored properly. If it does not it is
It's rebuilding again. Pheew.. my first failed RAID event.
Well, I'm not sure what caused it. The server hasn't been rebooted. The kernel only said that at 5am this morning it encountered a read error on sdb2.
Quoting Fajar Priyanto fajarpri@cbn.net.id:
Well, I'm not sure what caused it. The server hasn't been rebooted. The kernel only said that at 5am this morning it encountered a read error on sdb2.
Most likely you have some bad sectors on it. As I wrote earlier, this is one of the first signs that something is wrong with the disk. Start planning the purchase of a new disk. You can run diagnostics on the disk (usually downloadable from the disk manufacturer's support web site) to check its health and reallocate any bad sectors found. However, in some cases, once you get the first bad sectors they just keep multiplying exponentially (if they are caused by loose particles flying around inside the disk as it spins).
If the disk is still under warranty, I'd contact the manufacturer immediately (like right now, this second). I once made the mistake of "fixing" the problem a couple of months before the warranty expired. It seemed to work OK, no more errors or such. The disk failed completely a couple of months later, just after the warranty expired. Of course, the manufacturer (Fujitsu, to name it) declined my warranty request (I'm not blaming them here; it was my stupid mistake not to contact them as soon as I saw the first error message logged by the kernel).
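The "bad sectors multiplying" pattern described above shows up in smartctl's attribute table as a growing Reallocated_Sector_Ct raw value. A sketch of pulling out that count, assuming the attribute-line layout below (the line itself is a made-up sample, not from the poster's disk):

```shell
# Assumed smartctl attribute line; the last field is the raw count of
# sectors the drive has already remapped.
attr='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12'

realloc=$(printf '%s\n' "$attr" | awk '{print $NF}')
echo "reallocated sectors: $realloc"
```

If that number keeps rising between checks, the advice above applies: replace the disk rather than keep "fixing" it.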
On Tue, 2006-03-14 at 01:52, Eduardo Grosclaude wrote:
I'd like to know of a way to deliberately (though non-hardware-destructively :) ) cause such failures for educational purposes. Can anybody suggest one?
mdadm /dev/mdn --fail /dev/hdnn

where the n's specify the array, drive, and partition. You have to do that before you can --remove it.
Quoting Eduardo Grosclaude eduardo.grosclaude@gmail.com:
I'd like to know of a way to deliberately (though non-hardware-destructively :) ) cause such failures for educational purposes. Can anybody suggest one?
Replace device names as needed:
# mdadm /dev/md0 -f /dev/sda1
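For a classroom version that never touches real disks, the whole fail/remove/re-add cycle can be rehearsed on a throwaway RAID1 built from loop devices. A sketch, assuming root privileges and that /dev/loop0, /dev/loop1 and /dev/md9 are unused on your machine (all assumptions); it is wrapped in a function so nothing runs until you call it:

```shell
# Educational fail/remove/re-add drill on disposable loop devices.
# Assumptions: run as root; loop0, loop1 and md9 are free; mdadm installed.
raid_fail_drill() {
    dd if=/dev/zero of=/tmp/d0.img bs=1M count=64 &&
    dd if=/dev/zero of=/tmp/d1.img bs=1M count=64 &&
    losetup /dev/loop0 /tmp/d0.img &&
    losetup /dev/loop1 /tmp/d1.img &&
    # build a 2-disk RAID1, then fail one member (triggers the mdadm Fail mail)
    mdadm --create /dev/md9 --level=1 --raid-devices=2 /dev/loop0 /dev/loop1 &&
    mdadm /dev/md9 --fail /dev/loop1 &&
    mdadm /dev/md9 --remove /dev/loop1 &&
    mdadm /dev/md9 --add /dev/loop1 &&
    cat /proc/mdstat &&
    # tear everything down again
    mdadm --stop /dev/md9 &&
    losetup -d /dev/loop0 && losetup -d /dev/loop1 &&
    rm -f /tmp/d0.img /tmp/d1.img
}
```

Calling raid_fail_drill on a scratch machine reproduces the Fail event mail and the (F)/[U_] mdstat output from the start of the thread without risking real data.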
Quoting Fajar Priyanto fajarpri@cbn.net.id:
I apologize. I guess the panic hit me :):)
Fortunately, after I removed the device:

mdadm /dev/md1 -r /dev/sdb2

and then re-added it:

mdadm /dev/md1 -a /dev/sdb2

it's rebuilding again. Pheew.. my first failed RAID event.
This can happen sometimes when the machine crashes. However, I would still recheck the log files to see if there were any warnings/errors on /dev/sdb (these are usually the very first signs that a disk might fail soon). Keep an eye on that drive, and recheck the logs (/var/log/messages) from time to time. If you see any errors reported on it, replace it immediately.
On Tuesday 14 March 2006 10:10 pm, Aleksandar Milivojevic wrote:
This can happen sometimes when the machine crashes. However, I would still recheck the log files to see if there were any warnings/errors on /dev/sdb (these are usually the very first signs that a disk might fail soon). Keep an eye on that drive, and recheck the logs (/var/log/messages) from time to time. If you see any errors reported on it, replace it immediately.
Yes, I will monitor the disk very closely. Thank you very much Aleksandar.
If you don't know how to hotadd/remove partitions, you really shouldn't be running a RAID array.
Get boned up quickly! It's not that hard! http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html
-Ben
On Monday 13 March 2006 19:23, Fajar Priyanto wrote:
Hi all,
This morning I received this notification from mdadm:

This is an automatically generated mail message from mdadm running on server-mail.mydomain.kom

A Fail event had been detected on md device /dev/md1.

Faithfully yours, etc.

In /proc/mdstat I see this:

Personalities : [raid1]
md1 : active raid1 sdb2[2](F) sda2[0]
      77842880 blocks [2/1] [U_]
md0 : active raid1 sdb1[1] sda1[0]
      305088 blocks [2/2] [UU]
unused devices: <none>
Please help me. What should I do? Thank you very much,
--
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial http://linux2.arinet.org
10:23:49 up 1:54, 2.6.15-1.1830_FC4 GNU/Linux
Let's use OpenOffice. http://www.openoffice.org
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos