We have CentOS 4.x on a Dell server. One of its virtual disks consists of 4 disks configured as RAID 5 (plus one more disk as a hot spare). I saw the following in the /var/log/messages file:
Aug 4 06:27:02 host1 Server Administrator: Storage Service EventID: 2094 Predictive Failure reported: Physical Disk 1:5 Controller 0, Connector 1
Aug 4 06:27:02 host1 Server Administrator: Storage Service EventID: 2051 Physical disk degraded: Physical Disk 1:5 Controller 0, Connector 1
I used DELL OPMN to check and found that "disk 1:5" is still "online", but shows "predictive failure".
I also used DELL OPMN to check the virtual disk, and it shows "online", not "degraded".
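For anyone who wants to watch for these entries automatically, here is a minimal sketch in Python that pulls the event ID, message and disk ID out of lines shaped like the two quoted above. The regular expression is modeled only on those two entries, and the function name find_disk_events is made up for illustration; other Server Administrator messages may well be laid out differently.

```python
import re

# Pattern modeled on the two log entries quoted in this post (an assumption:
# other Server Administrator messages may use a different layout).
EVENT_RE = re.compile(
    r"Storage Service EventID: (?P<event_id>\d+) "
    r"(?P<message>[^:]+): "
    r"Physical Disk (?P<disk>\d+:\d+) "
    r"Controller (?P<controller>\d+), Connector (?P<connector>\d+)"
)


def find_disk_events(lines):
    """Return one dict per physical-disk storage event found in the lines."""
    events = []
    for line in lines:
        # finditer also copes with several events fused onto one line.
        for match in EVENT_RE.finditer(line):
            events.append(match.groupdict())
    return events


# The two entries from /var/log/messages quoted above.
sample = [
    "Aug 4 06:27:02 host1 Server Administrator: Storage Service EventID: 2094 "
    "Predictive Failure reported: Physical Disk 1:5 Controller 0, Connector 1",
    "Aug 4 06:27:02 host1 Server Administrator: Storage Service EventID: 2051 "
    "Physical disk degraded: Physical Disk 1:5 Controller 0, Connector 1",
]

for event in find_disk_events(sample):
    print(event["event_id"], event["disk"], event["message"])
```

In practice the list of lines would come from reading /var/log/messages instead of the hard-coded sample.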
My questions are:
1. Is this disk really "degraded" or not?
2. How can the O.S. predict that a disk is going to fail?
3. Do I need to replace this disk now?
Thanks.
On 08/04/2009 01:27 PM, mcclnx mcc wrote: ...
- Is this disk really "degraded" or not?
Disks aren't in degraded mode. The RAID system will run in degraded mode when the disk eventually fails.
- How can the O.S. predict that a disk is going to fail?
The disk's SMART feature tells it so.
- Do I need to replace this disk now?
That would be a good idea; the disk could fail in 5 minutes or in 5 months, you can't tell.
Mogens
On Tue, Aug 04, 2009 at 01:38:27PM +0200, Mogens Kjaer wrote:
- Do I need to replace this disk now?
That would be a good idea; the disk could fail in 5 minutes or in 5 months, you can't tell.
Or, indeed, 5 years. I have a number of "throwaway" workstations at one customer site -- throwaway in that if the disk or system fails, we just rebuild it, and away it goes. Several have been telling me about SMART warnings for YEARS. My experience seems to echo the Google study from a few years back, where SMART wasn't an accurate predictor of disk failure -- some drives report SMART warnings and then fail, some report them for years, and some just fail without any warning.
So the answer is "it depends". If getting a replacement is likely to be tricky (i.e. more than a two or three hour wait), or if the data being stored is highly valuable, then AT LEAST get a spare on site and sit it next to or on the system in question. If the data is extremely valuable, do the swap now.
But if you don't care about the data, and/or can tolerate some downtime, don't worry about it.
Backups *are* good, right? :)
From: mcclnx mcc mcclnx@yahoo.com.tw
- How can the O.S. predict that a disk is going to fail?
As I understand it, disks can handle a certain number of bad 'sectors', thanks to some hidden extra space. When a 'sector' fails, the disk marks it as 'bad' and then maps it to a 'sector' from the hidden space. As this extra space is not infinite, after a few months/years, there won't be any spare 'sectors' left. So, the disk says to the OS: "I will soon fail!"; as in "I am running out of spare 'sectors' so I won't be able to cope with bad 'sectors'". Not sure if that is the case for all vendors...
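To make the idea concrete, here is a toy model in Python of a drive that remaps bad sectors into a reserved pool and raises a warning when the pool runs low. It is only a sketch of the description above: the class name, the pool size and the warning threshold are all invented for illustration, and real firmware is vendor-specific.

```python
class SpareSectorPool:
    """Toy model of a drive remapping bad sectors into a reserved area."""

    def __init__(self, spare_total, warn_below):
        self.spare_left = spare_total  # hypothetical number of spare sectors
        self.warn_below = warn_below   # hypothetical warning threshold
        self.remapped = {}             # bad sector number -> spare slot index

    def mark_bad(self, sector):
        """Remap a failed sector; return False if no spare sector is left."""
        if sector in self.remapped:
            return True                # already remapped, nothing to do
        if self.spare_left == 0:
            return False               # pool exhausted, cannot cope any more
        self.remapped[sector] = len(self.remapped)
        self.spare_left -= 1
        return True

    def predicts_failure(self):
        """True once the remaining spares drop below the warning threshold."""
        return self.spare_left < self.warn_below


pool = SpareSectorPool(spare_total=8, warn_below=3)
for bad in (1042, 77, 9001, 1042, 5, 300, 12):  # 1042 is reported twice
    pool.mark_bad(bad)
print(pool.spare_left, pool.predicts_failure())
```

Six distinct sectors are remapped here, which leaves two spares out of eight, so the model starts "predicting failure" while the disk is still fully usable.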
JD
2009/8/4 mcclnx mcc mcclnx@yahoo.com.tw: [snip]
my questions are:
Is this disk really "degraded" or not?
How can the O.S. predict that a disk is going to fail?
Do I need to replace this disk now?
I understand that the drive electronics can check things such as the time it takes to read a sector, the number of failures per read, retries before success, etc. This information gets processed and reported to the OS via SMART, as some others have replied.
But I really just want to say that within one day of getting SMART errors, my disk failed.
On Tuesday 04 August 2009 14:48:03 Kwan Lowe wrote:
I understand that the drive electronics can check things such as the time it takes to read a sector, the number of failures per read, retries before success, etc. This information gets processed and reported to the OS via SMART, as some others have replied.
But I really just want to say that within one day of getting SMART errors, my disk failed.
FWIW, SMART started noting problems on my laptop drive, and the manufacturer accepted a SMART report as sufficient to warrant a replacement drive being fitted under warranty.
Anne
Disks are cheap, your data is not. Replace the disk without hesitation.
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
mcclnx mcc wrote:
- Is this disk really "degraded" or not?
Depends on your point of view; to me it would be. I remember two situations with "predictive" failure on HP Smart arrays a few years ago where the drives were practically dead but the controller kept using them, dragging performance down by something like 90%. The drives were detected as about to fail, but there was no way to remove/disable the disk from the array remotely, so we had to send someone on site to yank the disk to force the array to rebuild. HP later said a firmware update should fix the issue, but we never got around to upgrading it before we migrated off those systems onto a real SAN.
- How can the O.S. predict that a disk is going to fail?
In this case it's not the OS, it's the controller that is keeping track of a bunch of internal counters on the disk and perhaps even scrubbing it every so often. If # of soft errors exceeds a threshold it triggers the predictive failure logic.
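As a rough illustration of that threshold logic, the Python sketch below flags every disk whose soft-error counter has gone over a limit. The counter values, the threshold of 20 and the function name are made up for the example; an actual controller keeps these counters in firmware and picks its own limits.

```python
def disks_over_threshold(soft_errors, threshold):
    """Return the IDs of disks whose soft-error count exceeds the threshold."""
    return sorted(
        disk for disk, count in soft_errors.items() if count > threshold
    )


# Invented snapshot of per-disk soft-error counters (not real controller data).
counters = {"1:2": 0, "1:3": 4, "1:4": 1, "1:5": 57}

for disk in disks_over_threshold(counters, threshold=20):
    print("predictive failure: physical disk", disk)
```

Note that a flagged disk is still online and still serving I/O at this point, which is why the virtual disk can show "online" at the same time.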
- Do I need to replace this disk now?
Based on my past experience, yes. For comparison, any enterprise storage array's support contract will trigger an immediate replacement if the array detects that condition.
nate
2009/8/4 mcclnx mcc mcclnx@yahoo.com.tw:
How can the O.S. predict that a disk is going to fail?
There are several possibilities, the most likely being that the drive's predict-fail bit flipped, or the # of sectors available for reallocation dropped below the manufacturer's recommended threshold. The following links have additional information on drive failures, and how to use SMART data to look at drive health:
http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
http://prefetch.net/articles/diskdrives.smart.html
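As a small sketch of what looking at SMART data can mean in practice, the Python below scans an attribute table laid out the way `smartctl -A` (from smartmontools) typically prints it and lists the attributes whose normalized VALUE is at or below THRESH. The sample table is invented for illustration, the function name is hypothetical, and on a disk behind a RAID controller the SMART data may not be readable directly from the OS.

```python
def failing_attributes(table_text):
    """Return names of attributes whose normalized VALUE is at or below THRESH."""
    failing = []
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header line and any blank lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        if value <= thresh:
            failing.append(name)
    return failing


# Invented sample in the usual attribute-table layout (not from a real drive).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   030   030   036    Pre-fail  Always   FAILING_NOW 2890
194 Temperature_Celsius     0x0022   035   045   000    Old_age   Always       -       35
"""

print(failing_attributes(SAMPLE))
```

Here only the reallocated-sector attribute has dropped to its threshold, which matches the "running out of sectors for reallocation" case described above.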
Thanks,
- Ryan
--
http://prefetch.net