My wife's box has a very intermittent problem when booting from the Maxtor IDE hard drive. This has been going on for about 2 1/2 years.... The box is a Compaq EVO D300v for the Enterprise. When it boots, there is a SMART advisory from the BIOS that says failure is imminent. Occasionally it will not boot, because the BIOS does not see the hard drive. I replaced the EIDE cable, but the problem of sometimes not seeing the hard drive on boot continues. I suspect it has to do with something loose in the electronics of the drive, because if I press on both ends of the EIDE cable, the problem goes away and then it will boot OK.

The box is currently M$ Windows only. I just booted it from my Knoppix Live CD and ran smartctl on it. Below are the results. When I ran the Maxtor Diagnostics on the hard drive, 3 times, each time the quick 90 second SMART test said that I should run the Read test, which takes about one hour. Each time I ran the Read test, it passed OK. 3 times.

Should I suggest to my wife that she let me replace the hard drive now, at her convenience, before it fails completely? Is there anything in the smartctl results that indicates that is not the appropriate thing to do, considering the length of time this problem has existed? I did not run the Maxtor Burn In test or low level format, because I do not want to reinstall everything on this hard drive. The smartctl results certainly seem to indicate something badly awry, which the Maxtor Diagnostics, on the Read only test, did not pick up.

TIA!
Lanny
root@Knoppix:~# smartctl -d ata -H /dev/hda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 10 Spin_Retry_Count        0x002b   222   215   223    Pre-fail  Always   FAILING_NOW 29
root@Knoppix:~# smartctl -d ata -a /dev/hda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Maxtor DiamondMax D540X-4D
Device Model:     Maxtor 4D080H4
Serial Number:    D40SBVYE
Firmware Version: DAH017K0
User Capacity:    81,964,302,336 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 0
Local Time is:    Sun May 24 17:44:28 2009 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  64) The previous self-test completed having
                                        a test element that failed and the test
                                        element that failed is not known.
Total time to complete Offline
data collection:                 (  30) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  51) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   202   199   063    Pre-fail  Always   -           18883
  4 Start_Stop_Count        0x0032   252   252   000    Old_age   Always   -           2809
  5 Reallocated_Sector_Ct   0x0033   239   239   063    Pre-fail  Always   -           37
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline  -           0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always   -           0
  8 Seek_Time_Performance   0x0027   250   243   187    Pre-fail  Always   -           47480
  9 Power_On_Minutes        0x0032   253   250   000    Old_age   Always   -           0h+18m
 10 Spin_Retry_Count        0x002b   222   215   223    Pre-fail  Always   FAILING_NOW 29
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   249   249   000    Old_age   Always   -           1722
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always   -           0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always   -           0
194 Unknown_Attribute       0x0032   253   253   000    Old_age   Always   -           0
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always   -           24
196 Reallocated_Event_Count 0x0008   251   251   000    Old_age   Offline  -           2
197 Current_Pending_Sector  0x0008   253   249   000    Old_age   Offline  -           0
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline  -           0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always   -           0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always   -           0
202 TA_Increase_Count       0x000a   253   251   000    Old_age   Always   -           0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always   -           0
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always   -           0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always   -           0
207 Spin_High_Current       0x002a   239   235   000    Old_age   Always   -           13
208 Spin_Buzz               0x002a   245   242   000    Old_age   Always   -           8
209 Offline_Seek_Performnce 0x0024   253   253   000    Old_age   Offline  -           0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline  -           0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline  -           0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline  -           0
SMART Error Log Version: 1
Warning: ATA error count 3379 inconsistent with error log pointer 5

ATA Error Count: 3379 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 3379 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 01 a5 5a a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  08 d6 01 01 a5 5a a0 02      03:14:14.480  DEVICE RESET
  b0 d6 01 9f 4f c2 a0 00      03:12:29.984  SMART WRITE LOG
  b0 d5 01 9f 4f c2 a0 00      03:12:29.968  SMART READ LOG
  b0 d6 01 50 4f c2 a0 00      03:12:26.512  SMART WRITE LOG
  b0 d9 01 00 4f c2 a0 00      03:12:26.480  SMART DISABLE OPERATIONS
Error 3378 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 0b 4f c2 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d6 01 9f 4f c2 a0 00      03:12:29.984  SMART WRITE LOG
  b0 d5 01 9f 4f c2 a0 00      03:12:29.968  SMART READ LOG
  b0 d6 01 50 4f c2 a0 00      03:12:26.512  SMART WRITE LOG
  b0 d9 01 00 4f c2 a0 00      03:12:26.480  SMART DISABLE OPERATIONS
  b0 d6 01 50 4f c2 a0 00      03:12:26.416  SMART WRITE LOG
Error 3377 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 0b 4f c2 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d5 01 9f 4f c2 a0 00      03:12:29.968  SMART READ LOG
  b0 d6 01 50 4f c2 a0 00      03:12:26.512  SMART WRITE LOG
  b0 d9 01 00 4f c2 a0 00      03:12:26.480  SMART DISABLE OPERATIONS
  b0 d6 01 50 4f c2 a0 00      03:12:26.416  SMART WRITE LOG
  41 ff 00 00 b9 8a e9 00      03:12:26.416  READ VERIFY SECTOR(S) [OBS-5]
Error 3376 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 0b 4f c2 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d6 01 50 4f c2 a0 00      03:12:26.512  SMART WRITE LOG
  b0 d9 01 00 4f c2 a0 00      03:12:26.480  SMART DISABLE OPERATIONS
  b0 d6 01 50 4f c2 a0 00      03:12:26.416  SMART WRITE LOG
  41 ff 00 00 b9 8a e9 00      03:12:26.416  READ VERIFY SECTOR(S) [OBS-5]
  41 ff 00 00 b8 8a e9 00      03:12:26.400  READ VERIFY SECTOR(S) [OBS-5]
Error 3375 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 0b 4f c2 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d6 01 50 4f c2 a0 00      03:12:26.416  SMART WRITE LOG
  41 ff 00 00 b9 8a e9 00      03:12:26.416  READ VERIFY SECTOR(S) [OBS-5]
  41 ff 00 00 b8 8a e9 00      03:12:26.400  READ VERIFY SECTOR(S) [OBS-5]
  41 ff 00 00 b7 8a e9 00      03:12:26.400  READ VERIFY SECTOR(S) [OBS-5]
  41 ff 00 00 b6 8a e9 00      03:12:26.400  READ VERIFY SECTOR(S) [OBS-5]
SMART Self-test log structure revision number 1
Num  Test_Description    Status                      Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: unknown failure     00%            0         -
# 2  Short offline       Completed: unknown failure     00%            0         -
# 3  Short offline       Completed: unknown failure     00%          997         -
# 4  Short offline       Completed without error        00%          905         -
# 5  Short offline       Completed without error        00%          664         -
# 6  Short offline       Completed without error        00%          664         -
# 7  Short offline       Completed: unknown failure     00%            1         -
# 8  Short offline       Completed: unknown failure     00%            9         -
# 9  Short offline       Completed: unknown failure     00%          215         -
#10  Short offline       Completed without error        00%          215         -
#11  Extended offline    Completed without error        00%          213         -
#12  Short offline       Completed: read failure        60%          187         80417451
#13  Extended offline    Completed: read failure        20%          184         80417451
#14  Short offline       Completed without error        00%          181         -
#15  Extended offline    Completed: read failure        20%          151         80417451
#16  Short offline       Completed without error        00%          151         -
#17  Short offline       Completed without error        00%          139         -
#18  Short offline       Completed: read failure        60%           45         208052
#19  Short offline       Completed without error        00%            5         -
#20  Extended offline    Completed without error        00%            4         -
#21  Short offline       Completed without error        00%            3         -

Device does not support Selective Self Tests/Logging
root@Knoppix:~#
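A side note on the error-log legend above: the 49.710-day wrap of Powered_Up_Time is what one would expect from an unsigned 32-bit count of milliseconds. That interpretation is an assumption made here for illustration, not something the output states. A minimal arithmetic check:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000   # milliseconds in one day

# Assumption: Powered_Up_Time is held in an unsigned 32-bit millisecond counter.
wrap_days = 2**32 / MS_PER_DAY

print(f"{wrap_days:.3f}")
```

The result matches the figure quoted in the legend to three decimal places, so timestamps such as 03:12:26.480 are only meaningful relative to the most recent power-on.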
On Sun, May 24, 2009 at 3:10 PM, Lanny Marcus lmmailinglists@gmail.com wrote:
> My wife's box has a very intermittent problem,

Sounds kinda personal to me....

> when booting from the Maxtor IDE hard drive. This has been going on for about 2 1/2

Oh. Never mind (bad humor).

> years.... The box is a Compaq EVO D300v for the Enterprise. When it boots, there is a SMART advisory from the BIOS that says failure is imminent. Occasionally, it will not boot, because the BIOS does not see the hard drive.
<snip>
Seriously, now....
If there's a hardware problem with the drive, like loose connections, I'd get rid of it. I would have suggested a warranty swap-out, which you might be able to do if you still have the receipt or you registered the purchase (and sometimes even if not - check with Maxtor's web site), but it sounds like the drive may be out of warranty. Like I said, though, check with Maxtor - some of their older drives had 5 year warranties, and they have a general warranty period for all of their disks that yours might fall within.
I don't trust the SMART advisories all the time, mainly because I have a 1+ year old Seagate SATA drive that gets a smartd error every 30 minutes when the checks are performed. I have done numerous tests, including the ones supplied by Seagate in their Linux seatools package, and they all say that the drive is fine. (The error is suspicious anyway - it claims that there are 4294967295 unreadable or offline sectors, which is way more than the drive could possibly have, but that also just happens to be 0xFFFFFFFF.... Since I'm running an AMD 64x2, I'd bet that it's a 32-64 bit compatibility issue with the drive itself. Neither of my other two disks, which are also SATA, get any errors, and they're both WDs.) That said, yours look much more serious.
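As a quick sanity check on that suspicious number, here is a minimal sketch. The 512-byte sector size is an assumption for illustration; the smartd message quoted above does not state it.

```python
REPORTED = 4294967295       # sector count quoted from the smartd message
SECTOR_BYTES = 512          # assumed sector size, not stated in the post

# The reported count is exactly the all-ones 32-bit pattern.
is_all_ones = REPORTED == 0xFFFFFFFF == 2**32 - 1

# Size that many sectors would occupy, in TiB.
implied_tib = REPORTED * SECTOR_BYTES / 2**40

print(is_all_ones, round(implied_tib, 2))
```

Under that assumption the "bad" sectors alone would add up to about 2 TiB, which supports reading the value as a placeholder or overflow rather than a real count.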
Conclusion: you are showing too many points of failure to warrant keeping the drive. Try to get a warranty replacement if you can, and if that doesn't work, just get a new drive. Since warranty replacements are usually refurbished disks, and the new warranty ends at the same time, you'd probably be better off with a new disk.
Them's my $0.04 (inflation, y'know).
HTH
mhr
On Sun, 24 May 2009 15:29:50 -0700 MHR wrote:
> Conclusion: you are showing too many points of failure to warrant keeping the drive. Try to get a warranty replacement if you can, and if that doesn't work, just get a new drive. Since warranty replacements are usually refurbished disks, and the new warranty ends at the same time, you'd probably be better off with a new disk.
Yes, but back up anything still recoverable from the old drive before trashing it!
On 25.05.2009 at 00:10, Lanny Marcus wrote:
> My wife's box has a very intermittent problem, when booting from the Maxtor IDE hard drive. This has been going on for about 2 1/2 years....
What stopped you from replacing it 2.5 years ago, BTW?
What's the warranty policy for Maxtor (now Seagate?) OEM drives?
There's drives and there's OEM drives...
Rainer
Lanny Marcus wrote:
> My wife's box has a very intermittent problem, when booting from the Maxtor IDE hard drive. This has been going on for about 2 1/2 years.... The box is a Compaq EVO D300v for the Enterprise. When it boots, there is a SMART advisory from the BIOS that says failure is imminent. Occasionally, it will not boot, because the BIOS does not see the hard drive. I replaced the EIDE cable, but the problem of sometimes not seeing the hard drive on boot continues. I suspect it has to do with something loose in the electronics of the drive, because if I press on both ends of the EIDE cable, the problem goes away and then it will boot OK.
>
> [SNIP]
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: FAILED!
> Drive failure expected in less than 24 hours. SAVE ALL DATA.
> Failed Attributes:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>  10 Spin_Retry_Count        0x002b   222   215   223    Pre-fail  Always   FAILING_NOW 29
A spin-up failure could be caused by a weak power supply or a power connector that is not making good contact. It is not likely to be a problem with the EIDE cable. Note that even if you do correct the problem, the SMART advisory will likely remain due to the accumulated failure count, but the boot failures should stop.
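To make the FAILING_NOW flag concrete: smartctl reports an attribute as failed when its normalized VALUE is less than or equal to THRESH. The sketch below applies that rule to three rows copied from the posted attribute table. The helper name `failing_attributes` and the whitespace-split parsing are illustrative assumptions, not part of smartmontools.

```python
# Rows copied from the posted `smartctl -a` attribute table.
ROWS = """\
  3 Spin_Up_Time            0x0027   202   199   063    Pre-fail  Always   -           18883
  5 Reallocated_Sector_Ct   0x0033   239   239   063    Pre-fail  Always   -           37
 10 Spin_Retry_Count        0x002b   222   215   223    Pre-fail  Always   FAILING_NOW 29
"""

def failing_attributes(table):
    """Return (name, value, thresh) for rows whose VALUE <= THRESH.

    Hypothetical helper: assumes whitespace-separated columns in the
    order ID, NAME, FLAG, VALUE, WORST, THRESH, ...
    """
    failing = []
    for line in table.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].isdigit():
            continue  # skip headers and blank lines
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        if value <= thresh:
            failing.append((name, value, thresh))
    return failing

print(failing_attributes(ROWS))
```

Only Spin_Retry_Count trips the rule: its current value of 222 sits one point below the threshold of 223, and its WORST of 215 shows it has been lower still.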
On Sun, 2009-05-24 at 23:39 -0500, Robert Nichols wrote:
> Lanny Marcus wrote:
> > My wife's box has a very intermittent problem, when booting from the Maxtor IDE hard drive. This has been going on for about 2 1/2 years.... The box is a Compaq EVO D300v for the Enterprise. When it boots, there is a SMART advisory from the BIOS that says failure is imminent. Occasionally, it will not boot, because the BIOS does not see the hard drive. I replaced the EIDE cable, but the problem of sometimes not seeing the hard drive on boot continues. I suspect it has to do with something loose in the electronics of the drive, because if I press on both ends of the EIDE cable, the problem goes away and then it will boot OK.
> >
> > [SNIP]
> >
> > === START OF READ SMART DATA SECTION ===
> > SMART overall-health self-assessment test result: FAILED!
> > Drive failure expected in less than 24 hours. SAVE ALL DATA.
> > Failed Attributes:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >  10 Spin_Retry_Count        0x002b   222   215   223    Pre-fail  Always   FAILING_NOW 29
>
> A spin-up failure could be caused by a weak power supply or a power
If that's the case (spin-up delay), maybe ...
My BIOS has a setting to wait x seconds for the disk to spin up (I don't need it). If yours has that, maybe "That's the ticket, yeah" (Thanks SNL).
If it's a weak power supply, keep in mind that there are multiple rails with different capacities. PS rating may seem sufficient, but may be weak on one or more rails. Try using a PS connector from a different rail to split up the start-up draw.
If pushing on the connector on the drive seems to solve it, several things are possible. It could be the cable, the cable end, or a "cold solder joint" on the HD circuit board. The cable and connector can be easily, and inexpensively, replaced to test that. But it could still be the cold solder joint - replacing the cable might (temporarily) mimic the effects of your pushing on it.
If it's the cold solder joint, the most likely spot is where the connector pins attach to the board. With a magnifying glass, you may be able to see a hairline crack at one of the pins. The symptoms would be temperature-sensitive - they would tend to appear (presuming no external influence such as vibration or torquing of the unit) more consistently at cooler or warmer ambient temperatures, though if air circulation in the box is poor, the link to ambient temperature might not be obvious.
If you can see the poor joint, a skilled solderer with appropriately sized irons and solder wire might be able to fix it. But that might be more expensive than buying a new drive. I have repaired these this way in the past, but that was when the 5.25" form factor was standard and things were larger. My hands are not really that nimble and I wouldn't try it on today's stuff - it's all so much smaller.
> connector that is not making good contact. It is not likely to be a problem with the EIDE cable. Note that even if you do correct the problem, the SMART advisory will likely remain due to the accumulated failure count, but the boot failures should stop.
Maybe the failure count gets cleared by a full run with no errors?