Hello,
I started a short SMART self-test (smartctl -t short) on a disk that is a member of RAID1 arrays, and during the test the disk was kicked out of the RAID (from only one of its three MD arrays).
/var/log/messages:
Nov 1 16:45:45 server kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 1 16:45:45 server kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Nov 1 16:45:45 server kernel: res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
Nov 1 16:45:45 server kernel: ata1.00: status: { DRDY ERR }
Nov 1 16:45:45 server kernel: ata1.00: error: { ABRT }
Nov 1 16:45:45 server kernel: ata1.00: configured for UDMA/133
Nov 1 16:45:45 server kernel: ata1: EH complete
Nov 1 16:45:45 server kernel: SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Nov 1 16:45:45 server kernel: sda: Write Protect is off
Nov 1 16:45:45 server kernel: SCSI device sda: drive cache: write back
Nov 1 16:45:52 server kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 1 16:45:52 server kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Nov 1 16:45:52 server kernel: res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
Nov 1 16:45:52 server kernel: ata1.00: status: { DRDY ERR }
Nov 1 16:45:52 server kernel: ata1.00: error: { ABRT }
Nov 1 16:45:52 server kernel: ata1.00: configured for UDMA/133
Nov 1 16:45:52 server kernel: ata1: EH complete
Nov 1 16:45:52 server kernel: SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Nov 1 16:45:52 server kernel: sda: Write Protect is off
Nov 1 16:45:52 server kernel: SCSI device sda: drive cache: write back
Nov 1 16:47:43 server kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 1 16:47:43 server kernel: ata1.00: BMDMA stat 0x25
Nov 1 16:47:43 server kernel: ata1.00: cmd ca/00:08:1f:41:1e/00:00:00:00:00/e1 tag 0 dma 4096 out
Nov 1 16:47:43 server kernel: res 51/10:08:1f:41:1e/00:00:00:00:00/e1 Emask 0x81 (invalid argument)
Nov 1 16:47:43 server kernel: ata1.00: status: { DRDY ERR }
Nov 1 16:47:43 server kernel: ata1.00: error: { IDNF }
Nov 1 16:47:43 server kernel: ata1.00: configured for UDMA/133
Nov 1 16:47:43 server kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Nov 1 16:47:43 server kernel: sda: Current [descriptor]: sense key: Aborted Command
Nov 1 16:47:43 server kernel: Add. Sense: Recorded entity not found
Nov 1 16:47:43 server kernel:
Nov 1 16:47:44 server kernel: Descriptor sense data with sense descriptors (in hex):
Nov 1 16:47:44 server kernel: 72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Nov 1 16:47:44 server kernel: 01 1e 41 1f
Nov 1 16:47:44 server kernel: end_request: I/O error, dev sda, sector 18759967
Nov 1 16:47:44 server kernel: raid1: Disk failure on sda1, disabling device.
Nov 1 16:47:44 server kernel: Operation continuing on 1 devices
Nov 1 16:47:44 server kernel: ata1: EH complete
Nov 1 16:47:44 server kernel: SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Nov 1 16:47:44 server kernel: sda: Write Protect is off
Nov 1 16:47:44 server kernel: SCSI device sda: drive cache: write back
Nov 1 16:47:44 server kernel: RAID1 conf printout:
Nov 1 16:47:44 server kernel: --- wd:1 rd:2
Nov 1 16:47:44 server kernel: disk 0, wo:1, o:0, dev:sda1
Nov 1 16:47:44 server kernel: disk 1, wo:0, o:1, dev:sdb1
Nov 1 16:47:44 server kernel: RAID1 conf printout:
Nov 1 16:47:44 server kernel: --- wd:1 rd:2
Nov 1 16:47:44 server kernel: disk 1, wo:0, o:1, dev:sdb1
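For anyone reading the log, the first two hex bytes of each "res" field are the ATA status and error registers, which the kernel then prints in decoded form as "status: { ... }" and "error: { ... }". The following is only a minimal sketch of that decoding; the helper names (decode_res, flag_names) are hypothetical, and the bit tables are my assumption based on the standard ATA register layout, limited to the flags relevant here.

```python
import re

# Assumed bit assignments from the standard ATA status and error registers.
# Only a subset is listed; other bits (e.g. 0x10 in status) are ignored.
STATUS_BITS = [(0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x08, "DRQ"), (0x01, "ERR")]
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"), (0x04, "ABRT")]

# Matches the start of a libata result field: "res <status>/<error>:..."
RES_PATTERN = re.compile(r"res ([0-9a-f]{2})/([0-9a-f]{2}):")


def flag_names(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


def decode_res(line):
    """Decode the status and error bytes of a kernel 'res' log line."""
    match = RES_PATTERN.search(line)
    if match is None:
        raise ValueError("no 'res' field found in line")
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    return flag_names(status, STATUS_BITS), flag_names(error, ERROR_BITS)


if __name__ == "__main__":
    # The two distinct result lines from the log above.
    for line in (
        "res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)",
        "res 51/10:08:1f:41:1e/00:00:00:00:00/e1 Emask 0x81 (invalid argument)",
    ):
        print(decode_res(line))
```

Under these assumptions the first line decodes to ABRT (the aborted "ea" command) and the second to IDNF (the failed "ca" write), which agrees with what the kernel printed.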
And the output of smartctl --all:
=== START OF INFORMATION SECTION ===
Device Model: WDC WD3201ABYS-01B9A0
Serial Number:
Firmware Version: 13.01C02
User Capacity: 320 072 933 376 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Nov 1 19:26:47 2009 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84)
    Offline data collection activity was suspended by an interrupting command from host.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (8400) seconds.
Offline data collection capabilities: (0x7b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 100) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f)
    SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0003 156   156   021    Pre-fail Always  -           5183
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           82
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000e 200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 081   081   000    Old_age  Always  -           14329
 10 Spin_Retry_Count        0x0012 100   253   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0012 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           82
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           58
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           82
194 Temperature_Celsius     0x0022 123   106   000    Old_age  Always  -           24
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0012 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0
SMART Error Log Version: 1
ATA Error Count: 5
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 5 occurred at disk power-on lifetime: 14329 hours (597 days + 1 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 a7 4a 1e e1  Error: IDNF at LBA = 0x011e4aa7 = 18762407
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 a7 4a 1e e1 0a  45d+04:44:26.835  WRITE DMA
  ca 00 08 3f 14 00 e2 0a  45d+04:44:26.816  WRITE DMA
Error 4 occurred at disk power-on lifetime: 14326 hours (596 days + 22 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 1f 41 1e e1  Error: IDNF at LBA = 0x011e411f = 18759967
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 1f 41 1e e1 0a  45d+02:30:44.518  WRITE DMA
  ca 00 08 3f 14 00 e2 0a  45d+02:30:44.504  WRITE DMA
SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed without error  00%        14329            -
# 2  Short offline     Completed without error  00%        14329            -
# 3  Extended offline  Completed without error  00%        14328            -
# 4  Short offline     Completed without error  00%        14326            -
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
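To cross-check the SMART error log against the kernel log, the LBA that smartctl reports can be rebuilt from the SN, CL, CH and DH registers. This is a rough sketch that assumes 28-bit LBA addressing (which fits the WRITE DMA command "ca" shown above); the function names are hypothetical and are not part of smartctl.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA.

    Assumes LBA28 layout: SN holds bits 0-7, CL bits 8-15,
    CH bits 16-23, and the low nibble of DH bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def lba_from_error_row(row):
    """Parse an 'ER ST SC SN CL CH DH' row from the SMART error log."""
    er, st, sc, sn, cl, ch, dh = (int(token, 16) for token in row.split())
    return lba28_from_registers(sn, cl, ch, dh)


if __name__ == "__main__":
    error5 = lba_from_error_row("10 51 08 a7 4a 1e e1")
    error4 = lba_from_error_row("10 51 08 1f 41 1e e1")
    print(hex(error5), error5)
    print(hex(error4), error4)
    # Distance between the two failing writes, in 512-byte sectors.
    print(error5 - error4)
```

With the register rows from Error 5 and Error 4 this gives 18762407 and 18759967. The second value is the same sector the kernel reported in "end_request: I/O error, dev sda, sector 18759967", and the two failing writes are 2440 sectors apart.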
The disk was kicked out of the MD array during short tests #4 and #2 in the self-test log above. The second HDD (the same model as the problematic one) shows no errors.
Thank you for your help.