I’m running CentOS 4.3 (kernel 2.6.9-22.ELsmp) on a server, with Windows XP running as a guest under VMware 5.5.1.
The server has three SCSI disks: sda for the Linux system, sdb for the VMware disks, and sdc for other data.
Recently I found that something is wrong with the second disk. While the Windows XP guest is copying
files from a Samba server (another box) to its disk, after some time (maybe 5 hours, 3 hours, or 1 hour)
the kernel reports that sdb is offline.
After I reboot, everything is OK again, and no filesystem check is run.
But the problem recurs every few hours whenever the Windows XP guest copies a lot of files to its disk.
I’m not sure whether the heavy disk load is what causes the problem.
I ran smartctl to see if the disk was overheating, but found nothing: smartctl says the temperature is OK (27-29 C).
Is it a hardware problem? The cable? Is the disk dying? A kernel problem? A VMware problem?
Here is the dmesg dump from when the problem happens:
Nov 26 08:41:43 server kernel: device eth2 entered promiscuous mode
Nov 26 08:41:43 server kernel: bridge-eth2: enabled promiscuous mode
Nov 27 19:12:07 server kernel: scsi0:0:1:0: Attempting to abort cmd f7a97500: 0x2a 0x0 0xc 0x7b 0x41 0x30 0x0 0x0 0x68 0x0
Nov 27 19:12:07 server kernel: scsi0: At time of recovery, card was not paused
Nov 27 19:12:07 server kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Nov 27 19:12:07 server kernel: scsi0: Dumping Card State at program address 0x26 Mode 0x22
Nov 27 19:12:07 server kernel: Card was paused
Nov 27 19:12:07 server kernel: HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11]
Nov 27 19:12:07 server kernel: DFFSTAT[0x33] SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0]
Nov 27 19:12:07 server kernel: LASTPHASE[0x1] SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0]
Nov 27 19:12:07 server kernel: SEQINTCTL[0x0] SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0] SSTAT0[0x0]
Nov 27 19:12:07 server kernel: SSTAT1[0x0] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0]
Nov 27 19:12:07 server kernel: SIMODE1[0xa4] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0]
Nov 27 19:12:07 server kernel: LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0xe1]
Nov 27 19:12:07 server kernel:
Nov 27 19:12:58 server kernel: SCB Count = 12 CMDS_PENDING = 4 LASTSCB 0x6 CURRSCB 0x3 NEXTSCB 0xff40
Nov 27 19:12:58 server kernel: qinstart = 23816 qinfifonext = 23816
Nov 27 19:12:58 server kernel: QINFIFO:
Nov 27 19:12:58 server kernel: WAITING_TID_QUEUES:
Nov 27 19:12:58 server kernel: Pending list:
Nov 27 19:12:58 server kernel: 9 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 0 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 7 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 5 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: Total 4
Nov 27 19:12:58 server kernel: Kernel Free SCB list: 3 6 11 2 4 1 10 8
Nov 27 19:12:58 server kernel: Sequencer Complete DMA-inprog list:
Nov 27 19:12:58 server kernel: Sequencer Complete list:
Nov 27 19:12:58 server kernel: Sequencer DMA-Up and Complete list:
Nov 27 19:12:58 server kernel:
Nov 27 19:12:58 server kernel: scsi0: FIFO0 Free, LONGJMP == 0x8252, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x4] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: scsi0: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: LQIN: 0x8 0x0 0x0 0x3 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: scsi0: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
Nov 27 19:12:58 server kernel: scsi0: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
Nov 27 19:12:58 server kernel: SIMODE0[0xc]
Nov 27 19:12:58 server kernel: CCSCBCTL[0x0]
Nov 27 19:12:58 server kernel: scsi0: REG0 == 0x3, SINDEX = 0x102, DINDEX = 0x102
Nov 27 19:12:58 server kernel: scsi0: SCBPTR == 0x3, SCB_NEXT == 0xff40, SCB_NEXT2 == 0xff86
Nov 27 19:12:58 server kernel: CDB 2a 0 1 80 8 6c
Nov 27 19:12:58 server kernel: STACK: 0x14 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: <<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
Nov 27 19:12:58 server kernel: DevQ(0:0:0): 0 waiting
Nov 27 19:12:58 server kernel: DevQ(0:1:0): 0 waiting
Nov 27 19:12:58 server kernel: DevQ(0:2:0): 0 waiting
Nov 27 19:12:58 server kernel: (scsi0:A:1:0): Device is disconnected, re-queuing SCB
Nov 27 19:12:58 server kernel: Recovery code sleeping
Nov 27 19:12:58 server kernel: (scsi0:A:1:0): Task Management Func 0x1 Complete
Nov 27 19:12:58 server kernel: Recovery SCB completes
Nov 27 19:12:58 server kernel: Recovery code awake
Nov 27 19:12:58 server kernel: scsi0:0:1:0: Attempting to abort cmd f7a97500: 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: scsi0: At time of recovery, card was not paused
Nov 27 19:12:58 server kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Nov 27 19:12:58 server kernel: scsi0: Dumping Card State at program address 0x24 Mode 0x0
Nov 27 19:12:58 server kernel: Card was paused
Nov 27 19:12:58 server kernel: HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11]
Nov 27 19:12:58 server kernel: DFFSTAT[0x33] SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0]
Nov 27 19:12:58 server kernel: LASTPHASE[0x1] SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0]
Nov 27 19:12:58 server kernel: SEQINTCTL[0x0] SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0] SSTAT0[0x0]
Nov 27 19:12:58 server kernel: SSTAT1[0x8] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0]
Nov 27 19:12:58 server kernel: SIMODE1[0xa4] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0]
Nov 27 19:12:58 server kernel: LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0xe1]
Nov 27 19:12:58 server kernel:
Nov 27 19:12:58 server kernel: SCB Count = 12 CMDS_PENDING = 4 LASTSCB 0x6 CURRSCB 0x5 NEXTSCB 0xffc0
Nov 27 19:12:58 server kernel: qinstart = 23818 qinfifonext = 23818
Nov 27 19:12:58 server kernel: QINFIFO:
Nov 27 19:12:58 server kernel: WAITING_TID_QUEUES:
Nov 27 19:12:58 server kernel: Pending list:
Nov 27 19:12:58 server kernel: 5 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 9 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 0 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 7 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: Total 4
Nov 27 19:12:58 server kernel: Kernel Free SCB list: 3 6 11 2 4 1 10 8
Nov 27 19:12:58 server kernel: Sequencer Complete DMA-inprog list:
Nov 27 19:12:58 server kernel: Sequencer Complete list:
Nov 27 19:12:58 server kernel: Sequencer DMA-Up and Complete list:
Nov 27 19:12:58 server kernel:
Nov 27 19:12:58 server kernel: scsi0: FIFO0 Free, LONGJMP == 0x8252, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x4] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: scsi0: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: LQIN: 0x8 0x0 0x0 0x3 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: scsi0: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
Nov 27 19:12:58 server kernel: scsi0: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
Nov 27 19:12:58 server kernel: SIMODE0[0xc]
Nov 27 19:12:58 server kernel: CCSCBCTL[0x4]
Nov 27 19:12:59 server kernel: scsi0: REG0 == 0x6b60, SINDEX = 0x104, DINDEX = 0x104
......
Nov 27 19:12:59 server kernel: DevQ(0:0:0): 0 waiting
Nov 27 19:12:59 server kernel: DevQ(0:1:0): 0 waiting
Nov 27 19:12:59 server kernel: DevQ(0:2:0): 0 waiting
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): Device is disconnected, re-queuing SCB
Nov 27 19:12:59 server kernel: Recovery code sleeping
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): Abort Tag Message Sent
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): SCB 5 - Abort Completed.
Nov 27 19:12:59 server kernel: Recovery SCB completes
Nov 27 19:12:59 server kernel: found == 0x1
Nov 27 19:12:59 server kernel: Recovery code awake
Nov 27 19:12:59 server kernel: Recovery code sleeping
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): Bus Device Reset Message Sent
Nov 27 19:12:59 server kernel: Recovery SCB completes
Nov 27 19:12:59 server kernel: scsi0: Bus Device Reset on A:1. 1 SCBs aborted
Nov 27 19:12:59 server kernel: Recovery code awake
Nov 27 19:12:59 server kernel: scsi0: Device reset returning 0x2002
Nov 27 19:12:59 server kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 1 lun 0
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209404208
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110526
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110527
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110528
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110529
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110530
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110531
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110532
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110533
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110534
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110535
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: Aborting journal on device sdb2.
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: ext3_abort called.
Nov 27 19:12:59 server kernel: EXT3-fs error (device sdb2): ext3_journal_start_sb: Detected aborted journal
Nov 27 19:12:59 server kernel: Remounting filesystem read-only
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209368912
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209369760
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209370072
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:13:29 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:13:29 server kernel: printk: 5631 messages suppressed.
Nov 27 19:13:29 server kernel: Buffer I/O error on device sdb2, logical block 9928706
Nov 27 19:13:29 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:18:54 server kernel: device eth2 left promiscuous mode
Nov 27 19:18:54 server kernel: bridge-eth2: disabled promiscuous mode
Nov 27 19:18:54 server kernel: device eth1 left promiscuous mode
Nov 27 19:18:54 server kernel: bridge-eth1: disabled promiscuous mode
Nov 27 19:18:54 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:18:54 server kernel: EXT3-fs error (device sdb2): ext3_find_entry: reading directory #4964353 offset 0
Nov 27 19:18:54 server kernel:
Nov 27 19:20:19 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 6
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 7
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 8
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 9
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 10
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 11
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 12
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 13
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 14
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 15
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 5
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 6
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 7
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 8
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 9
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 10
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 11
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 12
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 13
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 14
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 15
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1033
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1034
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1035
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1036
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1037
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1038
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1039
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1040
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1041
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1042
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1551
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1554
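For anyone reading the log: the bytes printed after the first "Attempting to abort cmd" message are the SCSI command descriptor block (CDB) of the command that timed out. The following is a minimal sketch for decoding it, assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group, 2-byte transfer length, control). The function name decode_rw10 is only illustrative and is not part of any tool.

```python
import struct

def decode_rw10(cdb_bytes):
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB into (name, lba, blocks)."""
    if len(cdb_bytes) != 10:
        raise ValueError("expected a 10-byte CDB")
    # Big-endian fields: opcode, flags, 32-bit LBA, group, 16-bit length, control
    opcode, _flags, lba, _group, blocks, _control = struct.unpack(
        ">BBIBHB", bytes(cdb_bytes)
    )
    names = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    return names.get(opcode, hex(opcode)), lba, blocks

# Bytes copied from the first "Attempting to abort cmd" line in the log above
cdb = [0x2a, 0x0, 0xc, 0x7b, 0x41, 0x30, 0x0, 0x0, 0x68, 0x0]
name, lba, blocks = decode_rw10(cdb)
print(name, lba, blocks)
```

This decodes to a WRITE(10) of 104 blocks at LBA 209404208, which is the same sector the kernel reports in "end_request: I/O error, dev sdb, sector 209404208".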
And here is the smartctl output:
[root@server ~]# smartctl -a /dev/sdb
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Device: SEAGATE ST3146807LC Version: 0007
Serial number: 3HY8YZNY00007613S1HZ
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Wed Nov 29 19:14:25 2006 CST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
Current Drive Temperature: 29 C
Drive Trip Temperature: 68 C
Vendor (Seagate) cache information
Blocks sent to initiator = 40668853
Blocks received from initiator = 2255578307
Blocks read from cache and sent to initiator = 15359133
Number of read and write commands whose size <= segment size = 22149142
Number of read and write commands whose size > segment size = 1651834
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 2001.53
number of minutes until next internal SMART test = 28
Error counter log:
          Errors Corrected by           Total   Correction     Gigabytes    Total
              EEC          rereads/    errors   algorithm      processed    uncorrected
          fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      5754        1         0      5755       5987        165.985           0
write:        0        0         7         7      15847         69.478           0
Non-medium error count: 6491
Error Events logging not supported
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       -    2001       0x c83206c [0x3 0x11 0x0]
# 2  Background short  Completed                   -       2                - [-   -    -]
# 3  Background short  Completed                   -       2                - [-   -    -]
Long (extended) Self Test duration: 3072 seconds [51.2 minutes]
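The failed entry in the self-test log can be read a little further. The sketch below is only a rough aid, assuming the bracketed values are the SCSI sense key, additional sense code and qualifier, and using a small lookup table that covers just the standard codes needed here. The table and the function name describe_selftest_failure are illustrative, not taken from smartctl.

```python
# Standard SCSI sense keys (only the first few are listed here)
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}

# Deliberately partial table of (ASC, ASCQ) pairs
ADDITIONAL_SENSE = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
}

def describe_selftest_failure(lba_text, sk, asc, ascq):
    """Turn one self-test log entry into (decimal LBA, sense key, detail)."""
    # smartctl 5.33 prints the LBA as "0x c83206c", so drop the space first
    lba = int(lba_text.replace(" ", ""), 16)
    key = SENSE_KEYS.get(sk, "unknown sense key")
    detail = ADDITIONAL_SENSE.get((asc, ascq), "not in this table")
    return lba, key, detail

# Values copied from entry # 1 of the self-test log above
print(describe_selftest_failure("0x c83206c", 0x3, 0x11, 0x0))
```

Under those assumptions the long test stopped at LBA 209920108 with MEDIUM ERROR / UNRECOVERED READ ERROR, meaning the drive itself could not read that sector. That LBA is not far from sector 209404208, where the first aborted write in the log was aimed.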
I’m running a long self-test on this disk.
Thanks!