I'm running CentOS 4.3 (2.6.9-22.ELsmp) on a box, with Windows XP as a guest under VMware 5.5.1. The server has three SCSI disks: sda for the Linux system, sdb for the VMware disk, and sdc for other data.

Recently I found that something is wrong with the second disk. When the Windows XP guest copies files from a Samba server (another box) to its disk, after some time (maybe 5 hours, 3 hours, or 1 hour) the kernel says that sdb is offline. After I reboot, everything is OK, with no filesystem check. But the problem recurs every few hours whenever the Windows XP guest copies a lot of files to its disk.

I'm not sure whether the heavy load on the disk causes this. I ran smartctl to see if the disk was overheating, but found nothing; smartctl says the temperature is OK (27-29 C).

Is it a hardware problem? The cable? Is the disk dying? A kernel problem? A VMware problem?

Here is the dmesg dump from when the problem happens:

Nov 26 08:41:43 server kernel: device eth2 entered promiscuous mode
Nov 26 08:41:43 server kernel: bridge-eth2: enabled promiscuous mode
Nov 27 19:12:07 server kernel: scsi0:0:1:0: Attempting to abort cmd f7a97500: 0x2a 0x0 0xc 0x7b 0x41 0x30 0x0 0x0 0x68 0x0
Nov 27 19:12:07 server kernel: scsi0: At time of recovery, card was not paused
Nov 27 19:12:07 server kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Nov 27 19:12:07 server kernel: scsi0: Dumping Card State at program address 0x26 Mode 0x22
Nov 27 19:12:07 server kernel: Card was paused
Nov 27 19:12:07 server kernel: HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11]
Nov 27 19:12:07 server kernel: DFFSTAT[0x33] SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0]
Nov 27 19:12:07 server kernel: LASTPHASE[0x1] SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0]
Nov 27 19:12:07 server kernel: SEQINTCTL[0x0] SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0] SSTAT0[0x0]
Nov 27 19:12:07 server kernel: SSTAT1[0x0] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0]
Nov 27 19:12:07 server kernel: SIMODE1[0xa4] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0]
Nov 27 19:12:07 server kernel: LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0xe1]
Nov 27 19:12:07 server kernel:
Nov 27 19:12:58 server kernel: SCB Count = 12 CMDS_PENDING = 4 LASTSCB 0x6 CURRSCB 0x3 NEXTSCB 0xff40
Nov 27 19:12:58 server kernel: qinstart = 23816 qinfifonext = 23816
Nov 27 19:12:58 server kernel: QINFIFO:
Nov 27 19:12:58 server kernel: WAITING_TID_QUEUES:
Nov 27 19:12:58 server kernel: Pending list:
Nov 27 19:12:58 server kernel: 9 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 0 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 7 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 5 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: Total 4
Nov 27 19:12:58 server kernel: Kernel Free SCB list: 3 6 11 2 4 1 10 8
Nov 27 19:12:58 server kernel: Sequencer Complete DMA-inprog list:
Nov 27 19:12:58 server kernel: Sequencer Complete list:
Nov 27 19:12:58 server kernel: Sequencer DMA-Up and Complete list:
Nov 27 19:12:58 server kernel:
Nov 27 19:12:58 server kernel: scsi0: FIFO0 Free, LONGJMP == 0x8252, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x4] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: scsi0: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: LQIN: 0x8 0x0 0x0 0x3 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: scsi0: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
Nov 27 19:12:58 server kernel: scsi0: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
Nov 27 19:12:58 server kernel: SIMODE0[0xc]
Nov 27 19:12:58 server kernel: CCSCBCTL[0x0]
Nov 27 19:12:58 server kernel: scsi0: REG0 == 0x3, SINDEX = 0x102, DINDEX = 0x102
Nov 27 19:12:58 server kernel: scsi0: SCBPTR == 0x3, SCB_NEXT == 0xff40, SCB_NEXT2 == 0xff86
Nov 27 19:12:58 server kernel: CDB 2a 0 1 80 8 6c
Nov 27 19:12:58 server kernel: STACK: 0x14 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: <<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
Nov 27 19:12:58 server kernel: DevQ(0:0:0): 0 waiting
Nov 27 19:12:58 server kernel: DevQ(0:1:0): 0 waiting
Nov 27 19:12:58 server kernel: DevQ(0:2:0): 0 waiting
Nov 27 19:12:58 server kernel: (scsi0:A:1:0): Device is disconnected, re-queuing SCB
Nov 27 19:12:58 server kernel: Recovery code sleeping
Nov 27 19:12:58 server kernel: (scsi0:A:1:0): Task Management Func 0x1 Complete
Nov 27 19:12:58 server kernel: Recovery SCB completes
Nov 27 19:12:58 server kernel: Recovery code awake
Nov 27 19:12:58 server kernel: scsi0:0:1:0: Attempting to abort cmd f7a97500: 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: scsi0: At time of recovery, card was not paused
Nov 27 19:12:58 server kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Nov 27 19:12:58 server kernel: scsi0: Dumping Card State at program address 0x24 Mode 0x0
Nov 27 19:12:58 server kernel: Card was paused
Nov 27 19:12:58 server kernel: HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11]
Nov 27 19:12:58 server kernel: DFFSTAT[0x33] SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0]
Nov 27 19:12:58 server kernel: LASTPHASE[0x1] SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0]
Nov 27 19:12:58 server kernel: SEQINTCTL[0x0] SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0] SSTAT0[0x0]
Nov 27 19:12:58 server kernel: SSTAT1[0x8] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0]
Nov 27 19:12:58 server kernel: SIMODE1[0xa4] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0]
Nov 27 19:12:58 server kernel: LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0xe1]
Nov 27 19:12:58 server kernel:
Nov 27 19:12:58 server kernel: SCB Count = 12 CMDS_PENDING = 4 LASTSCB 0x6 CURRSCB 0x5 NEXTSCB 0xffc0
Nov 27 19:12:58 server kernel: qinstart = 23818 qinfifonext = 23818
Nov 27 19:12:58 server kernel: QINFIFO:
Nov 27 19:12:58 server kernel: WAITING_TID_QUEUES:
Nov 27 19:12:58 server kernel: Pending list:
Nov 27 19:12:58 server kernel: 5 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 9 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 0 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 7 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: Total 4
Nov 27 19:12:58 server kernel: Kernel Free SCB list: 3 6 11 2 4 1 10 8
Nov 27 19:12:58 server kernel: Sequencer Complete DMA-inprog list:
Nov 27 19:12:58 server kernel: Sequencer Complete list:
Nov 27 19:12:58 server kernel: Sequencer DMA-Up and Complete list:
Nov 27 19:12:58 server kernel:
Nov 27 19:12:58 server kernel: scsi0: FIFO0 Free, LONGJMP == 0x8252, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x4] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: scsi0: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: LQIN: 0x8 0x0 0x0 0x3 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: scsi0: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
Nov 27 19:12:58 server kernel: scsi0: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
Nov 27 19:12:58 server kernel: SIMODE0[0xc]
Nov 27 19:12:58 server kernel: CCSCBCTL[0x4]
Nov 27 19:12:59 server kernel: scsi0: REG0 == 0x6b60, SINDEX = 0x104, DINDEX = 0x104
......
Nov 27 19:12:59 server kernel: DevQ(0:0:0): 0 waiting
Nov 27 19:12:59 server kernel: DevQ(0:1:0): 0 waiting
Nov 27 19:12:59 server kernel: DevQ(0:2:0): 0 waiting
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): Device is disconnected, re-queuing SCB
Nov 27 19:12:59 server kernel: Recovery code sleeping
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): Abort Tag Message Sent
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): SCB 5 - Abort Completed.
Nov 27 19:12:59 server kernel: Recovery SCB completes
Nov 27 19:12:59 server kernel: found == 0x1
Nov 27 19:12:59 server kernel: Recovery code awake
Nov 27 19:12:59 server kernel: Recovery code sleeping
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): Bus Device Reset Message Sent
Nov 27 19:12:59 server kernel: Recovery SCB completes
Nov 27 19:12:59 server kernel: scsi0: Bus Device Reset on A:1. 1 SCBs aborted
Nov 27 19:12:59 server kernel: Recovery code awake
Nov 27 19:12:59 server kernel: scsi0: Device reset returning 0x2002
Nov 27 19:12:59 server kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 1 lun 0
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209404208
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110526
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110527
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110528
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110529
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110530
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110531
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110532
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110533
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110534
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110535
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: Aborting journal on device sdb2.
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: ext3_abort called.
Nov 27 19:12:59 server kernel: EXT3-fs error (device sdb2): ext3_journal_start_sb: Detected aborted journal
Nov 27 19:12:59 server kernel: Remounting filesystem read-only
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209368912
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209369760
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209370072
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:13:29 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:13:29 server kernel: printk: 5631 messages suppressed.
Nov 27 19:13:29 server kernel: Buffer I/O error on device sdb2, logical block 9928706
Nov 27 19:13:29 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:18:54 server kernel: device eth2 left promiscuous mode
Nov 27 19:18:54 server kernel: bridge-eth2: disabled promiscuous mode
Nov 27 19:18:54 server kernel: device eth1 left promiscuous mode
Nov 27 19:18:54 server kernel: bridge-eth1: disabled promiscuous mode
Nov 27 19:18:54 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:18:54 server kernel: EXT3-fs error (device sdb2): ext3_find_entry: reading directory #4964353 offset 0
Nov 27 19:18:54 server kernel:
Nov 27 19:20:19 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 6
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 7
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 8
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 9
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 10
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 11
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 12
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 13
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 14
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 15
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 5
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 6
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 7
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 8
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 9
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 10
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 11
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 12
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 13
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 14
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 15
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1033
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1034
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1035
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1036
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1037
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1038
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1039
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1040
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1041
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1042
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1551
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1554
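
As a side note, the first "Attempting to abort cmd" line in the dump prints ten bytes that look like a SCSI WRITE(10) command block (opcode 0x2a). Here is a minimal Python sketch, assuming those ten bytes are the raw CDB in order, that decodes the starting block and the transfer length. The function name decode_rw10 is made up for illustration only.

```python
def decode_rw10(cdb):
    """Decode a 10-byte READ(10)/WRITE(10) CDB into (name, lba, blocks)."""
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    names = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    opcode = cdb[0]
    if opcode not in names:
        raise ValueError("not a READ(10)/WRITE(10) opcode: 0x%02x" % opcode)
    # Bytes 2..5 hold the logical block address, big-endian.
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    # Bytes 7..8 hold the transfer length in blocks, big-endian.
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return names[opcode], lba, blocks


# Bytes copied from the first "Attempting to abort cmd f7a97500" line
# (assumed to be the raw CDB).
cdb = [0x2A, 0x00, 0x0C, 0x7B, 0x41, 0x30, 0x00, 0x00, 0x68, 0x00]
name, lba, blocks = decode_rw10(cdb)
print(name, lba, blocks)  # WRITE(10) 209404208 104
```

If that assumption holds, the decoded LBA is 209404208, the same number as in the first "end_request: I/O error, dev sdb, sector 209404208" line, so the command that timed out appears to be the same write that later failed when the device was offlined.
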

And here is the smartctl output:

[root@server ~]# smartctl -a /dev/sdb
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: SEAGATE ST3146807LC Version: 0007
Serial number: 3HY8YZNY00007613S1HZ
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Wed Nov 29 19:14:25 2006 CST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     29 C
Drive Trip Temperature:        68 C
Vendor (Seagate) cache information
  Blocks sent to initiator = 40668853
  Blocks received from initiator = 2255578307
  Blocks read from cache and sent to initiator = 15359133
  Number of read and write commands whose size <= segment size = 22149142
  Number of read and write commands whose size > segment size = 1651834
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 2001.53
  number of minutes until next internal SMART test = 28

Error counter log:
          Errors Corrected by           Total   Correction     Gigabytes    Total
              EEC          rereads/    errors   algorithm      processed    uncorrected
          fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:       5754        1         0      5755       5987        165.985           0
write:         0        0         7         7      15847         69.478           0

Non-medium error count:     6491
Error Events logging not supported
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       -    2001         0x c83206c [0x3 0x11 0x0]
# 2  Background short  Completed                   -       2                 - [- - -]
# 3  Background short  Completed                   -       2                 - [- - -]

Long (extended) Self Test duration: 3072 seconds [51.2 minutes]

I'm running a long self-test of this disk.

Thanks!