I'm running CentOS 4.3 (kernel 2.6.9-22.ELsmp) on a server, with Windows XP as a guest under VMware 5.5.1. The server has three SCSI disks: sda for the Linux system, sdb for the VMware disk, and sdc for other data.

Recently I found that something is wrong with the second disk. The Windows XP guest copies files from a Samba server (another box) to its disk, and after some time (maybe 5 hours, 3 hours, or 1 hour) the kernel reports that sdb is offline. After a reboot everything is OK again, with no filesystem check. But the problem recurs every few hours whenever Windows XP is copying a lot of files to its disk.

I'm not sure whether the heavy load on the disk is causing this. I ran smartctl to see whether the disk was overheating, but found nothing: smartctl says the temperature is fine (27-29 C).

Is it a hardware problem? The cable? Is the disk dying? A kernel problem? A VMware problem?

Here is the dmesg dump from when the problem happened:

Nov 26 08:41:43 server kernel: device eth2 entered promiscuous mode
Nov 26 08:41:43 server kernel: bridge-eth2: enabled promiscuous mode
Nov 27 19:12:07 server kernel: scsi0:0:1:0: Attempting to abort cmd f7a97500: 0x2a 0x0 0xc 0x7b 0x41 0x30 0x0 0x0 0x68 0x0
Nov 27 19:12:07 server kernel: scsi0: At time of recovery, card was not paused
Nov 27 19:12:07 server kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Nov 27 19:12:07 server kernel: scsi0: Dumping Card State at program address 0x26 Mode 0x22
Nov 27 19:12:07 server kernel: Card was paused
Nov 27 19:12:07 server kernel: HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11]
Nov 27 19:12:07 server kernel: DFFSTAT[0x33] SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0]
Nov 27 19:12:07 server kernel: LASTPHASE[0x1] SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0]
Nov 27 19:12:07 server kernel: SEQINTCTL[0x0] SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0] SSTAT0[0x0]
Nov 27 19:12:07 server kernel: SSTAT1[0x0] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0]
Nov 27 19:12:07 server kernel: SIMODE1[0xa4] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0]
Nov 27 19:12:07 server kernel: LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0xe1]
Nov 27 19:12:07 server kernel:
Nov 27 19:12:58 server kernel: SCB Count = 12 CMDS_PENDING = 4 LASTSCB 0x6 CURRSCB 0x3 NEXTSCB 0xff40
Nov 27 19:12:58 server kernel: qinstart = 23816 qinfifonext = 23816
Nov 27 19:12:58 server kernel: QINFIFO:
Nov 27 19:12:58 server kernel: WAITING_TID_QUEUES:
Nov 27 19:12:58 server kernel: Pending list:
Nov 27 19:12:58 server kernel: 9 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 0 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 7 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 5 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: Total 4
Nov 27 19:12:58 server kernel: Kernel Free SCB list: 3 6 11 2 4 1 10 8
Nov 27 19:12:58 server kernel: Sequencer Complete DMA-inprog list:
Nov 27 19:12:58 server kernel: Sequencer Complete list:
Nov 27 19:12:58 server kernel: Sequencer DMA-Up and Complete list:
Nov 27 19:12:58 server kernel:
Nov 27 19:12:58 server kernel: scsi0: FIFO0 Free, LONGJMP == 0x8252, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x4] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: scsi0: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: LQIN: 0x8 0x0 0x0 0x3 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: scsi0: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
Nov 27 19:12:58 server kernel: scsi0: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
Nov 27 19:12:58 server kernel: SIMODE0[0xc]
Nov 27 19:12:58 server kernel: CCSCBCTL[0x0]
Nov 27 19:12:58 server kernel: scsi0: REG0 == 0x3, SINDEX = 0x102, DINDEX = 0x102
Nov 27 19:12:58 server kernel: scsi0: SCBPTR == 0x3, SCB_NEXT == 0xff40, SCB_NEXT2 == 0xff86
Nov 27 19:12:58 server kernel: CDB 2a 0 1 80 8 6c
Nov 27 19:12:58 server kernel: STACK: 0x14 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: <<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
Nov 27 19:12:58 server kernel: DevQ(0:0:0): 0 waiting
Nov 27 19:12:58 server kernel: DevQ(0:1:0): 0 waiting
Nov 27 19:12:58 server kernel: DevQ(0:2:0): 0 waiting
Nov 27 19:12:58 server kernel: (scsi0:A:1:0): Device is disconnected, re-queuing SCB
Nov 27 19:12:58 server kernel: Recovery code sleeping
Nov 27 19:12:58 server kernel: (scsi0:A:1:0): Task Management Func 0x1 Complete
Nov 27 19:12:58 server kernel: Recovery SCB completes
Nov 27 19:12:58 server kernel: Recovery code awake
Nov 27 19:12:58 server kernel: scsi0:0:1:0: Attempting to abort cmd f7a97500: 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: scsi0: At time of recovery, card was not paused
Nov 27 19:12:58 server kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Nov 27 19:12:58 server kernel: scsi0: Dumping Card State at program address 0x24 Mode 0x0
Nov 27 19:12:58 server kernel: Card was paused
Nov 27 19:12:58 server kernel: HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11]
Nov 27 19:12:58 server kernel: DFFSTAT[0x33] SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0]
Nov 27 19:12:58 server kernel: LASTPHASE[0x1] SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0]
Nov 27 19:12:58 server kernel: SEQINTCTL[0x0] SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0] SSTAT0[0x0]
Nov 27 19:12:58 server kernel: SSTAT1[0x8] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0]
Nov 27 19:12:58 server kernel: SIMODE1[0xa4] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0]
Nov 27 19:12:58 server kernel: LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0xe1]
Nov 27 19:12:58 server kernel:
Nov 27 19:12:58 server kernel: SCB Count = 12 CMDS_PENDING = 4 LASTSCB 0x6 CURRSCB 0x5 NEXTSCB 0xffc0
Nov 27 19:12:58 server kernel: qinstart = 23818 qinfifonext = 23818
Nov 27 19:12:58 server kernel: QINFIFO:
Nov 27 19:12:58 server kernel: WAITING_TID_QUEUES:
Nov 27 19:12:58 server kernel: Pending list:
Nov 27 19:12:58 server kernel: 5 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 9 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 0 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: 7 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x17]
Nov 27 19:12:58 server kernel: Total 4
Nov 27 19:12:58 server kernel: Kernel Free SCB list: 3 6 11 2 4 1 10 8
Nov 27 19:12:58 server kernel: Sequencer Complete DMA-inprog list:
Nov 27 19:12:58 server kernel: Sequencer Complete list:
Nov 27 19:12:58 server kernel: Sequencer DMA-Up and Complete list:
Nov 27 19:12:58 server kernel:
Nov 27 19:12:58 server kernel: scsi0: FIFO0 Free, LONGJMP == 0x8252, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x4] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: scsi0: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3
Nov 27 19:12:58 server kernel: SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
Nov 27 19:12:58 server kernel: SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
Nov 27 19:12:58 server kernel: SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
Nov 27 19:12:58 server kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
Nov 27 19:12:58 server kernel: LQIN: 0x8 0x0 0x0 0x3 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 27 19:12:58 server kernel: scsi0: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
Nov 27 19:12:58 server kernel: scsi0: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
Nov 27 19:12:58 server kernel: SIMODE0[0xc]
Nov 27 19:12:58 server kernel: CCSCBCTL[0x4]
Nov 27 19:12:59 server kernel: scsi0: REG0 == 0x6b60, SINDEX = 0x104, DINDEX = 0x104
......
Nov 27 19:12:59 server kernel: DevQ(0:0:0): 0 waiting
Nov 27 19:12:59 server kernel: DevQ(0:1:0): 0 waiting
Nov 27 19:12:59 server kernel: DevQ(0:2:0): 0 waiting
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): Device is disconnected, re-queuing SCB
Nov 27 19:12:59 server kernel: Recovery code sleeping
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): Abort Tag Message Sent
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): SCB 5 - Abort Completed.
Nov 27 19:12:59 server kernel: Recovery SCB completes
Nov 27 19:12:59 server kernel: found == 0x1
Nov 27 19:12:59 server kernel: Recovery code awake
Nov 27 19:12:59 server kernel: Recovery code sleeping
Nov 27 19:12:59 server kernel: (scsi0:A:1:0): Bus Device Reset Message Sent
Nov 27 19:12:59 server kernel: Recovery SCB completes
Nov 27 19:12:59 server kernel: scsi0: Bus Device Reset on A:1. 1 SCBs aborted
Nov 27 19:12:59 server kernel: Recovery code awake
Nov 27 19:12:59 server kernel: scsi0: Device reset returning 0x2002
Nov 27 19:12:59 server kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 1 lun 0
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209404208
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110526
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110527
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110528
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110529
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110530
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110531
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110532
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110533
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110534
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: Buffer I/O error on device sdb2, logical block 10110535
Nov 27 19:12:59 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: Aborting journal on device sdb2.
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: ext3_abort called.
Nov 27 19:12:59 server kernel: EXT3-fs error (device sdb2): ext3_journal_start_sb: Detected aborted journal
Nov 27 19:12:59 server kernel: Remounting filesystem read-only
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209368912
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209369760
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:12:59 server kernel: SCSI error : <0 0 1 0> return code = 0x10000
Nov 27 19:12:59 server kernel: end_request: I/O error, dev sdb, sector 209370072
Nov 27 19:12:59 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:13:29 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:13:29 server kernel: printk: 5631 messages suppressed.
Nov 27 19:13:29 server kernel: Buffer I/O error on device sdb2, logical block 9928706
Nov 27 19:13:29 server kernel: lost page write due to I/O error on sdb2
Nov 27 19:18:54 server kernel: device eth2 left promiscuous mode
Nov 27 19:18:54 server kernel: bridge-eth2: disabled promiscuous mode
Nov 27 19:18:54 server kernel: device eth1 left promiscuous mode
Nov 27 19:18:54 server kernel: bridge-eth1: disabled promiscuous mode
Nov 27 19:18:54 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:18:54 server kernel: EXT3-fs error (device sdb2): ext3_find_entry: reading directory #4964353 offset 0
Nov 27 19:18:54 server kernel:
Nov 27 19:20:19 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 6
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 7
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 8
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 9
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 10
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 11
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 12
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 13
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 14
Nov 27 19:20:19 server kernel: Buffer I/O error on device sdb2, logical block 15
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 5
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 6
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 7
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 8
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 9
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 10
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 11
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 12
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 13
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 14
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb1, logical block 15
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1033
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1034
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1035
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1036
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1037
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1038
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1039
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1040
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1041
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1042
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1551
Nov 27 19:22:39 server kernel: scsi0 (1:0): rejecting I/O to offline device
Nov 27 19:22:39 server kernel: Buffer I/O error on device sdb2, logical block 1554

And here is the smartctl output:

[root@server ~]# smartctl -a /dev/sdb
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: SEAGATE ST3146807LC Version: 0007
Serial number: 3HY8YZNY00007613S1HZ
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Wed Nov 29 19:14:25 2006 CST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 29 C
Drive Trip Temperature: 68 C
Vendor (Seagate) cache information
  Blocks sent to initiator = 40668853
  Blocks received from initiator = 2255578307
  Blocks read from cache and sent to initiator = 15359133
  Number of read and write commands whose size <= segment size = 22149142
  Number of read and write commands whose size > segment size = 1651834
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 2001.53
  number of minutes until next internal SMART test = 28

Error counter log:
          Errors Corrected by          Total     Correction    Gigabytes     Total
              EEC         rereads/     errors    algorithm     processed     uncorrected
          fast | delayed  rewrites   corrected   invocations   [10^9 bytes]  errors
read:     5754        1         0        5755        5987        165.985          0
write:       0        0         7           7       15847         69.478          0

Non-medium error count: 6491
Error Events logging not supported
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err  [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->     -       2001     0x c83206c    [0x3 0x11 0x0]
# 2  Background short  Completed                 -          2          -        [- - -]
# 3  Background short  Completed                 -          2          -        [- - -]

Long (extended) Self Test duration: 3072 seconds [51.2 minutes]

I'm running a long self-test of this disk.

Thanks!
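
As a cross-check on the log above, here is a minimal Python sketch that decodes the ten bytes printed on the first "Attempting to abort cmd f7a97500" line. It assumes those bytes are a standard 10-byte SCSI READ(10)/WRITE(10) command block (opcode 0x2a being WRITE(10)); the function name decode_rw10 is made up for illustration and is not part of any tool mentioned here.

```python
import struct

def decode_rw10(cdb_bytes):
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB (hypothetical helper)."""
    if len(cdb_bytes) != 10:
        raise ValueError("expected a 10-byte CDB")
    # Layout: opcode, flags, 32-bit big-endian LBA, group number,
    # 16-bit big-endian transfer length, control byte.
    opcode, _flags, lba, _group, blocks, _control = struct.unpack(
        ">BBIBHB", bytes(cdb_bytes)
    )
    names = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    return {"op": names.get(opcode, hex(opcode)), "lba": lba, "blocks": blocks}

# Bytes copied from the first abort line in the kernel log.
logged = "0x2a 0x0 0xc 0x7b 0x41 0x30 0x0 0x0 0x68 0x0"
cdb = [int(tok, 16) for tok in logged.split()]
result = decode_rw10(cdb)
print(result)
```

Under that assumption, the decoded command is a WRITE(10) of 104 blocks starting at LBA 209404208, which is the same sector number reported by the first "end_request: I/O error, dev sdb" line.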