NOTE: this is happening on CentOS 6 x86_64 (2.6.32-504.3.3.el6.x86_64), not CentOS 5.
Dell PowerEdge 2970, Seagate SATA drive, non-RAID.
I have this server, which has been dying randomly with nothing in the logs.
I had a tail -f running over ssh for a week when this happened:
Feb 8 00:10:21 thirteen-230 kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880057a0a080)
Feb 8 00:10:21 thirteen-230 kernel: sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 1a 17 a1 6f 00 00 01 00
Feb 8 00:10:51 thirteen-230 kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
Feb 8 00:10:51 thirteen-230 kernel: mptbase: ioc0: Initiating recovery
Feb 8 00:11:13 thirteen-230 kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880057a0a080)
Write failed: Connection reset by peer
After reading https://access.redhat.com/solutions/108273, I am increasing the logging (shown below), but I am not confident in this wait-and-see approach.
sysctl -w dev.scsi.logging_level=98367
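For reference, that magic number works out to error=7, timeout=7, llqueue=3 with everything else off, assuming the 3-bit-per-facility layout from include/scsi/scsi_logging.h (the field names below come from that header, not from the Red Hat article); a quick sketch to decode it:

level=98367   # 0x1803f
for f in error:0 timeout:3 scan:6 mlqueue:9 mlcomplete:12 llqueue:15 llcomplete:18 hlqueue:21 hlcomplete:24 ioctl:27; do
  printf '%-12s %d\n' "${f%%:*}" "$(( (level >> ${f##*:}) & 7 ))"   # facility name, 3-bit level
done
echo 'dev.scsi.logging_level = 98367' >> /etc/sysctl.conf   # only if the setting should survive a reboot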
I am also going to check the smartctl output once I get on-site to power-cycle the system.
Other posts I have read but cannot act on yet:
* http://unix.stackexchange.com/questions/34173/mptscsih-ioc0-task-abort-success-rv-2002-causes-30-seconds-freezing
* https://bugzilla.kernel.org/show_bug.cgi?id=18652
* https://bugzilla.redhat.com/show_bug.cgi?id=483424
* https://bugzilla.kernel.org/show_bug.cgi?id=42765
* http://sourceforge.net/p/smartmontools/mailman/message/23849184/
* http://kb.softescu.ro/category/hardware/dell/
-Jason
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.
-----Original Message----- From: Jason Pyeron Sent: Saturday, February 07, 2015 22:54
<snip/>
Here is a console picture.
http://i.imgur.com/ZYHlB82.jpg
<snip/>
# smartctl -a /dev/sda
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-504.3.3.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:     ST1500DM003-9YN16G
Serial Number:    W24153R0
LU WWN Device Id: 5 000c50 05d03cc1d
Firmware Version: CC82
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Feb 7 23:41:00 2015 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed without error
                                        or no self-test has ever been run.
Total time to complete Offline data collection: ( 600) seconds.
Offline data collection capabilities: (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:      (   1) minutes.
Extended self-test routine recommended polling time:   ( 194) minutes.
Conveyance self-test routine recommended polling time: (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       181943016
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       39599363
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       821
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       17
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   062   045    Old_age   Always       -       33 (Min/Max 30/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       4551
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       267112606073648
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2764453802303
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3442873711291
SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
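Since the drive reports that no self-tests have ever been logged, it might also be worth kicking one off while waiting; roughly (the durations are the polling times from the output above):

smartctl -t short /dev/sda      # quick electrical/mechanical check, about 1 minute
smartctl -t long /dev/sda       # full surface scan, about 194 minutes, runs on the drive itself
smartctl -l selftest /dev/sda   # view the self-test results afterwards
smartctl -l error /dev/sda      # and re-check the drive error log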
<snip/>
-Jason
-----Original Message----- From: Jason Pyeron Sent: Sunday, February 08, 2015 0:00
-----Original Message----- From: Jason Pyeron Sent: Saturday, February 07, 2015 22:54
<snip/>
Here is a console picture.
Thanks to netconsole, I have the panic to post:
Feb 16 06:06:56 BUG: soft lockup - CPU#0 stuck for 67s! [ksmd:88]
Feb 16 06:06:56 Modules linked in: nf_nat mpt3sas mpt2sas raid_class mptctl ipmi_si ipmi_devintf netconsole configfs ebtable_nat ebtables nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_snapshot dm_bufio dm_zero vhost_net macvtap macvlan tun kvm_amd kvm ipmi_msghandler dcdbas serio_raw bnx2 k10temp amd64_edac_mod edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas ata_generic pata_acpi sata_svw radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dell_rbu]
Feb 16 06:06:56 CPU 0
Feb 16 06:06:56 Pid: 88, comm: ksmd Not tainted 2.6.32-504.8.1.el6.centos.plus.x86_64 #1 Dell Inc. PowerEdge 2970/0JKN8W
Feb 16 06:06:56 RIP: 0010:[<ffffffff812a1411>]  [<ffffffff812a1411>] __bitmap_empty+0x41/0x90
Feb 16 06:06:56 RSP: 0018:ffff88021831dcb0  EFLAGS: 00000202
Feb 16 06:06:56 RAX: 0000000000000000 RBX: ffff88021831dcb0 RCX: 0000000000000010
Feb 16 06:06:56 RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffffffff81e2f198
Feb 16 06:06:56 RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
Feb 16 06:06:56 R10: ffffea0006679c20 R11: 0000000000000000 R12: 0000000000000000
Feb 16 06:06:56 R13: ffff8801c1b8f650 R14: 0000000198152467 R15: ffffffffa03af44a
Feb 16 06:06:56 FS:  00007fc4756b09a0(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Feb 16 06:06:56 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Feb 16 06:06:56 CR2: 000000c641faeff0 CR3: 0000000001a85000 CR4: 00000000000007f0
Feb 16 06:06:56 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 16 06:06:56 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 16 06:06:56 Process ksmd (pid: 88, threadinfo ffff88021831c000, task ffff880218310040)
Feb 16 06:06:56 Stack:
Feb 16 06:06:56  ffff88021831dd00 ffffffff81052268 00007f30249b8000 ffffffff81e2f180
Feb 16 06:06:56 <d> 8000000198152025 ffff880219ade700 00007f30249b8000 ffff880219ade9c8
Feb 16 06:06:56 <d> ffffea0006679c20 ffff880219e57ed0 ffff88021831dd30 ffffffff810522e6
Feb 16 06:06:56 Call Trace:
Feb 16 06:06:56  [<ffffffff81052268>] ? flush_tlb_others_ipi+0x128/0x130
Feb 16 06:06:56  [<ffffffff810522e6>] ? native_flush_tlb_others+0x76/0x90
Feb 16 06:06:56  [<ffffffff8105240e>] ? flush_tlb_page+0x5e/0xb0
Feb 16 06:06:56  [<ffffffff811721c2>] ? try_to_merge_with_ksm_page+0x532/0x660
Feb 16 06:06:56  [<ffffffff811731a4>] ? ksm_scan_thread+0xeb4/0x1120
Feb 16 06:06:56  [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Feb 16 06:06:56  [<ffffffff811722f0>] ? ksm_scan_thread+0x0/0x1120
Feb 16 06:06:56  [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
Feb 16 06:06:56  [<ffffffff8100c20a>] ? child_rip+0xa/0x20
Feb 16 06:06:56  [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
Feb 16 06:06:56  [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 16 06:06:56 Code: c0 7e 24 48 83 3f 00 48 89 f8 74 13 eb 5c 0f 1f 40 00 48 8b 48 08 48 83 c0 08 48 85 c9 75 4b 83 c2 01 41 39 d0 7f eb 40 f6 c6 3f <b8> 01 00 00 00 75 08 c9 c3 66 0f 1f 44 00 00 89 f0 48 63 d2 c1
Feb 16 06:07:01 Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
Feb 16 06:07:01 Pid: 1950, comm: qemu-kvm Not tainted 2.6.32-504.8.1.el6.centos.plus.x86_64 #1
Feb 16 06:07:01 Call Trace:
Feb 16 06:07:01  <NMI>
Feb 16 06:07:01  [<ffffffff81530bdc>] ? panic+0xa7/0x16f
Feb 16 06:07:01  [<ffffffff81014959>] ? sched_clock+0x9/0x10
Feb 16 06:07:01  [<ffffffff810ea65d>] ? watchdog_overflow_callback+0xcd/0xd0
Feb 16 06:07:01  [<ffffffff81120e07>] ? __perf_event_overflow+0xa7/0x240
Feb 16 06:07:01  [<ffffffff81119e14>] ? perf_event_update_userpage+0x24/0x110
Feb 16 06:07:01  [<ffffffff81121454>] ? perf_event_overflow+0x14/0x20
Feb 16 06:07:01  [<ffffffff8101e3fb>] ? x86_pmu_handle_irq+0x1eb/0x250
Feb 16 06:07:01  [<ffffffff81535ed9>] ? perf_event_nmi_handler+0x39/0xb0
Feb 16 06:07:01  [<ffffffff81537995>] ? notifier_call_chain+0x55/0x80
Feb 16 06:07:01  [<ffffffff815379fa>] ? atomic_notifier_call_chain+0x1a/0x20
Feb 16 06:07:01  [<ffffffff810a4ede>] ? notify_die+0x2e/0x30
Feb 16 06:07:01  [<ffffffff8153565b>] ? do_nmi+0x1bb/0x340
Feb 16 06:07:01  [<ffffffff81534f20>] ? nmi+0x20/0x30
Feb 16 06:07:01  [<ffffffff8153478e>] ? _spin_lock+0x1e/0x30
Feb 16 06:07:01  <<EOE>>
Feb 16 06:07:01  [<ffffffff8114fdd3>] ? handle_pte_fault+0x833/0xb00
Feb 16 06:07:01  [<ffffffffa03987da>] ? kvm_ioapic_update_eoi+0x8a/0xf0 [kvm]
Feb 16 06:07:01  [<ffffffff811502ca>] ? handle_mm_fault+0x22a/0x300
Feb 16 06:07:01  [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
Feb 16 06:07:01  [<ffffffff8105d7d1>] ? update_curr+0xe1/0x1f0
Feb 16 06:07:01  [<ffffffff81063bf3>] ? perf_event_task_sched_out+0x33/0x70
Feb 16 06:07:01  [<ffffffff8100bc0e>] ? invalidate_interrupt0+0xe/0x20
Feb 16 06:07:01  [<ffffffff81060c0c>] ? finish_task_switch+0x4c/0xf0
Feb 16 06:07:01  [<ffffffff815378de>] ? do_page_fault+0x3e/0xa0
Feb 16 06:07:01  [<ffffffff81534c95>] ? page_fault+0x25/0x30
Feb 16 06:07:01  [<ffffffff8129e862>] ? copy_user_generic_string+0x32/0x40
Feb 16 06:07:01  [<ffffffffa03926ab>] ? kvm_write_guest_cached+0x7b/0xa0 [kvm]
Feb 16 06:07:01  [<ffffffffa03bf61f>] ? kvm_lapic_sync_to_vapic+0xcf/0x220 [kvm]
Feb 16 06:07:01  [<ffffffffa03bdfb8>] ? kvm_apic_has_interrupt+0x48/0xd0 [kvm]
Feb 16 06:07:01  [<ffffffffa03ac24d>] ? kvm_arch_vcpu_ioctl_run+0x93d/0x1010 [kvm]
Feb 16 06:07:01  [<ffffffff810b2b73>] ? futex_wake+0x93/0x150
Feb 16 06:07:01  [<ffffffffa0392b04>] ? kvm_vcpu_ioctl+0x434/0x580 [kvm]
Feb 16 06:07:01  [<ffffffff81063bf3>] ? perf_event_task_sched_out+0x33/0x70
Feb 16 06:07:01  [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
Feb 16 06:07:01  [<ffffffff811a3e92>] ? vfs_ioctl+0x22/0xa0
Feb 16 06:07:01  [<ffffffff811a435a>] ? do_vfs_ioctl+0x3aa/0x580
Feb 16 06:07:01  [<ffffffff811a45b1>] ? sys_ioctl+0x81/0xa0
Feb 16 06:07:01  [<ffffffff810e5afe>] ? __audit_syscall_exit+0x25e/0x290
Feb 16 06:07:01  [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Feb 16 06:07:01 drm_kms_helper: panic occurred, switching back to text console
Feb 16 06:07:01 BUG: scheduling while atomic: qemu-kvm/1950/0x14010000
Feb 16 06:07:01 Modules linked in: nf_nat mpt3sas mpt2sas raid_class mptctl ipmi_si ipmi_devintf netconsole configfs ebtable_nat ebtables nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_snapshot dm_bufio dm_zero vhost_net macvtap macvlan tun kvm_amd kvm ipmi_msghandler dcdbas serio_raw bnx2 k10temp amd64_edac_mod edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas ata_generic pata_acpi sata_svw radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dell_rbu]
Feb 16 06:07:01 Pid: 1950, comm: qemu-kvm Not tainted 2.6.32-504.8.1.el6.centos.plus.x86_64 #1
Feb 16 06:07:01 Call Trace:
Feb 16 06:07:01  <NMI>
Feb 16 06:07:01  [<ffffffff81060bb6>] ? __schedule_bug+0x66/0x70
Feb 16 06:07:01  [<ffffffff8153193c>] ? thread_return+0x6ac/0x7d0
Feb 16 06:07:01  [<ffffffffa002e35d>] ? write_msg+0xfd/0x110 [netconsole]
Feb 16 06:07:01  [<ffffffffa00b2d0e>] ? drm_crtc_helper_set_config+0x1be/0xa60 [drm_kms_helper]
Feb 16 06:07:01  [<ffffffff8106c85a>] ? __cond_resched+0x2a/0x40
Feb 16 06:07:01  [<ffffffff81531d30>] ? _cond_resched+0x30/0x40
Feb 16 06:07:01  [<ffffffff81174e18>] ? __kmalloc+0x138/0x230
Feb 16 06:07:01  [<ffffffff810ba332>] ? __module_text_address+0x12/0x60
Feb 16 06:07:01  [<ffffffffa00b2d0e>] ? drm_crtc_helper_set_config+0x1be/0xa60 [drm_kms_helper]
Feb 16 06:07:01  [<ffffffffa013df27>] ? r100_mm_wreg+0x67/0x90 [radeon]
Feb 16 06:07:01  [<ffffffffa01332d2>] ? radeon_crtc_cursor_set+0x92/0x6e0 [radeon]
Feb 16 06:07:01  [<ffffffffa005e40c>] ? drm_mode_set_config_internal+0x5c/0xe0 [drm]
Feb 16 06:07:01  [<ffffffffa00b0653>] ? drm_fb_helper_restore_fbdev_mode+0xb3/0xe0 [drm_kms_helper]
Feb 16 06:07:01  [<ffffffffa00b0788>] ? drm_fb_helper_panic+0x78/0xa0 [drm_kms_helper]
Feb 16 06:07:01  [<ffffffff81537995>] ? notifier_call_chain+0x55/0x80
Feb 16 06:07:01  [<ffffffff815379fa>] ? atomic_notifier_call_chain+0x1a/0x20
Feb 16 06:07:01  [<ffffffff81530c07>] ? panic+0xd2/0x16f
Feb 16 06:07:01  [<ffffffff81014959>] ? sched_clock+0x9/0x10
Feb 16 06:07:01  [<ffffffff810ea65d>] ? watchdog_overflow_callback+0xcd/0xd0
Feb 16 06:07:01  [<ffffffff81120e07>] ? __perf_event_overflow+0xa7/0x240
Feb 16 06:07:01  [<ffffffff81119e14>] ? perf_event_update_userpage+0x24/0x110
Feb 16 06:07:01  [<ffffffff81121454>] ? perf_event_overflow+0x14/0x20
Feb 16 06:07:01  [<ffffffff8101e3fb>] ? x86_pmu_handle_irq+0x1eb/0x250
Feb 16 06:07:01  [<ffffffff81535ed9>] ? perf_event_nmi_handler+0x39/0xb0
Feb 16 06:07:01  [<ffffffff81537995>] ? notifier_call_chain+0x55/0x80
Feb 16 06:07:01  [<ffffffff815379fa>] ? atomic_notifier_call_chain+0x1a/0x20
Feb 16 06:07:01  [<ffffffff810a4ede>] ? notify_die+0x2e/0x30
Feb 16 06:07:01  [<ffffffff8153565b>] ? do_nmi+0x1bb/0x340
Feb 16 06:07:01  [<ffffffff81534f20>] ? nmi+0x20/0x30
Feb 16 06:07:01  [<ffffffff8153478e>] ? _spin_lock+0x1e/0x30
Feb 16 06:07:01  <<EOE>>
Feb 16 06:07:01  [<ffffffff8114fdd3>] ? handle_pte_fault+0x833/0xb00
Feb 16 06:07:01  [<ffffffffa03987da>] ? kvm_ioapic_update_eoi+0x8a/0xf0 [kvm]
Feb 16 06:07:01  [<ffffffff811502ca>] ? handle_mm_fault+0x22a/0x300
Feb 16 06:07:01  [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
Feb 16 06:07:01  [<ffffffff8105d7d1>] ? update_curr+0xe1/0x1f0
Feb 16 06:07:01  [<ffffffff81063bf3>] ? perf_event_task_sched_out+0x33/0x70
Feb 16 06:07:01  [<ffffffff8100bc0e>] ? invalidate_interrupt0+0xe/0x20
Feb 16 06:07:01  [<ffffffff81060c0c>] ? finish_task_switch+0x4c/0xf0
Feb 16 06:07:01  [<ffffffff815378de>] ? do_page_fault+0x3e/0xa0
Feb 16 06:07:01  [<ffffffff81534c95>] ? page_fault+0x25/0x30
Feb 16 06:07:01  [<ffffffff8129e862>] ? copy_user_generic_string+0x32/0x40
Feb 16 06:07:01  [<ffffffffa03926ab>] ? kvm_write_guest_cached+0x7b/0xa0 [kvm]
Feb 16 06:07:01  [<ffffffffa03bf61f>] ? kvm_lapic_sync_to_vapic+0xcf/0x220 [kvm]
Feb 16 06:07:01  [<ffffffffa03bdfb8>] ? kvm_apic_has_interrupt+0x48/0xd0 [kvm]
Feb 16 06:07:01  [<ffffffffa03ac24d>] ? kvm_arch_vcpu_ioctl_run+0x93d/0x1010 [kvm]
Feb 16 06:07:01  [<ffffffff810b2b73>] ? futex_wake+0x93/0x150
Feb 16 06:07:01  [<ffffffffa0392b04>] ? kvm_vcpu_ioctl+0x434/0x580 [kvm]
Feb 16 06:07:01  [<ffffffff81063bf3>] ? perf_event_task_sched_out+0x33/0x70
Feb 16 06:07:01  [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
Feb 16 06:07:01  [<ffffffff811a3e92>] ? vfs_ioctl+0x22/0xa0
Feb 16 06:07:01  [<ffffffff811a435a>] ? do_vfs_ioctl+0x3aa/0x580
Feb 16 06:07:01  [<ffffffff811a45b1>] ? sys_ioctl+0x81/0xa0
Feb 16 06:07:01  [<ffffffff810e5afe>] ? __audit_syscall_exit+0x25e/0x290
Feb 16 06:07:01  [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Feb 16 06:07:01 Clocksource tsc unstable (delta = -77309385171 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.
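For anyone wanting to set up the same kind of capture, a minimal netconsole invocation looks roughly like this; only the source address below matches this host, the target address, ports and MAC are placeholders:

modprobe netconsole netconsole=6665@192.168.13.230/eth0,6666@192.168.13.1/00:11:22:33:44:55
nc -u -l 6666 > netconsole.log   # on the receiving host; flags vary by netcat flavor, any UDP listener (or rsyslog) on the target port will do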
<snip/>

-Jason
I think the panic is the consequence of drive write failure. So the actual problem is before the panic call trace. I'd post the entire dmesg somewhere wrap-safe (either your mail agent or the forum is hard-wrapping, and it's a pain to read).
What do you get for smartctl -x <dev>
In the meantime, check or replace cables; usually it's the connectors that are faulty, not the cable itself. Or replace the drive.
Chris Murphy
-----Original Message----- From: Chris Murphy Sent: Tuesday, February 17, 2015 3:58
I think the panic is the consequence of drive write failure. So the actual problem is before the panic call trace.
Most of the time it panics without any warning, but once there was:
-----Original Message----- From: Jason Pyeron Sent: Sunday, February 08, 2015 0:00
-----Original Message----- From: Jason Pyeron Sent: Saturday, February 07, 2015 22:54
Feb 8 00:10:21 thirteen-230 kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880057a0a080)
Feb 8 00:10:21 thirteen-230 kernel: sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 1a 17 a1 6f 00 00 01 00
Feb 8 00:10:51 thirteen-230 kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
Feb 8 00:10:51 thirteen-230 kernel: mptbase: ioc0: Initiating recovery
Feb 8 00:11:13 thirteen-230 kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880057a0a080)
I'd post the entire dmesg somewhere
http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
wrap-safe (either your mail agent or the forum is hard-wrapping, and it's a pain to read).
What do you get for smartctl -x <dev>
http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
In the meantime, check or replace cables; usually it's the connectors that
It is a backplane, no "cables". I have reseated the parts.
are faulty, not the cable itself. Or replace the drive.
I have replaced the drive (and reinstalled) already; the panics still happen once every 30-40 hours.
Chris Murphy
On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron jpyeron@pdinc.us wrote:
I'd post the entire dmesg somewhere
http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
At least part of the problem happens before this log starts.
What do you get for smartctl -x <dev>
http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
OK, no SMART extended test has been done, but also no pending bad or reallocated sectors, and no phy event errors either. So the Write(10) error seems isolated, but it's still really suspicious, so I'd start replacing hardware.
I have replaced the drive (and reinstalled) already; the panics still happen once every 30-40 hours.
The only thing that suggests it might not be hardware is all the kvm-related messages in the kernel panic. So if you've changed kernels or VM configuration recently, then I'd revert. That's the limit of the most likely software explanation. If there are no recent software changes, then it must be hardware.
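If it helps to rule that out quickly, both rpm and yum keep enough history on EL6 to show whether anything changed recently; for example:

rpm -qa --last | head -20         # most recently installed or updated packages, newest first
yum history list all | head -20   # recent yum transactions (install/update/erase)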
-----Original Message----- From: Chris Murphy Sent: Tuesday, February 17, 2015 20:48
On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
I'd post the entire dmesg somewhere
http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
At least part of the problem happens before this log starts.
Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 15 23:41:21 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8613 seconds.
Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 16 02:04:55 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8735 seconds.
Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.
What do you get for smartctl -x <dev>
http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
OK, no SMART extended test has been done, but also no pending bad or reallocated sectors, and no phy event errors either. So the Write(10) error seems isolated, but it's still really suspicious, so I'd start replacing hardware.
Dell tech is en route. New system board and disk controller.
I have replaced the drive (and reinstalled) already, the
panics still happen once every 30-40 hours.
The only thing that suggests it might not be hardware is all the kvm-related messages in the kernel panic.
How so? Each of the results I find says these are to be ignored.
So if you've changed kernels or VM configuration recently, then I'd revert. That's the limit of the most
No changes from install out of the box.
likely software explanation. If there are no recent software changes, then it must be hardware.
On Tue, Feb 17, 2015 at 7:34 PM, Jason Pyeron jpyeron@pdinc.us wrote:
-----Original Message----- From: Chris Murphy Sent: Tuesday, February 17, 2015 20:48
On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
I'd post the entire dmesg somewhere
http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
At least part of the problem happens before this log starts.
<snip/>
Doesn't seem related.
What do you get for smartctl -x <dev>
http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
OK, no SMART extended test has been done, but also no pending bad or reallocated sectors, and no phy event errors either. So the Write(10) error seems isolated, but it's still really suspicious, so I'd start replacing hardware.
Dell tech is en route. New system board and disk controller.
I'm curious what they replace.
I have replaced the drive (and reinstalled) already, the
panics still happen once every 30-40 hours.
The only thing that suggests it might not be hardware is all the kvm-related messages in the kernel panic.
How so? Each of the results I find says these are to be ignored.
Well, I found two older kernel bugs similar to this that suggested the problem stopped happening when running kvm with 1 vcpu, and in another case when the VM was rebuilt 32-bit instead of 64-bit. But my ability to read kernel call traces is very limited; I really don't know what's going on.
If it's a kernel bug though, you could maybe clobber it with a substantially newer kernel. You might check out elrepo kernels. 2.6.32 is really old; granted, the CentOS one you're running has a huge pile of backports that makes it less "ancient" from a stability perspective, but anything really new that's hard to backport likely isn't in that kernel. While you're waiting for Dell you could try either:
kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm
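If you go that route, the usual elrepo recipe is roughly the following; the elrepo-release version number here is a guess, so check elrepo.org for the current package name:

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-6-6.el6.elrepo.noarch.rpm
yum --enablerepo=elrepo-kernel install kernel-ml
# then make sure the new kernel is the default entry in /boot/grub/grub.conf before rebooting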
What's running in the VM?
-----Original Message----- From: Chris Murphy Sent: Tuesday, February 17, 2015 23:38
On Tue, Feb 17, 2015 at 7:34 PM, Jason Pyeron wrote:
-----Original Message----- From: Chris Murphy Sent: Tuesday, February 17, 2015 20:48
On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
I'd post the entire dmesg somewhere
http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
At least part of the problem happens before this log starts.
<snip/>
Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.
Doesn't seem related.
What do you get for smartctl -x <dev>
http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
OK, no SMART extended test has been done, but also no pending bad or reallocated sectors, and no phy event errors either. So the Write(10) error seems isolated, but it's still really suspicious, so I'd start replacing hardware.
Dell tech is en route. New system board and disk controller.
I'm curious what they replace.
Both, but the backplane is not on the replacement list.
I have replaced the drive (and reinstalled) already, the
panics still happen once every 30-40 hours.
The only thing that suggests it might not be hardware is all the kvm-related messages in the kernel panic.
How so? Each of the results I find says these are to be ignored.
Well, I found two older kernel bugs similar to this that suggested the problem stopped happening when running kvm with 1 vcpu, and in another case when the VM was rebuilt 32-bit instead of 64-bit. But my ability to read kernel call traces is very limited; I really don't know what's going on.
I can say we have about 20 identical systems doing the same work: PE2970 running RHEL6/CentOS6 and libvirtd.
If it's a kernel bug though, you could maybe clobber it with a substantially newer kernel. You might check out elrepo kernels. 2.6.32 is really old; granted, the CentOS one you're running has a huge pile of backports that makes it less "ancient" from a stability
We should start looking at CentOS 7/RHEL 7, ugh, systemd... But these machines are ancient too.
perspective, but anything really new that's hard to backport likely isn't in that kernel. While you're waiting for Dell you could try either:
kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm
Unlikely, since I do not have a test plan. If I could reproduce the error on demand, then it would be a valid experiment. Some of the systems are running RHEL6, which is under support, while the others are CentOS 6. The configs are kept as close as possible to each other.
Besides, I am doing the migration to another host right now.
What's running in the VM?
Mostly RHEL6/CentOS6 VMs, but there are some Windows systems too. This system was handling most of the CipherShed.org Jenkins CI farm. I can say the resources are oversubscribed by 15x, but the system load is below 0.10 at any random time.
Thanks for the thoughts on this.
-Jason
On Tue, Feb 17, 2015 at 10:02 PM, Jason Pyeron jpyeron@pdinc.us wrote:
I can say we have about 20 identical systems doing the same work: PE2970 running RHEL6/CentOS6 and libvirtd.
20 other identical systems doing the same work strongly suggests a hardware problem when there's a single outlier.
If it's a kernel bug though, you could maybe clobber it with a substantially newer kernel. You might check out elrepo kernels. 2.6.32 is really old; granted, the CentOS one you're running has a huge pile of backports that makes it less "ancient" from a stability
We should start looking at CentOS 7/RHEL 7, ugh, systemd... But these machines are ancient too.
I've been using it since Fedora 15; I find it easier to use for troubleshooting boot and service startup problems. systemd-analyze blame/plot are quite useful for boot performance optimization. The journal on Fedora these days is persistent; on CentOS it's volatile, with rsyslog running by default. But I like being able to journalctl -b -2 or -b -3 to view previous boots, point all systems at a single server, seal the journal logs against tampering, etc. It's certainly different, but it wasn't onerous to get used to, and these days I prefer it.
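Concretely, the bits mentioned above look like this (a sketch; on CentOS 7 the journal only becomes persistent once /var/log/journal exists):

mkdir -p /var/log/journal && systemctl restart systemd-journald   # switch journald to persistent storage
journalctl -b -1                  # everything from the previous boot
journalctl -b -2 -p err           # only errors, two boots back
systemd-analyze blame             # per-service startup times
systemd-analyze plot > boot.svg   # boot timeline as an SVG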
perspective, but anything really new that's hard to backport likely isn't in that kernel. While you're waiting for Dell you could try either:
kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm
Unlikely, since I do not have a test plan. If I could reproduce the error on demand, then it would be a valid experiment. Some of the systems are running RHEL6, which is under support, while the others are CentOS 6. The configs are kept as close as possible to each other.
I'd say it's unnecessary at this point. It's almost certainly a hardware problem, given the numerous identical setups not having this problem. But seeing as it panics every 30-40 hours, it can hardly be much worse with a new kernel running for a couple of days... but my bet is there'd be no change.