[CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort!

On Tue, Feb 17, 2015 at 7:34 PM, Jason Pyeron <jpyeron at pdinc.us> wrote:
>> -----Original Message-----
>> From: Chris Murphy
>> Sent: Tuesday, February 17, 2015 20:48
>>
>> On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
>> >> I'd post the entire dmesg somewhere
>> >
>> > http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
>>
>> At least part of the problem happens before this log starts.
>
> Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> Feb 15 23:41:19 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> Feb 15 23:41:21 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8613 seconds.
> Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> Feb 16 02:04:54 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> Feb 16 02:04:55 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 8735 seconds.
> Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
> Feb 16 02:46:09 thirteen-230 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
> Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffffd8f0
> Feb 16 03:53:39 thirteen-230 kernel: kvm: 2161: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
> Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.

That's only DHCP renewals and kvm perfctr messages, so it doesn't seem
related.
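
If you want to scan the rest of the log for storage-related lines instead of
reading past the DHCP noise by eye, here is a minimal sketch in Python. The
function name and the keyword list are my own assumptions (only the mptscsih
"task abort" message from the subject line is taken from this thread), so
adjust them to whatever your controller actually logs.

```python
import re

# Hypothetical keyword list: mptscsih/"task abort" come from the subject
# line of this thread; the other patterns are guesses at related messages.
SUSPECT = re.compile(r"mptscsih|mptbase|task abort|I/O error", re.IGNORECASE)


def suspicious_lines(lines):
    """Return the log lines that mention any storage-related keyword."""
    return [line.rstrip("\n") for line in lines if SUSPECT.search(line)]


# Example use against the system log (path assumed for CentOS 6):
#
#     with open("/var/log/messages") as log:
#         for line in suspicious_lines(log):
#             print(line)
```

Anything it prints with a timestamp shortly before a panic would be worth
posting; if it prints nothing, that at least suggests the controller isn't
complaining before the panics.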

>
>>
>> >> What do you get for
>> >> smartctl -x <dev>
>> >
>> > http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
>>
>> OK no smart extended test has been done, but also no pending bad or
>> relocated sectors, and no phy event errors either. So the write (10)
>> error seems isolated but it's still really suspicious, so I'd start
>> replacing hardware.
>
> Dell tech is en route. New system board and disk controller.

I'm curious what they replace.

>
>>
>>
>> > I have replaced the drive (and reinstalled) already, the
>> > panics still happen once every 30-40 hours.
>>
>> The only thing that suggests it might not be hardware are all the kvm
>> related messages in the kp.
>
> How so? Each of the results I find says these are to be ignored.

Well, I found two older kernel bug reports similar to this. In one, the
problem stopped happening when kvm was run with 1 vCPU; in the other,
it stopped when the VM was rebuilt as 32-bit instead of 64-bit. But my
ability to read kernel call traces is very limited, so I really don't
know what's going on.

If it's a kernel bug, though, you could maybe clobber it with a
substantially newer kernel. You might check out the elrepo kernels.
2.6.32 is really old. Granted, the CentOS one you're running has a huge
pile of backports that makes it less "ancient" from a stability
perspective, but anything really new that's hard to backport likely
isn't in that kernel. While you're waiting for Dell you could try
either:

kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm

What's running in the VM?

-- 
Chris Murphy