[CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort!

> -----Original Message-----
> From: Chris Murphy
> Sent: Tuesday, February 17, 2015 23:38
> 
> On Tue, Feb 17, 2015 at 7:34 PM, Jason Pyeron wrote:
> >> -----Original Message-----
> >> From: Chris Murphy
> >> Sent: Tuesday, February 17, 2015 20:48
> >>
> >> On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
> >> >> I'd post the entire dmesg somewhere
> >> >
> >> > http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
> >>
> >> At least part of the problem happens before this log starts.
> >
<snip/>
> > Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> > Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> > Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.
> 
> Doesn't seem related.
> 
> 
> >
> >>
> >> >> What do you get for
> >> >> smartctl -x <dev>
> >> >
> >> > http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
> >>
> >> OK no smart extended test has been done, but also no pending bad or
> >> relocated sectors, and no phy event errors either. So the write (10)
> >> error seems isolated but it's still really suspicious, so I'd start
> >> replacing hardware.
> >
> > Dell tech is enroute. New system board and disk controller.
> 
> I'm curious what they replace.

Both, but the backplane is not on the replacement list.

> 
> >
> >>
> >>
> >> > I have replaced the drive (and reinstalled) already; the panics still happen once every 30-40 hours.
> >>
> >> The only thing that suggests it might not be hardware are all the kvm
> >> related messages in the kp.
> >
> > How so, each of the results I find say these are to be ignored.
> 
> Well I found two older kernel bugs similar to this that suggested the
> problem stopped happening when running kvm with 1vcpu, and in another
> case when the VM was rebuilt 32-bit instead of 64-bit. But my ability
> to read kernel call traces is very limited, I really don't know what's
> going on.
> 

I can say we have about 20 identical systems doing the same work: PE2970s running RHEL6/CentOS6 and libvirtd.

> If it's a kernel bug though, you could maybe clobber it with a
> substantially newer kernel. You might check out elrepo kernels. 2.6.32
> is really old, granted the centos one you're running has a huge pile
> of backports that makes it less "ancient" from a stability

We should start looking at CentOS7/RHEL7 (ugh, systemd...), but these machines are ancient too.

> perspective, but anything really new that's hard to backport likely
> isn't in that kernel. While you're waiting for Dell you could try
> either:
> 
> kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
> kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm

Unlikely, since I do not have a test plan. If I could reproduce the error on demand then it would be a valid experiment. Some of the systems are running RHEL6, which is under support, while the others are CentOS6. The configs are kept as close as possible to each other.

Besides, I am migrating to another host right now.
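
On the test plan point: without a way to trigger the fault on demand, one option would be to at least measure how often it fires, so that any later change could be judged against a baseline. A rough Python sketch, assuming the aborts land in syslog in the same timestamp format as the dhclient lines quoted above and contain the text "attempting task abort" from the subject line. The function names and the year argument are made up for illustration (syslog stamps carry no year), and I have not run this against the real logs.

```python
import re
from datetime import datetime

# Text taken from the subject line; assumed to appear verbatim in each event.
PATTERN = "attempting task abort"
# Classic syslog stamp, e.g. "Feb 16 04:30:30" (month name, day, time).
STAMP = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")

def abort_times(lines, year):
    """Return the timestamps of all lines that mention a task abort."""
    times = []
    for line in lines:
        if PATTERN not in line:
            continue
        m = STAMP.match(line)
        if m:
            # %b is locale dependent; this assumes English month names.
            times.append(datetime.strptime(f"{year} {m.group(1)}",
                                           "%Y %b %d %H:%M:%S"))
    return times

def gaps_in_hours(times):
    """Hours elapsed between each pair of consecutive events."""
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
```

Fed the lines of the host's syslog (typically /var/log/messages on CentOS 6), the resulting gaps could be compared before and after a change. With events only every 30-40 hours, it would likely take a long observation window before the comparison means much.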

> 
> What's running in the VM?

Mostly RHEL6/CentOS6 VMs, but there are some Windows systems too. This system was handling most of the CipherShed.org Jenkins CI farm. I can say the resources are oversubscribed by 15x, but the system runs at below 0.10 at any random time.

Thanks for the thoughts on this.

-Jason

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.