[CentOS] Intermittent problem, likely disk IO related - mptscsih: ioc0: attempting task abort!

Wed Feb 18 05:41:20 UTC 2015
Chris Murphy <lists at colorremedies.com>

On Tue, Feb 17, 2015 at 10:02 PM, Jason Pyeron <jpyeron at pdinc.us> wrote:

> I can say, we have about 20 of the identical systems, doing the same work. PE2970 running RHEL6/Centos6 and libvirtd

20 other identical systems doing the same work strongly suggests a
hardware problem when there's a single outlier.
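
One way to confirm that only one box is the outlier is to count the
task-abort messages per host across the collected kernel logs. This is
only a rough sketch: the marker string is taken from the subject line,
and the function names and the host-to-lines mapping are made up for
illustration, so adapt it to however the logs are actually gathered.

```python
# Substring taken from the subject line of this thread.
ABORT_MARKER = "mptscsih: ioc0: attempting task abort!"


def count_task_aborts(logs_by_host):
    """Return {host: number of log lines containing ABORT_MARKER}.

    logs_by_host is a hypothetical mapping of hostname -> iterable of
    log lines (for example, lines read from each host's kernel log).
    """
    return {
        host: sum(1 for line in lines if ABORT_MARKER in line)
        for host, lines in logs_by_host.items()
    }


def hosts_with_aborts(logs_by_host):
    """List (host, count) pairs with at least one abort, worst first."""
    counts = count_task_aborts(logs_by_host)
    flagged = [(host, n) for host, n in counts.items() if n > 0]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

If only one host out of the ~20 ever shows up in that list, that points
at the hardware on that host rather than at the shared software stack.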

>
>> If it's a kernel bug though, you could maybe clobber it with a
>> substantially newer kernel. You might check out elrepo kernels. 2.6.32
>> is really old, granted the centos one you're running has a huge pile
>> of backports that makes it less "ancient" from a stability
>
> We should start looking at Centos7/RHEL7, ugh systemd..... But these machines are ancient too.

I've been using systemd since Fedora 15, and I find it easier to use
for troubleshooting boot and service startup problems. systemd-analyze
blame/plot are quite useful for optimizing boot performance. The
journal on Fedora these days is persistent; on CentOS it's volatile,
with rsyslog running by default. But I like being able to run
journalctl -b -2 or -b -3 to view previous boots, point all systems to
a single server, seal the journal logs against tampering, etc. It's
certainly different, but it wasn't onerous to get used to, and these
days I prefer it.
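
For a box that panics, the previous-boot view is the handy part. A
small sketch of pulling the error-level messages from an earlier boot
follows; the helper names are made up, and it assumes the journal is
persistent, since a volatile journal does not keep earlier boots.

```python
import subprocess


def journalctl_args(boots_back, priority="err"):
    """Build the argv for 'journalctl -b -N -p PRIORITY --no-pager'.

    boots_back=0 is the current boot, 1 the boot before it, and so on.
    """
    if boots_back < 0:
        raise ValueError("boots_back must be >= 0")
    return ["journalctl", "-b", f"-{boots_back}", "-p", priority, "--no-pager"]


def read_boot_log(boots_back, priority="err"):
    """Run journalctl and return its output as text.

    Raises subprocess.CalledProcessError if journalctl exits non-zero,
    for example when the requested boot is not in the journal.
    """
    result = subprocess.run(
        journalctl_args(boots_back, priority),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling read_boot_log(1) right after a crash should show the last
errors logged before the panic, provided they reached the disk.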

>
>> perspective, but anything really new that's hard to backport likely
>> isn't in that kernel. While you're waiting for Dell you could try
>> either:
>>
>> kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
>> kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm
>
> Unlikely, since I do not have a test plan. If I could reproduce the error on demand then it would be a valid experiment. Some of the systems are running RHEL6 which are under support, while the others are Centos6. The configs are kept as close as possible to each other.

I'd say it's unnecessary at this point. It's almost certainly a
hardware problem, given that the numerous identical setups don't have
this problem. But seeing as it panics every 30-40 hours, it can hardly
be much worse with a new kernel running for a couple of days... my bet,
though, is that there'd be no change.


-- 
Chris Murphy