-----Original Message----- From: Chris Murphy Sent: Tuesday, February 17, 2015 23:38
On Tue, Feb 17, 2015 at 7:34 PM, Jason Pyeron wrote:
-----Original Message----- From: Chris Murphy Sent: Tuesday, February 17, 2015 20:48
On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
I'd post the entire dmesg somewhere
http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
At least part of the problem happens before this log starts.
<snip/>
Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.
Doesn't seem related.
What do you get for smartctl -x <dev>
http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
OK, no SMART extended test has been done, but there are also no pending, bad, or relocated sectors, and no phy event errors either. So the write (10) error seems isolated, but it's still really suspicious, so I'd start replacing hardware.
Dell tech is en route. New system board and disk controller.
I'm curious what they replace.
Both, but the backplane is not on the replacement list.
I have replaced the drive (and reinstalled) already; the panics still happen once every 30-40 hours.
The only thing that suggests it might not be hardware is all the kvm-related messages in the kp.
How so? Each of the results I find says these are to be ignored.
Well, I found two older kernel bugs similar to this. In one, the problem stopped happening when running kvm with 1 vCPU, and in the other, when the VM was rebuilt 32-bit instead of 64-bit. But my ability to read kernel call traces is very limited; I really don't know what's going on.
I can say we have about 20 identical systems doing the same work: PE2970s running RHEL6/CentOS6 and libvirtd.
If it's a kernel bug though, you could maybe clobber it with a substantially newer kernel. You might check out elrepo kernels. 2.6.32 is really old, granted the centos one you're running has a huge pile of backports that makes it less "ancient" from a stability
We should start looking at CentOS7/RHEL7 (ugh, systemd...). But these machines are ancient too.
perspective, but anything really new that's hard to backport likely isn't in that kernel. While you're waiting for Dell you could try either:
kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm
Unlikely, since I do not have a test plan. If I could reproduce the error on demand, then it would be a valid experiment. Some of the systems are running RHEL6, which is under support, while the others are CentOS6. The configs are kept as close as possible to each other.
Besides, I am doing the migration to another host right now.
What's running in the VM?
Mostly RHEL6/CentOS6 VMs, but there are some Windows systems too. This system was handling most of the CipherShed.org Jenkins CI farm. I can say the resources are oversubscribed by 15x, but the system runs at below 0.10 at any random time.
Thanks for the thoughts on this.
-Jason
-- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Principal Consultant 10 West 24th Street #100 - - +1 (443) 269-1555 x333 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is copyright PD Inc, subject to license 20080407P00.