> -----Original Message-----
> From: Chris Murphy
> Sent: Tuesday, February 17, 2015 23:38
>
> On Tue, Feb 17, 2015 at 7:34 PM, Jason Pyeron wrote:
> >> -----Original Message-----
> >> From: Chris Murphy
> >> Sent: Tuesday, February 17, 2015 20:48
> >>
> >> On Tue, Feb 17, 2015 at 7:54 AM, Jason Pyeron wrote:
> >> >> I'd post the entire dmesg somewhere
> >> >
> >> > http://client.pdinc.us/panic-341e97c30b5a4cb774942bae32d3f163.log
> >>
> >> At least part of the problem happens before this log starts.
> >
> > <snip/>
> > Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPREQUEST on br0 to 192.168.5.58 port 67 (xid=0x48d081b6)
> > Feb 16 04:30:30 thirteen-230 dhclient[1272]: DHCPACK from 192.168.5.58 (xid=0x48d081b6)
> > Feb 16 04:30:31 thirteen-230 dhclient[1272]: bound to 192.168.13.230 -- renewal in 9224 seconds.
> > Doesn't seem related.
> >
> >> >> What do you get for
> >> >> smartctl -x <dev>
> >> >
> >> > http://client.pdinc.us/smartctl-2000e86b62db27169cc9307358ebf10e.log
> >>
> >> OK no smart extended test has been done, but also no pending bad or
> >> relocated sectors, and no phy event errors either. So the write (10)
> >> error seems isolated but it's still really suspicious, so I'd start
> >> replacing hardware.
> >
> > Dell tech is enroute. New system board and disk controller.
>
> I'm curious what they replace.

Both, but the backplane is not on the replacement list.

> >> > I have replaced the drive (and reinstalled) already, the
> >> > panics still happen once ever 30-40 hours.
> >>
> >> The only thing that suggests it might not be hardware are all the kvm
> >> related messages in the kp.
> >
> > How so, each of the results I find say these are to be ignored.
>
> Well I found two older kernel bugs similar to this that suggested the
> problem stopped happening when running kvm with 1vcpu, and in another
> case when the VM was rebuilt 32-bit instead of 64-bit. But my ability
> to read kernel call traces is very limited, I really don't know what's
> going on.

I can say that we have about 20 identical systems doing the same work: PE2970s running RHEL6/CentOS6 and libvirtd.

> If it's a kernel bug though, you could maybe clobber it with a
> substantially newer kernel. You might check out elrepo kernels. 2.6.32
> is really old, granted the centos one you're running has a huge pile
> of backports that makes it less "ancient" from a stability

We should start looking at CentOS7/RHEL7 (ugh, systemd...). But these machines are ancient too.

> perspective, but anything really new that's hard to backport likely
> isn't in that kernel. While you're waiting for Dell you could try
> either:
>
> kernel-ml-3.18.6-1.el6.elrepo.x86_64.rpm
> kernel-ml-3.19.0-1.el6.elrepo.x86_64.rpm

Unlikely, since I do not have a test plan. If I could reproduce the error on demand, then it would be a valid experiment. Some of the systems are running RHEL6, which is under support, while the others are CentOS6. The configs are kept as close as possible to each other. Besides, I am migrating to another host right now.

> What's running in the VM?

Mostly RHEL6/CentOS6 VMs, but there are some Windows systems too. This system was handling most of the CipherShed.org Jenkins CI farm. I can say the resources are oversubscribed by 15x, but the system runs at below 0.10 at any random time.

Thanks for the thoughts on this.

-Jason

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.