[CentOS] CentOS 6 spontaneous reboots

Mon May 30 19:46:55 UTC 2016
Bill Gee <bgee at campercaver.net>

Hello everyone -

I found the core dumps.  They are in /var/crash.  This directory contains a 
directory for each crash, named by IP address-date-time.  Each directory 
contains a vmcore and a vmcore-dmesg.txt file.
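
For anyone wanting to check the same layout on their own machine, here is a minimal Python sketch that lists each crash directory and reports whether the two files are present. The function name list_crash_dumps is made up for illustration, and /var/crash is assumed to be the dump location, as described above.

```python
import os


def list_crash_dumps(crash_root="/var/crash"):
    """Return (dirname, has_vmcore, has_dmesg) for each subdirectory.

    crash_root defaults to /var/crash, the location described in this
    message; pass another path if your dumps are stored elsewhere.
    """
    results = []
    if not os.path.isdir(crash_root):
        return results
    for name in sorted(os.listdir(crash_root)):
        path = os.path.join(crash_root, name)
        if not os.path.isdir(path):
            continue  # ignore stray files at the top level
        has_vmcore = os.path.isfile(os.path.join(path, "vmcore"))
        has_dmesg = os.path.isfile(os.path.join(path, "vmcore-dmesg.txt"))
        results.append((name, has_vmcore, has_dmesg))
    return results


if __name__ == "__main__":
    for name, has_vmcore, has_dmesg in list_crash_dumps():
        print(name,
              "vmcore" if has_vmcore else "-",
              "vmcore-dmesg.txt" if has_dmesg else "-")
```

Reading /var/crash normally requires root, so the script would have to be run with sufficient privileges.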

The vmcore-dmesg.txt files are mostly the kernel initialization stuff, same as you 
would see in dmesg.  At the end, though, is some information about the 
process that was executing when the crash happened.  

I reviewed several of those and found a common process - aiccu!  That seems 
very odd since I have been running aiccu and SixXS for over five years.  It has 
never given me any trouble before.  The package I have on this server came 
from the EPEL repository and has not changed for several years.  The SixXS web 
site also shows no change in aiccu for many years.
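
Comparing the dumps by hand can be automated. The sketch below is an assumption-laden illustration, not a tested tool: it assumes the oops text at the end of each vmcore-dmesg.txt contains a "comm: NAME" field (as kernel oops reports of this era usually do), and the function names are hypothetical.

```python
import glob
import re
from collections import Counter

# Assumption: the oops report names the running process as "comm: NAME".
COMM_RE = re.compile(r"\bcomm:\s*(\S+)")


def crashing_process(dmesg_text):
    """Return the last process name found in a 'comm:' field, or None."""
    names = COMM_RE.findall(dmesg_text)
    return names[-1] if names else None


def count_crashing_processes(paths):
    """Count how often each process name appears across the given files."""
    counts = Counter()
    for path in paths:
        # errors="replace" keeps odd bytes in the log from aborting the scan
        with open(path, "r", errors="replace") as handle:
            name = crashing_process(handle.read())
        if name is not None:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    files = glob.glob("/var/crash/*/vmcore-dmesg.txt")
    for name, hits in count_crashing_processes(files).most_common():
        print(name, hits)
```

If one name dominates the output, that matches the pattern described above; it only shows which process was running at crash time, not that the process itself is at fault.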

I also found, by chance, an operation that seems to always trigger the crash.  If I 
go to my main workstation (Fedora 23) and tell Akregator to "refresh all feeds", 
that is guaranteed to produce a crash.  There are probably other operations that 
can force a crash, but I have not found them.  

For now I have turned off IPv6 forwarding and stopped the radvd service.  That 
should keep aiccu from handling anything.
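
To confirm the forwarding change took effect, the kernel setting can be read back. This is a small sketch that assumes the standard Linux procfs path for the net.ipv6.conf.all.forwarding sysctl; the function name is made up for illustration.

```python
# Assumption: standard procfs location of net.ipv6.conf.all.forwarding.
FORWARDING_PATH = "/proc/sys/net/ipv6/conf/all/forwarding"


def ipv6_forwarding_enabled(path=FORWARDING_PATH):
    """Return True if the file at path holds a non-zero integer."""
    with open(path, "r") as handle:
        return int(handle.read().strip()) != 0


if __name__ == "__main__":
    try:
        state = "on" if ipv6_forwarding_enabled() else "off"
        print("IPv6 forwarding is", state)
    except OSError as exc:
        # The file is absent when IPv6 is disabled or on non-Linux systems.
        print("Could not read forwarding state:", exc)
```

This only reports the running value; whether the setting survives a reboot depends on what is in /etc/sysctl.conf.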

It is nice to know it is not some funky hardware problem.  Still, it would be nice to 
have it working.  Any thoughts?

Thanks - Bill Gee


On Sunday, May 29, 2016 17:48:09 Keith Keller wrote:
> Hi Bill,
> 
> On 2016-05-30, Bill Gee <bgee at campercaver.net> wrote:
> > By luck I saw the beginning of a reboot on the server console.  Normally I
> > have other systems up on the KVM switch.  It appears to have dumped core.
> > I don't know where to look for the core dump files.  They are not in
> > /root.
> One place you might check is under /var/lib.  I think there may be a
> /var/lib/crash directory which contains core dumps.
> 
> > I ran MemTest 86+.  No memory errors were found.
> 
> Another option is to try Advanced Cluster Breakin, which runs other
> tests besides memory.
> 
> http://www.advancedclustering.com/products/software/breakin/
> 
> I've had it find problems that memtest hasn't (and vice-versa).
> 
> > Lm_sensors shows the processor running between 45 and 50C.
> 
> If the system supports IPMI, check those sensors and logs; there may be
> something useful there.  If you don't have IPMI, there may still be
> something in the BIOS logs (how you get to those varies wildly; you may
> need to boot into the BIOS to do it).
> 
> I hope that helps!
> 
> --keith