[CentOS] system unresponsive

Thu May 23 04:15:58 UTC 2019
Steven Tardy <sjt5atra at gmail.com>

On Wed, May 22, 2019 at 10:22 AM mark <m.roth at 5-cent.us> wrote:

> It seems unlikely. It's a 4U server, with 36 disks (and the dual root
> disks), in a machine room, and ipmitool sel list shows nada, nor are there
> any warnings, as I've seen on other systems occasionally, that the CPU is
> overheating, and is being throttled.


If this is a recent server (Ivy Bridge/Haswell/Broadwell), then I've seen the
"edac" kernel module prevent the SEL from showing faults when an MCE
(machine-check exception) occurs. Disable edac and, poof, the server stops
crashing and/or the SEL shows something useful (ECC/MCE). Did you check
/var/log/mcelog?
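
For what it's worth, here is a rough sketch of how one might check which
EDAC modules are loaded before disabling them. It only reads /proc/modules
and prints candidate "blacklist" lines for a modprobe.d file; it changes
nothing. The function names are mine, and the module names in the comments
(sb_edac, edac_core) are only examples of what may show up; the actual
driver depends on the platform, so treat this as an assumption to verify.

```python
from pathlib import Path


def loaded_edac_modules(proc_modules_text):
    """Return names of loaded modules whose name contains 'edac'.

    Expects text in /proc/modules format, where the first
    whitespace-separated field on each line is the module name.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names


def blacklist_lines(modules):
    """Build candidate modprobe.d 'blacklist' lines (hypothetical helper)."""
    return [f"blacklist {name}" for name in modules]


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    modules = loaded_edac_modules(text)  # e.g. sb_edac, edac_core
    if not modules:
        print("no EDAC modules loaded")
    for line in blacklist_lines(modules):
        # Review these before putting them in a file under /etc/modprobe.d/
        print(line)
```

The printed lines are only suggestions. Whether blacklisting is enough, or
whether a reboot is needed for it to take effect, depends on the system.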