On Wed, May 22, 2019 at 10:22 AM mark m.roth@5-cent.us wrote:
It seems unlikely. It's a 4U server, with 36 disks (and the dual root disks), in a machine room, and ipmitool sel list shows nada. Nor are there any warnings, as I've occasionally seen on other systems, that the CPU is overheating and being throttled.
If this is a recent server (Ivy Bridge/Haswell/Broadwell), then I’ve seen the “edac” kernel module prevent the SEL from showing faults when an MCE (machine-check exception) occurs. Disable edac and, poof, the server stops crashing and/or the SEL shows something useful (ECC/MCE). Did you check /var/log/mcelog?
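
For a quick look at both things, here is a rough Python sketch. It assumes a Linux box where /proc/modules is readable and where mcelog, if it is running, writes to /var/log/mcelog. The function names are made up for illustration, and matching on "edac" in the module name is only a heuristic (it would catch names like sb_edac or edac_core), not an authoritative list of EDAC drivers.

```python
from pathlib import Path


def loaded_edac_modules(proc_modules_text):
    """Return names of loaded kernel modules whose name contains 'edac'.

    Expects text in /proc/modules format, where the first
    whitespace-separated field on each line is the module name.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names


def mcelog_has_entries(path="/var/log/mcelog"):
    """Return True if the log file exists and is non-empty."""
    p = Path(path)
    try:
        return p.is_file() and p.stat().st_size > 0
    except OSError:
        # Unreadable or inaccessible path: treat as "nothing to see".
        return False


if __name__ == "__main__":
    modules_file = Path("/proc/modules")
    text = modules_file.read_text() if modules_file.exists() else ""
    print("EDAC modules loaded:", loaded_edac_modules(text) or "none")
    print("mcelog has entries:", mcelog_has_entries())
```

If that lists an EDAC module, one common way to keep it from loading is a blacklist line in a file under /etc/modprobe.d/, but check which module your platform actually uses before doing that.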