Jon Pruente wrote:
On Wed, May 22, 2019 at 10:02 AM mark m.roth@5-cent.us wrote:
That seems unlikely. For one, I've seen that... but I *always* see entries in the log about the oom-killer being invoked. For another, this isn't a compute node, it's *only* a fileserver, serving projects, home directories, and backups (home-grown b/u, uses rsync), and backups don't start until well after midnight, and as we're business-hours only, there was less usage, and it does have 256G RAM....
I have two servers that would lock up like this occasionally, and if I let them sit at the console long enough sometimes they would give a login prompt. It took a lot of time and frustration (these are prod servers) but I tracked it down to a problem in the XFS driver, as it never occurred on the systems with EXT4 filesystems. The XFS driver would hang, preventing writes to the filesystem. I could identify exactly when that happened as all system logging would suddenly stop at the same second. Then OOMKiller would come in and start killing off processes but that wouldn't be in the logs on disk because the file system couldn't write. I rolled the servers back to a 5xx series kernel and the issue didn't resurface. I recently let them boot the newer 9xx series kernels and I'm hoping the XFS issue is fixed.
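For anyone who wants to check their own logs for the same "everything stops at the same second" symptom, here is a rough sketch, not the exact method I used. It assumes traditional syslog lines that begin with a "Mon DD HH:MM:SS" timestamp and an English locale; the function names, the sample lines, and the host name fs1 are all made up for illustration.

```python
from datetime import datetime

def parse_ts(line, year=2019):
    # Traditional syslog timestamps carry no year, so one is assumed here.
    # The first 15 characters look like "May 21 18:13:02".
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def largest_gap(lines, year=2019):
    """Return (gap, last_entry_before, first_entry_after) for the longest
    silence between consecutive timestamped lines, or None if there are
    fewer than two parseable lines."""
    stamps = []
    for line in lines:
        try:
            stamps.append(parse_ts(line, year))
        except ValueError:
            continue  # skip lines without a leading syslog timestamp
    best = None
    for prev, cur in zip(stamps, stamps[1:]):
        gap = cur - prev
        if best is None or gap > best[0]:
            best = (gap, prev, cur)
    return best

if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real server.
    sample = [
        "May 21 18:12:58 fs1 kernel: example message",
        "May 21 18:13:02 fs1 systemd: example message",
        "May 22 09:01:10 fs1 kernel: example boot message",
    ]
    gap, before, after = largest_gap(sample)
    print(f"logging went quiet at {before}, resumed at {after} ({gap})")
```

Running this over several log files and seeing the same "went quiet" second in each would be consistent with the filesystem no longer accepting writes, though it does not prove the cause.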
I have no idea if that's it... and the cluster nodes that would have it happen, a few years ago, were ext4.
Crap - I just went to look on the system that died, and from sar, I see that it died between 18:10 and 18:20, and we found it unresponsive when I got in at 09:00. I'd think that was enough time to print something.
mark