On Wed, May 22, 2019 at 10:02 AM mark <m.roth at 5-cent.us> wrote:

> That seems unlikely. For one, I've seen that... but I *always* see entries
> in the log about the oom-killer being invoked. For another, this isn't a
> compute node, it's *only* a fileserver, serving projects, home
> directories, and backups (home-grown b/u, uses rsync), and backups don't
> start until well after midnight, and as we're business-hours only, there
> was less usage, and it does have 256G RAM....
>

I have two servers that would lock up like this occasionally, and if I let them sit at the console long enough, sometimes they would give a login prompt. It took a lot of time and frustration (these are prod servers), but I tracked it down to a problem in the XFS driver, as it never occurred on the systems with EXT4 filesystems.

The XFS driver would hang, preventing writes to the filesystem. I could identify exactly when that happened, as all system logging would suddenly stop at the same second. Then the OOM killer would come in and start killing off processes, but that wouldn't be in the logs on disk because nothing could be written to the filesystem.

I rolled the servers back to a 5xx series kernel and the issue didn't resurface. I recently let them boot the newer 9xx series kernels and I'm hoping the XFS issue is fixed.
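
For anyone wanting to check for the same symptom, here is a rough Python sketch of the "all logging stops at the same second" test. It is only an illustration, not the script used on these servers: it assumes traditional syslog timestamps ("May 22 03:14:07 ...") with no year, and the function names, the hostname, and the sample lines are made up for the example.

```python
from datetime import datetime, timedelta


def last_timestamps(logs, year=2019):
    """Map each log name to the timestamp of its last parseable line.

    `logs` is a dict of {name: iterable of lines}. Traditional syslog
    lines carry no year, so one is assumed via the `year` argument.
    """
    result = {}
    for name, lines in logs.items():
        for line in lines:
            try:
                # The first 15 characters hold "Mon DD HH:MM:SS".
                ts = datetime.strptime(f"{year} {line[:15]}",
                                       "%Y %b %d %H:%M:%S")
            except ValueError:
                continue  # skip lines that do not start with a timestamp
            result[name] = ts
    return result


def stopped_together(last, tolerance=timedelta(seconds=1)):
    """True if every log's final entry falls within `tolerance` of the others."""
    if not last:
        return False
    times = list(last.values())
    return max(times) - min(times) <= tolerance


# Hypothetical sample data standing in for the tails of real log files.
sample = {
    "messages": ["May 22 03:14:05 fs1 kernel: example line",
                 "May 22 03:14:07 fs1 systemd: example line"],
    "secure": ["May 22 03:10:00 fs1 sshd: example line",
               "May 22 03:14:07 fs1 sshd: example line"],
}
last = last_timestamps(sample)
print(last, stopped_together(last))
```

To try it against a real machine, replace `sample` with the lines read from files such as /var/log/messages and /var/log/secure, taken from before the reboot. Several unrelated logs ending within the same second, with no shutdown messages, would be consistent with the filesystem having stopped accepting writes at that moment. It does not prove the cause.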