[CentOS] system unresponsive

Thu May 23 17:03:29 UTC 2019
mark <m.roth at 5-cent.us>

Jon Pruente wrote:
> On Wed, May 22, 2019 at 10:02 AM mark <m.roth at 5-cent.us> wrote:
>> That seems unlikely. For one, I've seen that... but I *always* see
>> entries in the log about the oom-killer being invoked. For another, this
>> isn't a compute node, it's *only* a fileserver, serving projects, home
>> directories, and backups (home-grown b/u, uses rsync), and backups
>> don't start until well after midnight, and as we're business-hours only,
>> there was less usage, and it does have 256G RAM....
> I have two servers that would lock up like this occasionally, and if I
> let them sit at the console long enough sometimes they would give a login
> prompt. It took a lot of time and frustration (these are prod servers)
> but I tracked it down to a problem in the XFS driver, as it never occurred
> on the systems with EXT4 filesystems. The XFS driver would hang,
> preventing writes to the filesystem. I could identify exactly when that
> happened as all system logging would suddenly stop at the same second.
> Then OOMKiller would come in and start killing off processes, but that
> wouldn't be in the logs on disk because the file system couldn't write.
> I rolled the servers back to a 5xx series kernel and the issue didn't
> resurface. I recently let them boot the newer 9xx series kernels and
> I'm hoping the XFS issue is fixed.

I have no idea if that's it... and the cluster nodes that would have it
happen, a few years ago, were ext4.

Crap - I just went to look on the system that died, and from sar, I see
that it died between 18:10 and 18:20, and we found it unresponsive when I
got in at 09:00. I'd think that was enough time to print something.
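
For anyone wanting to pin down that kind of window without eyeballing it, here is a rough sketch of the idea: take the sample timestamps (from sar, syslog, whatever) and report any spot where consecutive entries are further apart than the sampling interval. The function name, the 10-minute interval, and the sample times below are all made up for illustration; they are not pulled from the actual box.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs where consecutive
    samples are further apart than the expected interval."""
    times = sorted(timestamps)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > expected]

# Hypothetical sample times, NOT real data from the server in question.
fmt = "%Y-%m-%d %H:%M"
samples = [datetime.strptime(s, fmt) for s in [
    "2019-05-22 17:50",
    "2019-05-22 18:00",
    "2019-05-22 18:10",
    "2019-05-23 09:10",  # first sample after the reboot
]]

gaps = find_gaps(samples)
for last_seen, next_seen in gaps:
    print(f"no samples between {last_seen} and {next_seen}")
```

The same check works on log timestamps if you suspect logging stopped before the box actually went down; only the parsing step would change.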