Re: [CentOS] system unresponsive

23 May 2019


      Jon Pruente wrote:
...
On Wed, May 22, 2019 at 10:02 AM mark m.roth@5-cent.us wrote:
...
That seems unlikely. For one, I've seen that... but I *always* see
entries in the log about the oom-killer being invoked. For another, this
isn't a compute node, it's *only* a fileserver, serving projects, home
directories, and backups (home-grown b/u, uses rsync), and backups
don't start until well after midnight, and as we're business-hours only,
there was less usage, and it does have 256G RAM....
I have two servers that would lock up like this occasionally, and if I
let them sit at the console long enough sometimes they would give a login
prompt. It took a lot of time and frustration (these are prod servers)
but I tracked it down to a problem in the XFS driver, as it never occurred
on the systems with EXT4 filesystems. The XFS driver would hang,
preventing writes to the filesystem. I could identify exactly when that
happened as all system logging would suddenly stop at the same second.
Then OOMKiller would come in and start killing off processes but that
wouldn't be in the logs on disk because the file system couldn't write. I
rolled the servers back to a 5xx series kernel and the issue didn't
resurface. I recently let them boot the newer 9xx series kernels and I'm
hoping the XFS issue is fixed.
I have no idea if that's it... and the cluster nodes that would have it
happen, a few years ago, were ext4.
Crap - I just went to look on the system that died, and from sar, I see
that it died between 18:10 and 18:20, and we found it unresponsive when I
got in at 09:00. I'd think that was enough time to print something.
mark
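
For anyone chasing the same symptom: the "all system logging would suddenly
stop at the same second" check mentioned in the thread could be roughly
scripted as below. This is only a sketch, not anything the posters ran. It
assumes classic syslog-style lines that begin with a timestamp such as
"May 22 18:14:03", assumes the year (syslog lines carry none), and the
function names, log names, and host name are made up for illustration.

```python
from datetime import datetime, timedelta


def last_timestamp(lines, year=2019):
    """Return the timestamp of the last line that starts with a classic
    syslog prefix such as 'May 22 18:14:03', or None if no line parses."""
    last = None
    for line in lines:
        try:
            # Traditional syslog lines carry no year, so one is assumed here.
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line; skip it
        last = ts
    return last


def stopped_together(last_by_log, tolerance=timedelta(seconds=1)):
    """True if every log's final entry falls within `tolerance` of the others,
    which would be consistent with writes to the filesystem stalling at once."""
    stamps = [ts for ts in last_by_log.values() if ts is not None]
    return len(stamps) > 1 and max(stamps) - min(stamps) <= tolerance


# Hypothetical sample data standing in for the contents of real log files.
logs = {
    "messages": [
        "May 22 18:14:02 fs1 kernel: example entry",
        "May 22 18:14:03 fs1 kernel: example entry",
    ],
    "secure": ["May 22 18:14:03 fs1 sshd[1234]: example entry"],
}
lasts = {name: last_timestamp(lines) for name, lines in logs.items()}
print(lasts, stopped_together(lasts))
```

If several independent logs all end within the same second, and that moment
falls inside the window where sar stops recording, that points at writes
stalling rather than at the individual services going quiet on their own.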
