Ok, we used to get this occasionally on cluster nodes, and we just got it on a fileserver (very bad). The system is discovered to be unresponsive: it doesn't ping, and plugging a console in, you can see that it's not dead, but there's nothing at all on the screen, nor does it respond to even <ctrl-alt-del>. The only answer is to power cycle it; it comes up fine.
Nothing in /var/log/dmesg or /var/log/messages. No abrts I can find. sar tells me it went unresponsive between 18:10 and 10:20 yesterday. Note that there are no further entries in sar, either, for yesterday, after the event, and nothing till I power cycled it.
Has anyone else seen this - I can't imagine it's only us - or have any thoughts?
C 7, 7.6.1810
mark
From the above description, I would normally say it sounds like hardware.
However, why do you say the system is not dead when you plug in a console, but there is nothing on the screen and it doesn't respond to control-alt-delete? To me that sounds like 'dead'. Usually the CPU is hard-locked, or the hardware went into 'over-heat' and put everything into a deep sleep, hoping it would cool down, but never woke up.
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos
In the past I've found that the console may have blanked (due to time) and when the system locked up/hung it won't unblank. Booting with "consoleblank=0" on the kernel command line will ensure that whatever is printed to the console (oops, panic, etc) will be there for you to see when you connect.
I've had intermittent success in that type of situation with the SysRq key(s), see https://en.wikipedia.org/wiki/Magic_SysRq_key . They do require that you have it configured/enabled ahead of time. If you access the console over a BMC/IPMI KVM session it can be very difficult, if not impossible, to enter the keystroke as well.
Good luck, Scott
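For reference, a minimal sketch of the two preparations described above, assuming a stock CentOS 7 box. The helper names and the sysctl.d file name are made up for illustration; grubby and the kernel.sysrq sysctl are standard tools, but check them against your own setup before relying on this.

```shell
# Sketch only; helper names are hypothetical, not from the thread.

# Print the value of a key=value kernel parameter found in a command-line
# string. Returns non-zero if the parameter is absent.
cmdline_param() {
    local key="$2" word
    set -f                      # no glob expansion while splitting $1
    for word in $1; do
        case "$word" in
            "$key"=*) set +f; printf '%s\n' "${word#*=}"; return 0 ;;
        esac
    done
    set +f
    return 1
}

# Report whether the running kernel was booted with a consoleblank setting.
check_consoleblank() {
    local val
    if val=$(cmdline_param "$(cat /proc/cmdline)" consoleblank); then
        echo "consoleblank=$val"
    else
        echo "consoleblank not set on the kernel command line"
    fi
}

# One-time setup, as root (file name 90-sysrq.conf is an arbitrary choice):
#   grubby --update-kernel=ALL --args="consoleblank=0"    # applies from next boot
#   echo "kernel.sysrq = 1" > /etc/sysctl.d/90-sysrq.conf # enable all SysRq functions
#   sysctl -p /etc/sysctl.d/90-sysrq.conf
```

After the next reboot, check_consoleblank should report consoleblank=0, and `sysctl kernel.sysrq` shows whether the SysRq keys are enabled.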
Hmmm... thanks. I'm sure I've heard about the magic SysRq, but had forgotten it and never used it. I'll try that if this happens again.
mark
Out of memory? We’ve definitely seen similar symptoms (it’s been a while, so I’m not sure they were identical) for compute nodes running large memory jobs.
Noam
That seems unlikely. For one, I've seen that... but I *always* see entries in the log about the oom-killer being invoked. For another, this isn't a compute node, it's *only* a fileserver, serving projects, home directories, and backups (home-grown b/u, uses rsync), and backups don't start until well after midnight, and as we're business-hours only, there was less usage, and it does have 256G RAM....
mark
Jon Pruente wrote:
I have two servers that would lock up like this occasionally, and if I let them sit at the console long enough sometimes they would give a login prompt. It took a lot of time and frustration (these are prod servers) but I tracked it down to a problem in the XFS driver, as it never occurred on the systems with EXT4 filesystems. The XFS driver would hang, preventing writes to the filesystem. I could identify exactly when that happened as all system logging would suddenly stop at the same second. Then OOMKiller would come in and start killing off processes but that wouldn't be in the logs on disk because the file system couldn't write. I rolled the servers back to a 5xx series kernel and the issue didn't resurface. I recently let them boot the newer 9xx series kernels and I'm hoping the XFS issue is fixed.
I have no idea if that's it... and the cluster nodes that would have it happen, a few years ago, were ext4.
Crap - I just went to look on the system that died, and from sar, I see that it died between 18:10 and 18:20, and we found it unresponsive when I got in at 09:00. I'd think that was enough time to print something.
mark
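As an illustration of reading the time of death out of sar, here is a rough sketch that prints the timestamp of the last CPU sample in a day's report. It assumes the report is produced with LC_ALL=C so that timestamps are 24-hour HH:MM:SS in the first column; the function name is invented for this example.

```shell
# Sketch only; last_sar_sample is a hypothetical helper name.
# Reads `sar -u` output on stdin and prints the timestamp of the last
# "all"-CPU sample. Header, Average: and blank lines are ignored.
last_sar_sample() {
    awk '$1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $2 == "all" { last = $1 }
         END { if (last != "") print last }'
}

# Usage (DD is the day of the month of the hang):
#   LC_ALL=C sar -u -f /var/log/sa/saDD | last_sar_sample
```

With the default 10-minute collection interval, the hang happened somewhere between the printed timestamp and the sample that should have followed it.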
You should be able to catch or monitor this by configuring syslog to print everything to a specific TTY, or by using its remote logging functionality.
Kind regards Thomas
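A minimal sketch of that suggestion for rsyslog (the default syslog daemon on CentOS 7), using its classic selector syntax. The function name, the drop-in file name, tty12 and loghost.example.com are all placeholders, not anything from the thread.

```shell
# Sketch only; write_hang_logging_conf is a hypothetical helper.
# $1: file to write, $2: remote log host
write_hang_logging_conf() {
    local out="$1" loghost="$2"
    cat > "$out" <<EOF
# Kernel messages to a spare virtual console
kern.*  /dev/tty12
# Everything to a remote collector over TCP (@@ = TCP, @ = UDP)
*.*     @@${loghost}:514
EOF
}

# As root:
#   write_hang_logging_conf /etc/rsyslog.d/hang-debug.conf loghost.example.com
#   systemctl restart rsyslog
```

Whatever reaches the remote host before the machine stops does not depend on the local filesystem still accepting writes, which matters if the hang is in the storage path.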
On 5/22/19 6:57 AM, Scott Silverman wrote:
In the past I've found that the console may have blanked (due to time) and when the system locked up/hung it won't unblank. Booting with "consoleblank=0" on the kernel command line will ensure that whatever is printed to the console (oops, panic, etc) will be there for you to see when you connect.
I would definitely start here. If the system locks and there's no oops printed to the screen, then you almost certainly have a hardware issue. If there *is* an oops printed, you might still have a hardware issue, but the oops will probably give you some direction on tracking it down.
At some point, you'll probably want to schedule as many hours of down time as possible and run memtest86+
I've had intermittent success in that type of situation with the SysRq key(s), see https://en.wikipedia.org/wiki/Magic_SysRq_key .
If the console isn't coming back on keyboard activity, then the system is probably hard-locked, and sysrq keys aren't going to work. Probably.
Stephen John Smoogen wrote:
From the above description, I would normally say it sounds like hardware. However, why do you say the system is not dead when you plug in a console, but there is nothing on the screen and it doesn't respond to control-alt-delete? To me that sounds like 'dead'. Usually the CPU is hard-locked, or the hardware went into 'over-heat' and put everything into a deep sleep, hoping it would cool down, but never woke up.
On Wed, May 22, 2019 at 10:22 AM mark m.roth@5-cent.us wrote:
It seems unlikely. It's a 4U server, with 36 disks (and the dual root disks), in a machine room, and ipmitool sel list shows nada, nor are there any warnings, as I've seen on other systems occasionally, that the CPU is overheating, and is being throttled.
If this is a recent server (Ivy Bridge/Haswell/Broadwell), then I’ve seen the “edac” kernel module prevent the SEL from showing faults when an MCE (machine check exception) occurs. Disable edac and, poof, the server stops crashing and/or the SEL shows something useful (ECC/MCE). Did you check /var/log/mcelog?
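A rough sketch of how one might check for loaded EDAC modules before trying this. The function name is invented, and sb_edac in the comments is only an example of a module that may show up on that hardware generation; use whatever the check actually reports.

```shell
# Sketch only; loaded_edac_modules is a hypothetical helper.
# Reads `lsmod` output on stdin and prints the names of loaded modules
# whose name contains "edac" (the header line is skipped).
loaded_edac_modules() {
    awk 'NR > 1 && $1 ~ /edac/ { print $1 }'
}

# Usage:
#   lsmod | loaded_edac_modules
#
# To keep a reported module (sb_edac here is an example) from loading at
# the next boot, as root; the file name is an arbitrary choice:
#   echo "blacklist sb_edac" > /etc/modprobe.d/blacklist-edac.conf
#
# Then compare what the BMC and mcelog recorded:
#   ipmitool sel list
#   ls -l /var/log/mcelog
```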
Has anyone else seen this - I can't imagine it's only us - or have any thoughts?
C 7, 7.6.1810
I saw such an issue recently and never found out what happened and why.
Regards, Simon