Stephen John Smoogen wrote:
On Wed, 22 May 2019 at 09:30, mark m.roth@5-cent.us wrote:
Ok, we used to get this occasionally on cluster nodes, and we just got it on a fileserver (very bad). The system is discovered to be unresponsive: it doesn't ping, and plugging a console in, you can see that it's not dead, but there is nothing at all on the screen, nor does it respond even to <ctrl-alt-del>. The only answer is to power cycle it; it comes up fine.
Nothing in /var/log/dmesg or /var/log/messages. No abrts I can find. sar tells me it went unresponsive between 18:10 and 10:20 yesterday. Note that there are no further entries in sar, either, for yesterday, after the event, and nothing till I power cycled it.
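For anyone wanting to pin down that kind of window without eyeballing the sar output, here is a rough sketch of the idea: walk the sample timestamps and flag any pair that is further apart than the collection interval. It assumes the timestamps have already been pulled out of sar as 24-hour HH:MM:SS strings from a single day, and that sampling runs every 10 minutes (the usual sysstat cron default, but check yours). The function name find_gaps and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected=timedelta(minutes=10), slack=1.5):
    """Return (last_seen, next_seen) pairs where two consecutive samples
    are further apart than expected * slack.

    Assumes 24-hour HH:MM:SS strings, in order, all from the same day.
    """
    times = [datetime.strptime(t, "%H:%M:%S") for t in timestamps]
    gaps = []
    for prev, cur in zip(times, times[1:]):
        if cur - prev > expected * slack:
            gaps.append((prev.strftime("%H:%M:%S"), cur.strftime("%H:%M:%S")))
    return gaps


if __name__ == "__main__":
    # Hypothetical sample times, not taken from the affected server.
    sample = ["17:50:01", "18:00:01", "18:10:01", "18:50:01"]
    for last_seen, next_seen in find_gaps(sample):
        print(f"no samples between {last_seen} and {next_seen}")
```

The slack factor is only there so that a sample arriving a few seconds late is not reported as a gap.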
From the above description, I would normally say it sounds like hardware. However, why do you say the system is not dead when you plug in a console, if there is nothing on the screen and it doesn't respond to control-alt-delete? To me that sounds like 'dead'. Usually the CPU is hard-locked, or the hardware went into 'over-heat' and put everything into a deep sleep, hoping it would cool down, but never woke up.
It seems unlikely. It's a 4U server, with 36 disks (and the dual root disks), in a machine room, and ipmitool sel list shows nada, nor are there any warnings, as I've seen on other systems occasionally, that the CPU is overheating, and is being throttled.
Has anyone else seen this - I can't imagine it's only us - or have any thoughts?
CentOS 7, 7.6.1810
mark
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos
-- Stephen J Smoogen.