[CentOS] Cause for kernel panic

Thu Dec 15 04:13:25 UTC 2011
Ross Walker <rswwalker at gmail.com>

On Dec 14, 2011, at 10:49 PM, Nicolas Ross <rossnick-lists at cybercat.ca> wrote:

> Hi ! On an 8-node cluster, one of the nodes had a kernel panic.
> 
> The only bit of information I have is from an ssh console I had open, which 
> said :
> 
> 
> Message from syslogd at node108 at Dec 14 19:00:15 ...
>  kernel:------------[ cut here ]------------
> 
> Message from syslogd at node108 at Dec 14 19:00:15 ...
>  kernel:invalid opcode: 0000 [#1] SMP
> 
> Message from syslogd at node108 at Dec 14 19:00:15 ...
>  kernel:last sysfs file: 
> /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
> 
> Message from syslogd at node108 at Dec 14 19:00:15 ...
>  kernel:Stack:
> 
> Message from syslogd at node108 at Dec 14 19:00:15 ...
>  kernel:Call Trace:
> 
> Message from syslogd at node108 at Dec 14 19:00:15 ...
>  kernel:Code: 01 00 00 e8 26 8a cd e0 85 c0 0f 85 0e ff ff ff 48 89 df 
> e8 76 f8 ff ff e9 01 ff ff ff 31 d2 eb d4 48 89 de 31 ff e8 c3 e3 ff ff 
> <0f> 0b eb fe 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48
> 
> Message from syslogd at node108 at Dec 14 19:00:15 ...
>  kernel:Kernel panic - not syncing: Fatal exception
> 
> 
> From this, is there a way to determine the cause ? kdump is not 
> configured nor used, since the fencing of the node renders kdump useless.
> 
> This is the second time in a few weeks that this has happened.

Set up netconsole to log kernel messages to the node on the "left". Then you can get the oops messages if any node crashes.
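
For the receiving side, here is a minimal sketch of a listener, assuming the crashing node sends its kernel messages over UDP to port 6666 (netconsole's default target port). On the sending node the module takes a parameter of the form netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]; the interface and addresses are site-specific and not given in this thread. The function names below (format_record, open_listener, log_packets) are made up for illustration, and any syslog daemon or netcat listening on UDP would do the same job.

```python
import socket
import time


def format_record(sender_ip, payload, now=None):
    """Return one log line: local timestamp, sending node, message text."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    # Kernel messages are normally ASCII; replace anything undecodable
    # rather than failing in the middle of an oops.
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{stamp} {sender_ip} {text}"


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is assumed to match the sender's target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def log_packets(sock, out, max_packets=None):
    """Write each received datagram to `out`, tagged with the sender's IP.

    Runs forever when max_packets is None; returns the number of
    datagrams written otherwise.
    """
    count = 0
    while max_packets is None or count < max_packets:
        payload, (sender_ip, _src_port) = sock.recvfrom(65535)
        out.write(format_record(sender_ip, payload) + "\n")
        out.flush()  # flush at once so nothing is lost if this node dies too
        count += 1
    return count
```

To run it on the "left" node, call log_packets(open_listener(), open("/var/log/netconsole.log", "a")); the log path is only an example. Since UDP is unreliable, a panic may still arrive truncated, but the call trace usually gets out before the node is fenced.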

-Ross