Hi! On an 8-node cluster, one of the nodes had a kernel panic.
The only information I have is from an SSH console I had open, which showed:
Message from syslogd@node108 at Dec 14 19:00:15 ... kernel:------------[ cut here ]------------
Message from syslogd@node108 at Dec 14 19:00:15 ... kernel:invalid opcode: 0000 [#1] SMP
Message from syslogd@node108 at Dec 14 19:00:15 ... kernel:last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
Message from syslogd@node108 at Dec 14 19:00:15 ... kernel:Stack:
Message from syslogd@node108 at Dec 14 19:00:15 ... kernel:Call Trace:
Message from syslogd@node108 at Dec 14 19:00:15 ... kernel:Code: 01 00 00 e8 26 8a cd e0 85 c0 0f 85 0e ff ff ff 48 89 df e8 76 f8 ff ff e9 01 ff ff ff 31 d2 eb d4 48 89 de 31 ff e8 c3 e3 ff ff <0f> 0b eb fe 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48
Message from syslogd@node108 at Dec 14 19:00:15 ... kernel:Kernel panic - not syncing: Fatal exception
From this, is there a way to determine the cause? kdump is neither configured nor used, since the fencing of the node renders kdump useless.
This is the second time in a few weeks that this has happened.
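One thing that can be read from the output above: on the "Code:" line, the byte in angle brackets is the one the CPU was executing when it faulted. Here that is 0f followed by 0b, which on x86 is the ud2 instruction, the one the kernel's BUG() macro emits, so this looks like a failed kernel assertion rather than random corruption. Which assertion it was cannot be told without the call trace. A minimal sketch for pulling those bytes out of such a line (the function name is only illustrative):

```python
import re

def faulting_bytes(code_line, count=2):
    """Return `count` hex bytes starting at the one marked <xx> in an oops 'Code:' line."""
    _, sep, rest = code_line.partition("Code:")
    if not sep:
        return []  # not a Code: line
    tokens = rest.split()
    for i, tok in enumerate(tokens):
        m = re.fullmatch(r"<([0-9a-fA-F]{2})>", tok)
        if m:
            following = [t.lower() for t in tokens[i + 1:i + count]]
            return [m.group(1).lower()] + following
    return []  # no marked byte found

# Tail of the Code: line from the console output above.
line = "kernel:Code: 31 d2 eb d4 48 89 de 31 ff e8 c3 e3 ff ff <0f> 0b eb fe 66 66"
print(faulting_bytes(line))
```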
On 12/14/2011 8:49 PM, Nicolas Ross wrote:
From this, is there a way to determine the cause? kdump is neither configured nor used, since the fencing of the node renders kdump useless.
This is the second time in a few weeks that this has happened.
/var/log/messages should have more information; could you include it?
No, unfortunately. The last message in the log before the crash is a normal one, and after that it's the boot process.
I will look at netconsole as Ross suggested.
Regards,
Hello Corey,
On Wed, 2011-12-14 at 20:50 -0700, Corey Henderson wrote:
/var/log/messages should have more information; could you include it?
Please do not ask people to post log files or other attachments to a public mailing list! Information like that should be pasted online (e.g. at http://pastebin.com/ ) and a link to the resource should be posted instead.
Regards, Leonard.
On Thursday, December 15, 2011 09:51:52 AM Leonard den Ottolander wrote:
Please do not ask people to post log files or other attachments to a public mailing list! Information like that should be pasted online (e.g. at http://pastebin.com/ ) and a link to the resource should be posted instead.
I must disagree with this; for IRC this is appropriate, since typical IRC chat logs are not indexed by google and the like, nor are questioners encouraged to read the archives of the IRC logs.
I can't count the times I've searched for a solution to a problem, found someone with the same issue posting online, tracked down some potential solution, only to find that the pastebin referenced as having the solution was no longer there.
Ditto for links to fixes on rapidshare, megaupload, googledocs, and kin. It would be nice to excerpt logs and fixes for future searching through google or directly through the archives.
Or, to put it more bluntly, you shouldn't tell people to search the archives but then have people put essential data on an ephemeral resource that is dissociated from the archive.
IMHO, of course.
On Dec 14, 2011, at 10:49 PM, Nicolas Ross rossnick-lists@cybercat.ca wrote:
From this, is there a way to determine the cause? kdump is neither configured nor used, since the fencing of the node renders kdump useless.
This is the second time in a few weeks that this has happened.
Set up netconsole on each node to log kernel messages to the node on its "left". Then you can get the oops messages if any node crashes.
-Ross
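
For the receiving side of the netconsole suggestion, a syslog daemon or netcat listening on UDP is the usual choice. The following is only a minimal sketch of the same idea in Python, assuming the sending node targets UDP port 6666 (netconsole's documented default target port); the log file name and the function names are placeholders, not anything from this thread.

```python
import socket

def format_record(data, addr):
    """Prefix one received datagram with the sender's IP address."""
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return f"{addr[0]}: {text}"

def listen(bind_addr="0.0.0.0", port=6666, logfile="netconsole.log"):
    """Append every UDP datagram received on `port` to `logfile`, forever."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    # Line buffering, so the last messages before a crash reach the file.
    with open(logfile, "a", buffering=1) as out:
        while True:
            data, addr = sock.recvfrom(4096)
            out.write(format_record(data, addr) + "\n")
```

Calling listen() on the "left" node starts the collector; it blocks until interrupted. Each node then needs the netconsole module loaded with that neighbour's address as its target.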