[CentOS] troubleshooting kernel crash?

Sat Apr 29 14:39:29 UTC 2006
Kai Schaetzl <maillists at conactive.com>

I've been getting kernel crashes on a new machine ever since it went into 
production a few weeks ago. How can I troubleshoot this? I suspect a 
hardware problem, but I know too little about the kernel and kernel 
debugging on Linux to pin it down with debugging tools.

For reference, here are the last two oopses. The symptom is that the 
machine stops responding but still answers pings. After a reset all is 
well until the next crash, roughly ten days later.

Apr 29 15:05:07 nx05 kernel: c014802d
Apr 29 15:05:07 nx05 kernel: Modules linked in: nls_utf8 cifs smbfs ipt_REJECT ipt_limit ipt_state ipt_LOG iptable_filter ip_tables ip_conntrack_ftp ip_conntrack md5 ipv6 autofs4 sunrpc dm_mirror dm_mod button battery ac 8139too mii ext3 jbd ata_piix libata sd_mod scsi_mod
Apr 29 15:05:07 nx05 kernel: CPU:    0
Apr 29 15:05:07 nx05 kernel: EIP:    0060:[<c014802d>]    Not tainted VLI
Apr 29 15:05:07 nx05 kernel: EFLAGS: 00010006   (2.6.9-34.EL)
Apr 29 15:05:07 nx05 kernel: EIP is at find_get_page+0x73/0xdd
Apr 29 15:05:07 nx05 kernel: eax: 00000200   ebx: df002bf4   ecx: 00000200   edx: 00000200
Apr 29 15:05:07 nx05 kernel: esi: df002bf4   edi: 00000000   ebp: c1b64d1c   esp: c1b64cf0
Apr 29 15:05:07 nx05 kernel: ds: 007b   es: 007b   ss: 0068
Apr 29 15:05:07 nx05 kernel: Process httpd (pid: 24805, threadinfo=c1b64000 task=d308ad50)
Apr 29 15:05:07 nx05 kernel: Stack: 00000000 c016a705 0029005a 00000000 00000001 00000000 00000000 df002b14
Apr 29 15:05:07 nx05 kernel:        00000000 0029005a 00000000 df002a80 c016bdb7 00001000 00000000 df73b800
Apr 29 15:05:07 nx05 kernel:        0029005a 00000000 df002a80 c016bde6 00001000 df73b800 cfc14bf4 00000000
Apr 29 15:05:07 nx05 kernel: Call Trace:
Apr 29 15:05:07 nx05 kernel:  [<c016a705>] __find_get_block_slow+0x4b/0x1c6
Apr 29 15:05:07 nx05 kernel:  [<c016bdb7>] __find_get_block+0x89/0xa5
Apr 29 15:05:07 nx05 kernel:  [<c016bde6>] __getblk+0x13/0x49
Apr 29 15:05:07 nx05 kernel:  [<e09298e3>] ext3_get_inode_loc+0x4f/0x223 [ext3]
Apr 29 15:05:07 nx05 kernel:  [<e0929b45>] ext3_read_inode+0x38/0x309 [ext3]
Apr 29 15:05:07 nx05 kernel:  [<c030fbf0>] __cond_resched+0x14/0x3b
Apr 29 15:05:07 nx05 kernel:  [<e092e1b7>] ext3_alloc_inode+0xf/0x46 [ext3]
Apr 29 15:05:07 nx05 kernel:  [<c0185596>] alloc_inode+0xf6/0x17f
Apr 29 15:05:07 nx05 kernel:  [<c018662e>] get_new_inode_fast+0xa5/0x1e9
Apr 29 15:05:07 nx05 kernel:  [<e092b9a0>] ext3_lookup+0x55/0x87 [ext3]
Apr 29 15:05:07 nx05 kernel:  [<c0177d32>] real_lookup+0x73/0xde
Apr 29 15:05:07 nx05 kernel:  [<c0178062>] do_lookup+0x56/0x8f
Apr 29 15:05:07 nx05 kernel:  [<c0178ad4>] __link_path_walk+0xa39/0xd98
Apr 29 15:05:07 nx05 kernel:  [<c0178e74>] link_path_walk+0x41/0xb9
Apr 29 15:05:07 nx05 kernel:  [<c011e867>] autoremove_wake_function+0x0/0x2d
Apr 29 15:05:07 nx05 kernel:  [<c017916c>] path_lookup+0x104/0x135
Apr 29 15:05:07 nx05 kernel:  [<c01792b1>] __user_walk+0x21/0x51
Apr 29 15:05:07 nx05 kernel:  [<c01734b8>] vfs_stat+0x14/0x3a
Apr 29 15:05:07 nx05 kernel:  [<c011e867>] autoremove_wake_function+0x0/0x2d
Apr 29 15:05:07 nx05 kernel:  [<c0173ac1>] sys_stat64+0xf/0x23
Apr 29 15:05:07 nx05 kernel:  [<c0168b76>] vfs_read+0xda/0xe2
Apr 29 15:05:07 nx05 kernel:  [<c0168d65>] sys_read+0x3c/0x62
Apr 29 15:05:07 nx05 kernel:  [<c0311443>] syscall_call+0x7/0xb
Apr 29 15:05:07 nx05 kernel:  [<c031007b>] rwsem_down_read_failed+0x19f/0x204
Apr 29 15:05:07 nx05 kernel: Code: c0 e8 77 8d fd ff c7 43 14 01 00 00 00 8d 43 04 c7 43 20 54 1f 32 c0 c7 43 24 0e 02 00 00 e8 1e c5 09 00 85 c0 89 c1 74 0f 89 c2 <8b> 00 f6 c4 80 74 03 8b 51 0c ff 42 04 81 7b 10 3c 4b 24 1d 74

Apr 20 03:55:31 nx05 kernel: c0167d59
Apr 20 03:55:31 nx05 kernel: Modules linked in: smbfs ipt_REJECT ipt_limit ipt_state ipt_LOG ip_conntrack_ftp ip_conntrack iptable_filter ip_tables parport_pc lp parport md5 ipv6 autofs4 sunrpc dm_mirror dm_mod button battery ac 8139too mii ext3 jbd ata_piix libata sd_mod scsi_mod
Apr 20 03:55:31 nx05 kernel: CPU:    0
Apr 20 03:55:31 nx05 kernel: EIP:    0060:[<c0167d59>]    Not tainted VLI
Apr 20 03:55:31 nx05 kernel: EFLAGS: 00010286   (2.6.9-34.EL)
Apr 20 03:55:31 nx05 kernel: EIP is at __dentry_open+0x62/0x16a
Apr 20 03:55:31 nx05 kernel: eax: a093ffa0   ebx: dfab0680   ecx: 0000000d   edx: dfe51100
Apr 20 03:55:31 nx05 kernel: esi: cd1dd78c   edi: dfe51100   ebp: 00000000   esp: dda19f30
Apr 20 03:55:31 nx05 kernel: ds: 007b   es: 007b   ss: 0068
Apr 20 03:55:31 nx05 kernel: Process mysqld (pid: 12922, threadinfo=dda19000 task=dee2edd0)
Apr 20 03:55:31 nx05 kernel: Stack: cae8a978 00000000 dfab0680 00008000 00000000 c0167c96 dfab0680 caeac000
Apr 20 03:55:31 nx05 kernel:        cae8a978 dfe51100 dda19f58 dec82680 dda19f88 00000101 00000001 00000000
Apr 20 03:55:31 nx05 kernel:        00001000 b0d4e780 dda19f80 c030fbf0 caeac000 c01e67f2 00000000 00000039
Apr 20 03:55:31 nx05 kernel: Call Trace:
Apr 20 03:55:31 nx05 kernel:  [<c0167c96>] filp_open+0x5c/0x70
Apr 20 03:55:31 nx05 kernel:  [<c030fbf0>] __cond_resched+0x14/0x3b
Apr 20 03:55:31 nx05 kernel:  [<c01e67f2>] direct_strncpy_from_user+0x3e/0x5d
Apr 20 03:55:31 nx05 kernel:  [<c016819f>] sys_open+0x31/0x7d
Apr 20 03:55:31 nx05 kernel:  [<c0311443>] syscall_call+0x7/0xb
Apr 20 03:55:31 nx05 kernel:  [<c031007b>] rwsem_down_read_failed+0x19f/0x204
Apr 20 03:55:31 nx05 kernel: Code: dc 00 00 00 89 83 a0 00 00 00 8b 04 24 89 7b 0c c7 43 24 00 00 00 00 89 43 08 c7 43 28 00 00 00 00 8b 86 d0 00 00 00 85 c0 74 19 <8b> 00 85 c0 74 0b 83 38 02 74 0e ff 80 00 01 00 00 8b 86 d0 00
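
To compare the crash sites side by side, a small script along these lines 
could pull the faulting function and the process out of the syslog lines. 
This is only a sketch: the function name summarize_oopses and the regular 
expressions are my own assumptions, fitted to the log format quoted above, 
and are not part of any kernel tooling.

```python
import re

# Patterns fitted to the 2.6.9 i386 oops lines quoted in this post (assumption).
EIP_RE = re.compile(r"kernel: EIP is at (\S+)\+0x[0-9a-f]+/0x[0-9a-f]+")
PROC_RE = re.compile(r"kernel: Process (\S+) \(pid: (\d+)")


def summarize_oopses(lines):
    """Return one dict per oops: timestamp, faulting function, process, pid."""
    oopses = []
    current = None
    for line in lines:
        m = EIP_RE.search(line)
        if m:
            # An "EIP is at" line starts a new oops record.
            current = {
                "when": line[:15],  # syslog timestamp, e.g. "Apr 29 15:05:07"
                "function": m.group(1),
                "process": None,
                "pid": None,
            }
            oopses.append(current)
            continue
        m = PROC_RE.search(line)
        if m and current is not None and current["process"] is None:
            # The first "Process" line after the EIP line belongs to that oops.
            current["process"] = m.group(1)
            current["pid"] = int(m.group(2))
    return oopses


if __name__ == "__main__":
    sample = [
        "Apr 29 15:05:07 nx05 kernel: EIP is at find_get_page+0x73/0xdd",
        "Apr 29 15:05:07 nx05 kernel: Process httpd (pid: 24805, ",
        "Apr 20 03:55:31 nx05 kernel: EIP is at __dentry_open+0x62/0x16a",
        "Apr 20 03:55:31 nx05 kernel: Process mysqld (pid: 12922, ",
    ]
    for oops in summarize_oopses(sample):
        print(oops["when"], oops["function"], oops["process"], oops["pid"])
```

For the two oopses above this reports find_get_page in httpd and 
__dentry_open in mysqld. In practice the lines would come from the system 
log rather than a hard-coded sample.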

Kai