[CentOS] Centos 5.6 Kernel Panics

Wed Apr 4 18:10:39 UTC 2012
Paul (Crunch) <numbercruncher245 at gmail.com>

On 04/04/2012 12:31 PM, Nataraj wrote:
> On 04/04/2012 09:16 AM, Jonathan Alstead wrote:
>> Hello,
>>
>> Recently our dell sc1425 server has been locking up with kernel freezes
>> and required a hard reboot on each occasion. I've looked on the centos
>> forums with limited success - each problem seems slightly different
>> (some failure on high load, some not). Our kernel is 2.6.18-274.17.1.el5
>> and /var/log/messages show the following errors:
>>
>> Apr  3 12:41:25 sp2 kernel: INFO: task mysqld:15345 blocked for more
>> than 120 seconds.
>> Apr  3 12:41:25 sp2 kernel: "echo 0>
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Apr  3 12:41:25 sp2 kernel: mysqld        D 00000CEB  2524 15345  32083
>>           15346 15167 (NOTLB)
>> Apr  3 12:41:25 sp2 kernel:        c50c7f54 00000082 bf379c08 00000ceb
>> ca9b1648 f43c6c5c 00000000 00000001
>> Apr  3 12:41:25 sp2 kernel:        d9d18000 bf384f01 00000ceb 0000b2f9
>> 00000001 d9d1810c c2013ac4 edc5de40
>> Apr  3 12:41:25 sp2 kernel:        08515c98 c6cb37b8 c2014464 c200cc80
>> 00000020 00000000 00000000 00000000
>> Apr  3 12:41:25 sp2 kernel: Call Trace:
>> Apr  3 12:41:25 sp2 kernel:  [<c0622f16>]
>> rwsem_down_write_failed+0x126/0x141
>> Apr  3 12:41:25 sp2 kernel:  [<c0439989>] .text.lock.rwsem+0x2b/0x3a
>> Apr  3 12:41:25 sp2 kernel:  [<c046aa6a>] sys_mprotect+0xbd/0x1eb
>>
>> Apr  3 12:41:25 sp2 kernel:  [<c0404f4b>] syscall_call+0x7/0xb
>>
>> Apr  3 12:41:25 sp2 kernel:  =======================
>> Apr  3 12:41:25 sp2 kernel: INFO: task clamd:15721 blocked for more than
>> 120 seconds.
>> Apr  3 12:41:26 sp2 kernel: "echo 0>
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Apr  3 12:41:26 sp2 kernel: clamd         D 00000D49  2528 15721      1
>>           16416 15449 (NOTLB)
>> Apr  3 12:41:26 sp2 kernel:        e848cf74 00000086 8f107b57 00000d49
>> 30ea2005 e848cf44 c08259d0 00000007
>> Apr  3 12:41:26 sp2 kernel:        e8c6aaa0 8f117848 00000d49 0000fcf1
>> 00000000 e8c6abac c200cc80 f4f5f3c0
>> Apr  3 12:41:26 sp2 kernel:        c041f863 00000184 c200d620 c2013ac4
>> 00000020 00000000 d887f0a8 f766f0c0
>> Apr  3 12:41:26 sp2 kernel: Call Trace:
>> Apr  3 12:41:26 sp2 kernel:  [<c041f863>] default_wake_function+0x0/0xc
>> Apr  3 12:41:26 sp2 kernel:  [<c048e994>] destroy_inode+0x38/0x47
>> Apr  3 12:41:26 sp2 kernel:  [<c0622f16>]
>> rwsem_down_write_failed+0x126/0x141
>> Apr  3 12:41:26 sp2 kernel:  [<c0439989>] .text.lock.rwsem+0x2b/0x3a
>> Apr  3 12:41:26 sp2 kernel:  [<c046a32b>] sys_munmap+0x24/0x41
>>
>> Apr  3 12:41:26 sp2 kernel:  [<c0404f4b>] syscall_call+0x7/0xb
>>
> It sounds like some kind of IO or memory problem.  I would probably
> start by running MEMTEST and the basic diagnostic tests provided by
> Dell. If you don't have them installed on your disk, they can be
> downloaded as a CentOS-based OpenManage live CD from somewhere on the
> Dell site.  It could also be a disk problem, but from the output you
> provide I think I would look for memory or IO bus problems first and
> then look for disk problems if you don't find anything with the first
> two.  It almost looks like a memory controller problem.
>
> Nataraj
>
I'm inclined to agree with Nataraj: run a memory test first and foremost.
Also check the file system for corruption. It's unlikely, but the
on-disk kernel image could have been damaged by data corruption.
Unfortunately, I don't know of a practical way to test the buses, or
even the CPU, aside from swapping hardware out.
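
While the hardware tests run, it may also help to see whether the hangs
cluster around particular tasks or times. The following is only a rough
sketch, not a tested tool: it assumes the syslog line format shown in the
quoted /var/log/messages excerpt, and the function names
(find_hung_tasks, summarize) are made up for illustration.

```python
import re
from collections import Counter

# Matches unwrapped syslog lines such as:
#   Apr  3 12:41:25 sp2 kernel: INFO: task mysqld:15345 blocked for more than 120 seconds.
# The greedy name group allows task names that themselves contain a colon.
HUNG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more"
)


def find_hung_tasks(lines):
    """Return a (timestamp, task name, pid) tuple for each hung-task report."""
    hits = []
    for line in lines:
        match = HUNG_RE.match(line)
        if match:
            hits.append(
                (match.group("stamp"), match.group("name"), int(match.group("pid")))
            )
    return hits


def summarize(hits):
    """Count how many hung-task reports each task name produced."""
    return Counter(name for _stamp, name, _pid in hits)


# Hypothetical usage against the live log (path taken from the original post):
#   with open("/var/log/messages", errors="replace") as fh:
#       hits = find_hung_tasks(fh)
#   for name, count in summarize(hits).most_common():
#       print(name, count)
```

If the same few timestamps show unrelated tasks (mysqld and clamd here)
all stuck at once, that would fit a system-wide stall rather than a bug
in one program, though it does not by itself say which component is at
fault.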