[CentOS] unstable kernel after update to CentOS 4.5

Mon Dec 10 12:23:26 UTC 2007
Kai Schaetzl <maillists at conactive.com>

On Saturday I finally upgraded a machine from CentOS 4.3 (I think)
to 4.5 via yum. It seemed to go fine. However, during the following
night /home got mounted read-only because of an EXT3-fs error. The
same happened the next night. Also, today, I saw the first-ever
kernel crash on this machine.
The machine is about three years old or so, went into production
two years ago with CentOS 4.1 or so and has been rock stable since
then. No fs errors, no kernel crashes, no other "weird" occurrences.
As the problems are now happening right after upgrading to a new 
kernel I rather suspect a bug in the kernel (or some module) than
a hardware problem. No RAID, no LVM, a few partitions on an IDE disk.
I didn't file it as a bug yet. I want to first gather some more 
information or get some help.
Here are some details.

Kernel was updated from 2.6.9-34.0.2.EL to 2.6.9-55.0.12.EL.
There is not a single package update missing now.

Dec  9 04:30:35 nx10 kernel: EXT3-fs error (device hda3): htree_dirblock_to_tree: bad entry in directory #1330023: rec_len % 4 != 0 - offset=10264, inode=808542775, rec_len=13621, name_len=100
Dec  9 04:30:35 nx10 kernel: Aborting journal on device hda3.
Dec  9 04:30:35 nx10 kernel: ext3_abort called.
Dec  9 04:30:35 nx10 kernel: EXT3-fs error (device hda3): ext3_journal_start_sb: Detected aborted journal
Dec  9 04:30:35 nx10 kernel: Remounting filesystem read-only
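
The failed check in the first log line can be reproduced by hand. The sketch below is only an illustration: it takes the three values from the log and assumes the usual ext2/ext3 directory entry header layout (little-endian 32-bit inode, 16-bit rec_len, 8-bit name_len). The function names are made up for this example.

```python
import struct

# Values copied from the EXT3-fs error line above.
inode, rec_len, name_len = 808542775, 13621, 100

def failed_alignment(rec_len):
    """True when rec_len is not a multiple of 4 (the check named in the log)."""
    return rec_len % 4 != 0

def as_disk_bytes(inode, rec_len, name_len):
    """Re-encode the fields as they would sit on disk.

    Assumed layout: little-endian u32 inode, u16 rec_len, u8 name_len.
    """
    return struct.pack("<IHB", inode, rec_len, name_len)

print(rec_len % 4)                                # 1
print(as_disk_bytes(inode, rec_len, name_len))    # b'7b1055d'
```

All seven bytes come out as printable ASCII, which may hint that the block contains file data rather than a directory entry. By itself that does not say whether the kernel or the hardware put it there.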

The second error, tonight, happened about 5 minutes earlier,
with exactly the same directory inode.
http://www.google.de/search?as_q=centos+rec_len+4+0&hl=de&num=30&btnG=Google-Suche&as_epq=bad+entry+in+directory&as_oq=&as_eq=&lr=&cr=&as_ft=i&as_filetype=&as_qdr=all&as_occt=any&as_dt=i&as_sitesearch=&as_rights=&safe=images
shows that this error is very rare (I also tried the search with fedora
and got a few more hits).
It seems to be related to heavy disk I/O, but only under certain (hardware?)
circumstances, and may be a bug that was introduced in some Fedora kernel
and crept into RHEL/CentOS 4.4/4.5.
Once this happens, that filesystem (in my case /home) is read-only and
the machine just hangs when one tries to shut down (probably when
unmounting) or remount ro (for a file system check). After a hard reset
the automatic fsck output in dmesg lists only a few orphan inode cleanups.
Also, I found that dmesg gives me output from the iptables logging
(which is on kern.=debug) before the problem is fixed with a reset.
Can I use debugfs safely on that system while mounted? I'm not familiar
with it and just stumbled over a mention of it. I tried it on a machine
here on a mounted device and there was no problem. That other machine is
in a remote data center, so options are a bit limited.
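
Assuming the tool meant here is debugfs from e2fsprogs, a cautious way to use it is a single read-only request. The sketch below only builds the command line by default. The helper names are hypothetical. The options used are -R (run one request and exit) and the ncheck request (map an inode number to a path name). No -w is passed, so debugfs opens the device read-only. Results read from a mounted filesystem may be stale because of kernel caching.

```python
import subprocess

def debugfs_cmd(device, request):
    """Build a debugfs invocation that runs one request and exits.

    No -w flag is passed, so debugfs opens the filesystem read-only.
    """
    return ["debugfs", "-R", request, device]

def inode_to_path(device, inode, run=False):
    """Ask debugfs which path name(s) belong to an inode (ncheck request)."""
    cmd = debugfs_cmd(device, f"ncheck {inode}")
    if not run:
        return cmd  # dry run: only show what would be executed
    # Actually running this needs root and an installed debugfs binary.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

# Device and inode number taken from the log lines above.
print(" ".join(inode_to_path("/dev/hda3", 1330023)))
```

With run=True on the affected machine this would report which directory inode 1330023 is, which could help to see what is writing there around 04:30.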

The kernel crash from today starts like this:
Dec 10 10:30:01 nx10 kernel: Unable to handle kernel paging request at virtual address 8f38df23
Dec 10 10:30:01 nx10 kernel:  printing eip:
Dec 10 10:30:01 nx10 kernel: c019190b
Dec 10 10:30:01 nx10 kernel: *pde = 00000000
Dec 10 10:30:01 nx10 kernel: Oops: 0000 [#1]
Dec 10 10:30:01 nx10 kernel: Modules linked in: ipt_REJECT ipt_limit ipt_state ipt_LOG iptable_filter 
ip_tables ip_conntrack_ftp ip_conntrack md5 ipv6 autofs4 i2c_dev i2c_core sunrpc dm_mirror dm_mod button 
battery ac 8139too mii ext3 jbd ata_piix libata sd_mod scsi_mod
Dec 10 10:30:01 nx10 kernel: CPU:    0
Dec 10 10:30:01 nx10 kernel: EIP:    0060:[<c019190b>]    Not tainted VLI
Dec 10 10:30:01 nx10 kernel: EFLAGS: 00010282   (2.6.9-55.0.12.EL)
Dec 10 10:30:01 nx10 kernel: EIP is at seq_escape+0x21/0xaa
Dec 10 10:30:01 nx10 kernel: eax: 8f38df23   ebx: c0370260   ecx: d35a9151   edx: d35aa000
Dec 10 10:30:01 nx10 kernel: esi: c518c200   edi: c518c200   ebp: c032f9d9   esp: c63d9f28
Dec 10 10:30:01 nx10 kernel: ds: 007b   es: 007b   ss: 0068
Dec 10 10:30:01 nx10 kernel: Process mv (pid: 16585, threadinfo=c63d9000 task=cee5a1b0)
Dec 10 10:30:01 nx10 kernel: Stack: d35aa000 8f38df23 c0370260 c518c200 dfe08982 00000000 c018e0e3 c03702c0
Dec 10 10:30:01 nx10 kernel:        c518c200 00000000 dfe08982 c019157f 00000151 00000000 00000400 b7fd5000
Dec 10 10:30:01 nx10 kernel:        0000000c 00000000 0000000b 00000000 c0371300 cea00b80 00000400 c63d9fac
Dec 10 10:30:01 nx10 kernel: Call Trace:
Dec 10 10:30:01 nx10 kernel:  [<c018e0e3>] show_vfsmnt+0x28/0xf5
Dec 10 10:30:01 nx10 kernel:  [<c019157f>] seq_read+0x1c3/0x2bd
Dec 10 10:30:01 nx10 kernel:  [<c016c91b>] vfs_read+0xb6/0xe2
Dec 10 10:30:01 nx10 kernel:  [<c016cb30>] sys_read+0x3c/0x62
Dec 10 10:30:01 nx10 kernel:  [<c031b777>] syscall_call+0x7/0xb
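
The call trace lists the innermost caller first, so it is easier to follow when reversed. The following is a small sketch that pulls the function names out of the lines above and prints the chain from the system call down to the faulting function. The regular expressions and the call_chain helper are made up for this illustration and only cover the log format shown here.

```python
import re

# Sample lines copied from the oops above.
OOPS = """\
Dec 10 10:30:01 nx10 kernel: EIP is at seq_escape+0x21/0xaa
Dec 10 10:30:01 nx10 kernel: Call Trace:
Dec 10 10:30:01 nx10 kernel:  [<c018e0e3>] show_vfsmnt+0x28/0xf5
Dec 10 10:30:01 nx10 kernel:  [<c019157f>] seq_read+0x1c3/0x2bd
Dec 10 10:30:01 nx10 kernel:  [<c016c91b>] vfs_read+0xb6/0xe2
Dec 10 10:30:01 nx10 kernel:  [<c016cb30>] sys_read+0x3c/0x62
Dec 10 10:30:01 nx10 kernel:  [<c031b777>] syscall_call+0x7/0xb
"""

# One stack frame: "[<address>] symbol+0xoffset/0xsize"
FRAME = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# The faulting function: "EIP is at symbol+..."
EIP = re.compile(r"EIP is at (\w+)\+")

def call_chain(text):
    """Return function names from the outermost caller to the faulting one."""
    frames = [m.group(2) for m in FRAME.finditer(text)]
    chain = list(reversed(frames))
    eip = EIP.search(text)
    if eip:
        chain.append(eip.group(1))
    return chain

print(" -> ".join(call_chain(OOPS)))
# syscall_call -> sys_read -> vfs_read -> seq_read -> show_vfsmnt -> seq_escape
```

Read this way, the crash happened while the mv process was doing a read() on a seq_file, inside show_vfsmnt.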


I wonder if I can go back to 2.6.9-34.0.2.EL. Should I expect problems
with other updated packages?

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com