[CentOS] Help with disk server stability issues

Andrew Zahn wrote:
> Hi All,
>
> I am looking for advice on how to cure an NFS server that keeps 
> crashing, anywhere from every few hours to every few days. The 
> kernel log file (below) points toward NFS as a likely cause.
>
> The system disk is a 3ware 8000 series RAID1 mirror. The data disk
> uses a 3ware 9000 controller to produce two RAID1 devices; these are 
> then striped (RAID0) in software to form a RAID 10 device.  We're 
> using a 2.6 kernel, the xfs filesystem, and NFSv3 over UDP.
>
> We're running CentOS 4.2 with a 2.6.9-22.0.1.106 kernel.  This kernel 
> has xfs extensions, and we're running the xfs filesystem for /home 
> (obtained from CentOS website).
>
> In "lsmod" I see both 3w_xxxx and 3w_9xxx modules.
>
> NFS is over UDP, jumbo frames (9000), 32k rsize/wsize, async server, 
> async clients, noac.
>
> This system has been serving /home in this configuration since October 
> 2005; it has crashed only rarely, and uptimes were usually on the 
> order of months.  This past week, it has not stayed up for much 
> longer than about a day.
>
> Kernel log file containing the crash:
>
> Feb 12 05:52:03 tier2-home kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
> Feb 12 05:52:03 tier2-home kernel:  printing eip:
> Feb 12 05:52:03 tier2-home kernel: 00000000
> Feb 12 05:52:03 tier2-home kernel: *pde = f561f067
> Feb 12 05:52:03 tier2-home kernel: Oops: 0000 [#1]
> Feb 12 05:52:03 tier2-home kernel: PREEMPT SMP
> Feb 12 05:52:03 tier2-home kernel: Modules linked in: nfs nfsd exportfs lockd md5 ipv6 autofs4 i2c_dev i2c_core sunrpc xfs dm_mirror dm_mod button battery ac uhci_hcd shpchp e1000 floppy ext3 jbd 3w_9xxx 3w_xxxx sd_mod scsi_mod
> Feb 12 05:52:03 tier2-home kernel: CPU:    0

<snip>

Hmmm...does it also crash if you run a non-SMP kernel?  Did you update 
the kernel around the same time as the instability began?
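
If it helps to line the crashes up against the date of any kernel update, 
here is a minimal sketch in Python that pulls the Oops timestamps out of a 
syslog file.  The function name and the regular expression are my own 
(hypothetical, not from any tool), and it assumes the log lines look like 
the ones in your excerpt; I have not run it against your logs.

```python
import re

# Matches syslog lines of the form seen in the excerpt, e.g.
#   Feb 12 05:52:03 tier2-home kernel: Oops: 0000 [#1]
# Assumption: standard "Mon DD HH:MM:SS host kernel:" syslog prefix.
OOPS_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+Oops:\s+"
    r"(?P<code>[0-9a-fA-F]+)\s+\[#(?P<count>\d+)\]"
)


def find_oopses(lines):
    """Return (timestamp, host, error_code) for every Oops line found."""
    hits = []
    for line in lines:
        # Drop any leading mail quoting ("> ") so pasted logs work too.
        cleaned = line.lstrip("> ").rstrip()
        match = OOPS_RE.match(cleaned)
        if match:
            hits.append(
                (match.group("stamp"), match.group("host"), match.group("code"))
            )
    return hits
```

Calling find_oopses(open("/var/log/messages")) gives one tuple per Oops; 
the path is an assumption, so point it at wherever your kernel messages 
actually end up.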

Cheers,