[CentOS] NFS issues

Wed Aug 13 16:27:18 UTC 2008
Matthew Kent <matt at bravenet.com>

On Tue, 2008-08-12 at 14:27 +0200, Johan Swensson wrote:
> So I'm running nfs to get content to my web servers. Now I've had this
> problem 2 times (about 2 weeks since the last occurrence).
> I use drbd on the nfs server for redundancy. Now to my problem:
> 
> All my web sites stopped responding, so I started by checking dmesg and
> there I found a bunch of these errors:
> Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
> Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
> 
> But when I checked the nfs server, lockd was running and I could access
> all the files from the webserver with ls, cd etc.

This is exactly the problem we were having here. Once it hits, rebooting
is the only way to recover.
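
For anyone wanting to check whether they are seeing the same symptom, here
is a rough sketch that counts lockd timeouts per NFS server in syslog-style
lines like the ones quoted above. The function name and the idea of
grouping by server address are my own illustration, not part of any
existing tool, and the pattern only covers the exact message shown in the
quote.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
LOCKD_TIMEOUT = re.compile(
    r"kernel: lockd: server (?P<server>\S+) not responding, timed out"
)


def count_lockd_timeouts(lines):
    """Return a Counter mapping NFS server address -> number of lockd timeouts.

    `lines` is any iterable of log lines (hypothetical helper, for illustration).
    """
    hits = Counter()
    for line in lines:
        match = LOCKD_TIMEOUT.search(line)
        if match:
            hits[match.group("server")] += 1
    return hits
```

It can be fed an open log file object directly, for example whatever file
your syslog setup writes kernel messages to. A steadily growing count
against a server that otherwise answers ls and cd fine would match the
behaviour described above.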

As already mentioned further down the thread, the problem was attributed
to this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=453094

My solution was to extract the patch called
linux-2.6-fs-lockd-nlmsvc_lookup_host-called-with-f_sema-held.patch
from the upstream kernel sources at
http://people.redhat.com/dzickus/el5/103.el5/src/
and reroll the latest centosplus kernel srpm with it. The servers have
been fine for 6 days running this kernel.
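
Rerolling an srpm with one extra patch mostly comes down to two additions
to the spec file: a PatchN: declaration and a matching %patchN line in
%prep. Below is a minimal sketch of that edit as a text transform. The
function name, the patch number and the -p1 strip level are assumptions
for illustration only; the real kernel spec may number or apply its
patches differently, so check it by hand before building.

```python
import re


def add_patch_to_spec(spec_text, patch_name, patch_number, strip_level=1):
    """Return spec_text with a PatchN: declaration and a %patchN line added.

    The new lines are placed after the last existing Patch declaration and
    the last existing %patch application, so the new patch is applied last.
    (Hypothetical helper; patch_number and strip_level are caller-chosen.)
    """
    lines = spec_text.splitlines()
    decl_re = re.compile(r"^Patch\d+:", re.IGNORECASE)
    apply_re = re.compile(r"^%patch\d+\b")

    last_decl = max(
        (i for i, line in enumerate(lines) if decl_re.match(line)), default=None
    )
    last_apply = max(
        (i for i, line in enumerate(lines) if apply_re.match(line)), default=None
    )
    if last_decl is None or last_apply is None:
        raise ValueError("spec has no existing Patch/%patch lines to anchor on")

    inserts = [
        (last_decl + 1, f"Patch{patch_number}: {patch_name}"),
        (last_apply + 1, f"%patch{patch_number} -p{strip_level}"),
    ]
    # Insert from the bottom up so earlier indexes stay valid.
    for index, text in sorted(inserts, reverse=True):
        lines.insert(index, text)
    return "\n".join(lines) + "\n"
```

The patch file itself still has to be copied into the SOURCES directory
next to the other kernel patches before the source rpm is rebuilt.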

As much as I hate carrying custom kernel rpms, this is a showstopper for
us, and it looks like the fix won't make it in until 5.3.

Personally, given the limited scope of the patch and the apparent
unwillingness of Red Hat to include it in an update, I'd advocate CentOS
carrying it as a custom patch.

Here's my srpm if anyone wants it:
http://magoazul.com/tmp/kernel-2.6.18-92.1.10.1.el5.centos.plus.src.rpm
The only change is the patch for this issue. Everything builds cleanly
via mock.
-- 
Matthew Kent \ SA \ bravenet.com