So I'm running NFS to serve content to my web servers. I've had this problem twice now (about two weeks since the last occurrence). I use DRBD on the NFS server for redundancy. Now to my problem:
All my web sites stopped responding, so I started by checking dmesg, where I found a bunch of these errors:
Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
But when I checked the NFS server, lockd was running, and I could access all the files from the web server with ls, cd, etc.
The logs on the NFS server don't say anything of interest, and checking Apache's error_log just shows "not found or unable to stat".
As I mentioned, this has happened twice, and both times I've "solved" it by rebooting the NFS server and the web servers. Having to reboot my servers every couple of weeks isn't a good solution, so I could really use some help. :)
Also, I get this from time to time on the web servers; I don't know if it's related:
do_vfs_lock: VFS is out of sync with lock manager!
It happened again last night, but this time I temporarily(?) fixed it by mounting with -o nolock on the web servers. It works, but dmesg is still spamming "lockd: server 192.168.20.22 not responding, timed out". At least my sites are up, and the message isn't critical anymore. But how can I get rid of it?
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
Johan Swensson wrote:
It happened again last night, but this time I temporarily(?) fixed it by mounting with -o nolock on the web servers. It works, but dmesg is still spamming "lockd: server 192.168.20.22 not responding, timed out". At least my sites are up, and the message isn't critical anymore. But how can I get rid of it?
What does 'rpcinfo -p' read on both the servers and the clients?
Also how about /etc/init.d/nfs status (both client and server) and /etc/init.d/nfslock status (both client and server)
Any firewalls in between client and server? Run: iptables -L -n (on both client and server)
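To gather all of the above in one pass, something like this works; a minimal sketch assuming CentOS-style init scripts, to be run on BOTH the NFS server and each web client (missing commands are simply skipped):

```shell
# Collect the lock-related diagnostics asked for above.
rpcinfo -p 2>/dev/null || echo "portmapper not reachable"   # is nlockmgr registered?
/etc/init.d/nfs status 2>/dev/null || true                  # server-side NFS daemons
/etc/init.d/nfslock status 2>/dev/null || true              # rpc.statd / lockd
iptables -L -n 2>/dev/null || true                          # anything filtering 111/2049?
checked=done
```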
nate
On Tue, 2008-08-12 at 20:16 -0700, nate wrote:
Johan Swensson wrote:
It happened again last night, but this time I temporarily(?) fixed it by mounting with -o nolock on the web servers. It works, but dmesg is still spamming "lockd: server 192.168.20.22 not responding, timed out". At least my sites are up, and the message isn't critical anymore. But how can I get rid of it?
What does 'rpcinfo -p' read on both the servers and the clients?
Also how about /etc/init.d/nfs status (both client and server) and /etc/init.d/nfslock status (both client and server)
Any firewalls in between client and server? Run: iptables -L -n (on both client and server)
---- I don't want to step on Johan's thread but wanted to commiserate with him.
No firewalls at present... nfs and nfslock on both client and server are running and show PIDs.
client
[root@cube ~]# rpcinfo -p
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100000    4     0    111  portmapper
    100000    3     0    111  portmapper
    100000    2     0    111  portmapper
    100024    1   udp  50259  status
    100024    1   tcp  53710  status
    100021    1   tcp  53045  nlockmgr
    100021    3   tcp  53045  nlockmgr
    100021    4   tcp  53045  nlockmgr
server
[root@srv1 log]# rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp   4003  status
    100024    1   tcp   4003  status
    100011    1   udp   4000  rquotad
    100011    2   udp   4000  rquotad
    100011    1   tcp   4000  rquotad
    100011    2   tcp   4000  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100021    1   udp   4001  nlockmgr
    100021    3   udp   4001  nlockmgr
    100021    4   udp   4001  nlockmgr
    100021    1   tcp   4001  nlockmgr
    100021    3   tcp   4001  nlockmgr
    100021    4   tcp   4001  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100005    1   udp   4002  mountd
    100005    1   tcp   4002  mountd
    100005    2   udp   4002  mountd
    100005    2   tcp   4002  mountd
    100005    3   udp   4002  mountd
    100005    3   tcp   4002  mountd
Server has ports fixed in place with settings in /etc/sysconfig/nfs
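For anyone wanting to do the same, a sketch of the /etc/sysconfig/nfs entries that pin the daemons to fixed ports; the variable names are the standard CentOS 5 ones, and the port numbers below are simply the ones from the listing above:

```shell
# /etc/sysconfig/nfs -- pin the RPC daemons to fixed ports so a firewall
# can allow them through (portmapper and nfs itself stay on 111/2049).
RQUOTAD_PORT=4000
LOCKD_TCPPORT=4001
LOCKD_UDPPORT=4001
MOUNTD_PORT=4002
STATD_PORT=4003
```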
Craig
No firewall on either end and server responds to ping.
client:
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp    889  status
    100024    1   tcp    892  status

server:
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp    964  status
    100024    1   tcp    967  status
    100011    1   udp    718  rquotad
    100011    2   udp    718  rquotad
    100011    1   tcp    721  rquotad
    100011    2   tcp    721  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100021    1   udp  32768  nlockmgr
    100021    3   udp  32768  nlockmgr
    100021    4   udp  32768  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100021    1   tcp  58027  nlockmgr
    100021    3   tcp  58027  nlockmgr
    100021    4   tcp  58027  nlockmgr
    100005    1   udp    778  mountd
    100005    1   tcp    781  mountd
    100005    2   udp    778  mountd
    100005    2   tcp    781  mountd
    100005    3   udp    778  mountd
    100005    3   tcp    781  mountd
However, I just rebooted the NFS server. When I checked before the reboot, lockd was running according to ps -A.
As Craig said, he started noticing this about the time he upgraded to 5.2; the same goes for me. I started getting this problem about the time I upgraded the clients and the server.
Not wanting to hijack the thread, but since a similar date I've had issues with NFS updates being 'delayed' for anything from two seconds to six hours.
Weird one.
Johan Swensson wrote:
No firewall on either end and server responds to ping.
client:
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp    889  status
    100024    1   tcp    892  status
Doesn't look like nfslock is running on the client?
What does /etc/init.d/nfslock status say?
As Craig said, he started noticing this about the time he upgraded to 5.2; the same goes for me. I started getting this problem about the time I upgraded the clients and the server.
Maybe related to this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=453094
Try restarting nfslock on both client and server when it occurs? Or try setting up a cron job to restart nfslock hourly on all systems, to see if that can work around the issue until a fix comes out?
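Concretely, the cron route could look like this; purely a stopgap sketch, not a fix (the cron.d file name is made up, and the init-script path assumes CentOS 5):

```
# /etc/cron.d/nfslock-restart -- temporary workaround, remove once a
# fixed kernel is installed: bounce the lock manager at the top of each hour
0 * * * * root /etc/init.d/nfslock restart >/dev/null 2>&1
```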
nate
nate wrote:
Johan Swensson wrote:
No firewall on either end and server responds to ping.
client:
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp    889  status
    100024    1   tcp    892  status
Doesn't look like nfslock is running on the client?
What does /etc/init.d/nfslock status say?
[root@web03 ~]# service nfslock status
rpc.statd (pid 2737) is running...
As Craig said, he started noticing this about the time he upgraded to 5.2; the same goes for me. I started getting this problem about the time I upgraded the clients and the server.
Maybe related to this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=453094
Try restarting nfslock on both client and server when it occurs? Or try setting up a cron job to restart nfslock hourly on all systems, to see if that can work around the issue until a fix comes out?
nate
Actually, I tried restarting both nfslock (on the clients and the server) and nfs (on the server), but it didn't help. Is my solution of mounting with nolock bad?
I was also thinking about mounting the NFS shares as soft; is this a good idea? Could it help me? Also, what's the difference between soft and intr? I read the manual, and they seemed pretty similar.
On Wed, Aug 13, 2008 at 09:48, Johan Swensson johan.swensson@apegroup.com wrote:
I was also thinking about mounting the nfs shares as soft, is this a good idea?
No, this is a bad idea. Mounting as soft means that if there are any errors or timeouts, your writes will fail, and most programs don't check the status of those writes, so you will have undetectable data loss.
And also, what's the difference between soft and intr?
Intr (which is a good idea) means that you can use "kill" to stop processes that are hung waiting for the NFS server. The problem with "intr" is that I have never seen it work. When my NFS server goes down, the processes waiting for it stay in "D" state no matter whether I "kill" or even "kill -9" them... So, although "intr" seems like a good idea, in practice it does not make much of a difference.
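For what it's worth, the usual recommendation for writable data is hard,intr rather than soft; an illustrative fstab line (the server name and paths here are made up, not from the thread):

```
# /etc/fstab -- "hard" keeps retrying instead of silently failing writes;
# "intr" at least in theory lets you interrupt processes stuck on a dead server
nfsserver:/export/web  /var/www/html  nfs  rw,hard,intr,tcp  0 0
```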
HTH, Filipe
On Tue, 2008-08-12 at 14:27 +0200, Johan Swensson wrote:
So I'm running NFS to serve content to my web servers. I've had this problem twice now (about two weeks since the last occurrence). I use DRBD on the NFS server for redundancy. Now to my problem:
All my web sites stopped responding, so I started by checking dmesg, where I found a bunch of these errors:
Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
But when I checked the NFS server, lockd was running, and I could access all the files from the web server with ls, cd, etc.
The logs on the NFS server don't say anything of interest, and checking Apache's error_log just shows "not found or unable to stat".
As I mentioned, this has happened twice, and both times I've "solved" it by rebooting the NFS server and the web servers. Having to reboot my servers every couple of weeks isn't a good solution, so I could really use some help. :)
Also, I get this from time to time on the web servers; I don't know if it's related:
do_vfs_lock: VFS is out of sync with lock manager!
---- I too have been having the same issues with my NFS server, which seem to have started when I updated on July 27th (5.2).
It seems to happen after logrotate on Sunday morning, but I didn't know about it until users showed up on Monday morning.
/var/log/messages has...
Aug 4 09:32:59 cube kernel: lockd: server HOSTNAME not responding, still trying
and like you, I've rebooted the main server each time (Monday mornings)... there's something wrong here that I can't figure out.
Craig
On Tue, 2008-08-12 at 14:27 +0200, Johan Swensson wrote:
So I'm running NFS to serve content to my web servers. I've had this problem twice now (about two weeks since the last occurrence). I use DRBD on the NFS server for redundancy. Now to my problem:
All my web sites stopped responding, so I started by checking dmesg, where I found a bunch of these errors:
Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
But when I checked the NFS server, lockd was running, and I could access all the files from the web server with ls, cd, etc.
This is the exact problem we were having here. Rebooting is the only solution.
And as already mentioned further down the thread it was attributed to this https://bugzilla.redhat.com/show_bug.cgi?id=453094
My solution was to extract the patch from the upstream kernel in http://people.redhat.com/dzickus/el5/103.el5/src/ called linux-2.6-fs-lockd-nlmsvc_lookup_host-called-with-f_sema-held.patch
and reroll the latest centosplus kernel srpm with it. Servers have been fine for 6 days running this kernel.
As much as I hate carrying custom kernel rpms, this is a showstopper for us, and it looks like it won't make it in until 5.3.
Personally, given the limited scope of the patch and the apparent unwillingness of Red Hat to include it in an update, I'd advocate CentOS carrying it as a custom patch.
Here's my srpm if anyone wants it: http://magoazul.com/tmp/kernel-2.6.18-92.1.10.1.el5.centos.plus.src.rpm
The only change is the patch for this issue. Everything builds cleanly via mock.
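For anyone wanting to repeat the reroll, it went roughly like this; a sketch only (the mock config name and the RHEL5 build paths are assumptions, and the spec edit adding the Patch/%patch lines is done by hand):

```
# install the centosplus kernel source package
rpm -i kernel-2.6.18-92.1.10.1.el5.centos.plus.src.rpm
# drop the lockd patch next to the other sources
cp linux-2.6-fs-lockd-nlmsvc_lookup_host-called-with-f_sema-held.patch \
   /usr/src/redhat/SOURCES/
# edit /usr/src/redhat/SPECS/kernel-2.6.spec by hand: add PatchNNNN and %patch lines
rpmbuild -bs /usr/src/redhat/SPECS/kernel-2.6.spec
# rebuild in a clean chroot
mock -r centos-5-x86_64 --rebuild /usr/src/redhat/SRPMS/kernel-*.src.rpm
```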
On Wed, Aug 13, 2008 at 9:27 AM, Matthew Kent matt@bravenet.com wrote:
This is the exact problem we were having here. Rebooting is the only solution.
And as already mentioned further down the thread it was attributed to this https://bugzilla.redhat.com/show_bug.cgi?id=453094
My solution was to extract the patch from the upstream kernel in http://people.redhat.com/dzickus/el5/103.el5/src/ called linux-2.6-fs-lockd-nlmsvc_lookup_host-called-with-f_sema-held.patch
and reroll the latest centosplus kernel srpm with it. Servers have been fine for 6 days running this kernel.
As much as I hate carrying custom kernel rpms, this is a showstopper for us, and it looks like it won't make it in until 5.3.
Personally, given the limited scope of the patch and the apparent unwillingness of Red Hat to include it in an update, I'd advocate CentOS carrying it as a custom patch.
Here's my srpm if anyone wants it: http://magoazul.com/tmp/kernel-2.6.18-92.1.10.1.el5.centos.plus.src.rpm
The only change is the patch for this issue. Everything builds cleanly via mock.
-- Matthew Kent \ SA \ bravenet.com
The CentOS developer Tru compiled a patched version of the regular kernel and is offering it at:
http://people.centos.org/tru/kernel+bz453094/
Also, the fix will be in the upcoming kernel-2.6.18-92.1.13.el5 according to the bugzilla referred to above.
Akemi
On Thu, Sep 4, 2008 at 7:35 AM, Akemi Yagi amyagi@gmail.com wrote:
CentOS developer, Tru, compiled a patched version of regular kernel and is offering it at:
http://people.centos.org/tru/kernel+bz453094/
Also, the fix will be in the upcoming kernel-2.6.18-92.1.13.el5 according to the bugzilla referred to above.
The bugzilla link is actually this one:
https://bugzilla.redhat.com/show_bug.cgi?id=459083
Akemi
On Thu, Sep 4, 2008 at 8:09 AM, Akemi Yagi amyagi@gmail.com wrote:
On Thu, Sep 4, 2008 at 7:35 AM, Akemi Yagi amyagi@gmail.com wrote:
CentOS developer, Tru, compiled a patched version of regular kernel and is offering it at:
http://people.centos.org/tru/kernel+bz453094/
Also, the fix will be in the upcoming kernel-2.6.18-92.1.13.el5 according to the bugzilla referred to above.
The bugzilla link is actually this one:
https://bugzilla.redhat.com/show_bug.cgi?id=459083
Akemi
kernel-2.6.18-92.1.13.el5 is out (upstream):
http://rhn.redhat.com/errata/RHSA-2008-0885.html
Akemi
On Wed, 2008-09-24 at 13:38 -0700, Akemi Yagi wrote:
On Thu, Sep 4, 2008 at 8:09 AM, Akemi Yagi amyagi@gmail.com wrote:
On Thu, Sep 4, 2008 at 7:35 AM, Akemi Yagi amyagi@gmail.com wrote:
CentOS developer, Tru, compiled a patched version of regular kernel and is offering it at:
http://people.centos.org/tru/kernel+bz453094/
Also, the fix will be in the upcoming kernel-2.6.18-92.1.13.el5 according to the bugzilla referred to above.
The bugzilla link is actually this one:
https://bugzilla.redhat.com/show_bug.cgi?id=459083
Akemi
kernel-2.6.18-92.1.13.el5 is out (upstream):
---- Yep, and I'm still running an old kernel to get around this. Got the notification from bugzilla today myself - hooray!
Craig