So I'm running NFS to serve content to my web servers. I've had this problem twice now (about two weeks since the last occurrence). I use DRBD on the NFS server for redundancy. Now to my problem:
All my web sites stopped responding, so I started by checking dmesg, where I found a bunch of these errors:
Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
But when I checked the NFS server, lockd was running, and I could access all the files from the web server with ls, cd, etc.
The logs on the NFS server don't say anything of interest, and checking Apache's error_log just shows "not found or unable to stat".
As I mentioned, this has happened twice, and both times I've "solved" it by rebooting the NFS server and the web servers. Having to reboot my servers every couple of weeks isn't a good solution, so I could really use some help. :)
Also, I get this from time to time on the web servers; I don't know if it's related:
do_vfs_lock: VFS is out of sync with lock manager!
It happened again last night, but this time I temporarily(?) fixed it by mounting with -o nolock on the web servers. It works, but dmesg is still spamming "lockd: server 192.168.20.22 not responding, timed out". At least my sites are up, and the message isn't critical anymore. But how can I get rid of it?
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
Johan Swensson wrote:
It happened again last night, but this time I temporarily(?) fixed it by mounting with -o nolock on the web servers. It works, but dmesg is still spamming "lockd: server 192.168.20.22 not responding, timed out". At least my sites are up, and the message isn't critical anymore. But how can I get rid of it?
What does 'rpcinfo -p' read on both the servers and the clients?
Also how about /etc/init.d/nfs status (both client and server) and /etc/init.d/nfslock status (both client and server)
Any firewalls in between client and server? Run: iptables -L -n (on both client and server)
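To gather all of the above in one pass, something like this works; a minimal sketch assuming CentOS-style init scripts, to be run on BOTH the NFS server and each web client (missing commands are simply skipped):

```shell
# Collect the lock-related diagnostics asked for above.
rpcinfo -p 2>/dev/null || echo "portmapper not reachable"   # is nlockmgr registered?
/etc/init.d/nfs status 2>/dev/null || true                  # server-side NFS daemons
/etc/init.d/nfslock status 2>/dev/null || true              # rpc.statd / lockd
iptables -L -n 2>/dev/null || true                          # anything filtering 111/2049?
checked=done
```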
nate
On Tue, 2008-08-12 at 20:16 -0700, nate wrote:
Johan Swensson wrote:
It happened again last night, but this time I temporarily(?) fixed it by mounting with -o nolock on the web servers. It works, but dmesg is still spamming "lockd: server 192.168.20.22 not responding, timed out". At least my sites are up, and the message isn't critical anymore. But how can I get rid of it?
What does 'rpcinfo -p' read on both the servers and the clients?
Also how about /etc/init.d/nfs status (both client and server) and /etc/init.d/nfslock status (both client and server)
Any firewalls in between client and server? Run: iptables -L -n (on both client and server)
---- I don't want to step on Johan's thread but wanted to commiserate with him.
No firewalls at present... nfs and nfslock on both client and server are running and show PIDs.
client
[root@cube ~]# rpcinfo -p
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100000    4     0    111  portmapper
    100000    3     0    111  portmapper
    100000    2     0    111  portmapper
    100024    1   udp  50259  status
    100024    1   tcp  53710  status
    100021    1   tcp  53045  nlockmgr
    100021    3   tcp  53045  nlockmgr
    100021    4   tcp  53045  nlockmgr
server
[root@srv1 log]# rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp   4003  status
    100024    1   tcp   4003  status
    100011    1   udp   4000  rquotad
    100011    2   udp   4000  rquotad
    100011    1   tcp   4000  rquotad
    100011    2   tcp   4000  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100021    1   udp   4001  nlockmgr
    100021    3   udp   4001  nlockmgr
    100021    4   udp   4001  nlockmgr
    100021    1   tcp   4001  nlockmgr
    100021    3   tcp   4001  nlockmgr
    100021    4   tcp   4001  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100005    1   udp   4002  mountd
    100005    1   tcp   4002  mountd
    100005    2   udp   4002  mountd
    100005    2   tcp   4002  mountd
    100005    3   udp   4002  mountd
    100005    3   tcp   4002  mountd
Server has ports fixed in place with settings in /etc/sysconfig/nfs
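For anyone wanting to do the same, a sketch of the /etc/sysconfig/nfs entries that pin the daemons to fixed ports; the variable names are the standard CentOS 5 ones, and the port numbers below are simply the ones from the listing above:

```shell
# /etc/sysconfig/nfs -- pin the RPC daemons to fixed ports so a firewall
# can allow them through (portmapper and nfs itself stay on 111/2049).
RQUOTAD_PORT=4000
LOCKD_TCPPORT=4001
LOCKD_UDPPORT=4001
MOUNTD_PORT=4002
STATD_PORT=4003
```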
Craig
No firewall on either end and server responds to ping.
client:
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp    889  status
    100024    1   tcp    892  status

server:
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp    964  status
    100024    1   tcp    967  status
    100011    1   udp    718  rquotad
    100011    2   udp    718  rquotad
    100011    1   tcp    721  rquotad
    100011    2   tcp    721  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100021    1   udp  32768  nlockmgr
    100021    3   udp  32768  nlockmgr
    100021    4   udp  32768  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100021    1   tcp  58027  nlockmgr
    100021    3   tcp  58027  nlockmgr
    100021    4   tcp  58027  nlockmgr
    100005    1   udp    778  mountd
    100005    1   tcp    781  mountd
    100005    2   udp    778  mountd
    100005    2   tcp    781  mountd
    100005    3   udp    778  mountd
    100005    3   tcp    781  mountd
However, I just rebooted the NFS server. When I checked before the reboot, lockd was running according to ps -A.
As Craig said, he started noticing this about the time he upgraded to 5.2; the same goes for me. I started getting this problem about the time I upgraded the clients and the server.
Not wanting to hijack the thread, but since a similar date I've had issues with NFS updates being 'delayed' for anything from two seconds to six hours.
Weird one.
Johan Swensson wrote:
No firewall on either end and server responds to ping.
client:
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp    889  status
    100024    1   tcp    892  status
Doesn't look like nfslock is running on the client?
What does /etc/init.d/nfslock status say?
As Craig said, he started noticing this about the time he upgraded to 5.2; the same goes for me. I started getting this problem about the time I upgraded the clients and the server.
Maybe related to this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=453094
Try restarting nfslock on both client and server when it occurs? Or try setting up a cron job to restart nfslock hourly on all systems, to see if that can work around the issue until a fix comes out?
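Concretely, the cron route could look like this; purely a stopgap sketch, not a fix (the cron.d file name is made up, and the init-script path assumes CentOS 5):

```
# /etc/cron.d/nfslock-restart -- temporary workaround, remove once a
# fixed kernel is installed: bounce the lock manager at the top of each hour
0 * * * * root /etc/init.d/nfslock restart >/dev/null 2>&1
```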
nate
nate wrote:
Johan Swensson wrote:
No firewall on either end and server responds to ping.
client:
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp    889  status
    100024    1   tcp    892  status
Doesn't look like nfslock is running on the client?
What does /etc/init.d/nfslock status say?
[root@web03 ~]# service nfslock status
rpc.statd (pid 2737) is running...
As Craig said, he started noticing this about the time he upgraded to 5.2; the same goes for me. I started getting this problem about the time I upgraded the clients and the server.
Maybe related to this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=453094
Try restarting nfslock on both client and server when it occurs? Or try setting up a cron job to restart nfslock hourly on all systems, to see if that can work around the issue until a fix comes out?
nate
Actually, I tried restarting both nfslock (on the clients and the server) and nfs (on the server), but it didn't help. Is my solution of mounting with nolock bad?
I was also thinking about mounting the NFS shares as soft; is this a good idea? Could it help me? Also, what's the difference between soft and intr? I read the manual, and they seemed pretty similar.
On Wed, Aug 13, 2008 at 09:48, Johan Swensson johan.swensson@apegroup.com wrote:
I was also thinking about mounting the nfs shares as soft, is this a good idea?
No, this is a bad idea. Mounting as soft means that if there are any errors or timeouts, your writes will fail, and most programs don't check the status of those writes, so you will have undetectable data loss.
And also, what's the difference between soft and intr?
Intr (which is a good idea) means that you can use "kill" to stop processes that are hung waiting for the NFS server. The problem with "intr" is that I have never seen it work. When my NFS server goes down, the processes waiting for it stay in "D" state no matter whether I "kill" or even "kill -9" them... So, although "intr" seems like a good idea, in practice it does not make much of a difference.
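For what it's worth, the usual recommendation for writable data is hard,intr rather than soft; an illustrative fstab line (the server name and paths here are made up, not from the thread):

```
# /etc/fstab -- "hard" keeps retrying instead of silently failing writes;
# "intr" at least in theory lets you interrupt processes stuck on a dead server
nfsserver:/export/web  /var/www/html  nfs  rw,hard,intr,tcp  0 0
```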
HTH, Filipe
On Tue, 2008-08-12 at 14:27 +0200, Johan Swensson wrote:
So I'm running NFS to serve content to my web servers. I've had this problem twice now (about two weeks since the last occurrence). I use DRBD on the NFS server for redundancy. Now to my problem:
All my web sites stopped responding, so I started by checking dmesg, where I found a bunch of these errors:
Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
But when I checked the NFS server, lockd was running, and I could access all the files from the web server with ls, cd, etc.
The logs on the NFS server don't say anything of interest, and checking Apache's error_log just shows "not found or unable to stat".
As I mentioned, this has happened twice, and both times I've "solved" it by rebooting the NFS server and the web servers. Having to reboot my servers every couple of weeks isn't a good solution, so I could really use some help. :)
Also, I get this from time to time on the web servers; I don't know if it's related:
do_vfs_lock: VFS is out of sync with lock manager!
---- I too have been having the same issues with my NFS server, which seem to have started when I updated on July 27th (5.2).
It seems to happen after logrotate on Sunday morning, but I didn't know about it until users showed up on Monday morning.
/var/log/messages has...
Aug 4 09:32:59 cube kernel: lockd: server HOSTNAME not responding, still trying
and like you, I've rebooted the main server each time (Monday mornings)... there's something wrong here that I can't figure out.
Craig
On Tue, 2008-08-12 at 14:27 +0200, Johan Swensson wrote:
So I'm running NFS to serve content to my web servers. I've had this problem twice now (about two weeks since the last occurrence). I use DRBD on the NFS server for redundancy. Now to my problem:
All my web sites stopped responding, so I started by checking dmesg, where I found a bunch of these errors:
Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
But when I checked the NFS server, lockd was running, and I could access all the files from the web server with ls, cd, etc.
This is the exact problem we were having here. Rebooting is the only solution.
And as already mentioned further down the thread it was attributed to this https://bugzilla.redhat.com/show_bug.cgi?id=453094
My solution was to extract the patch from the upstream kernel in http://people.redhat.com/dzickus/el5/103.el5/src/ called linux-2.6-fs-lockd-nlmsvc_lookup_host-called-with-f_sema-held.patch
and reroll the latest centosplus kernel srpm with it. Servers have been fine for 6 days running this kernel.
As much as I hate carrying custom kernel rpms, this is a showstopper for us, and it looks like it won't make it in until 5.3.
Personally, given the limited scope of the patch and the apparent unwillingness of Red Hat to include it in an update, I'd advocate CentOS carrying it as a custom patch.
Here's my srpm if anyone wants it: http://magoazul.com/tmp/kernel-2.6.18-92.1.10.1.el5.centos.plus.src.rpm
The only change is the patch for this issue. Everything builds cleanly via mock.
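For anyone wanting to repeat the reroll, it went roughly like this; a sketch only (the mock config name and the RHEL5 build paths are assumptions, and the spec edit adding the Patch/%patch lines is done by hand):

```
# install the centosplus kernel source package
rpm -i kernel-2.6.18-92.1.10.1.el5.centos.plus.src.rpm
# drop the lockd patch next to the other sources
cp linux-2.6-fs-lockd-nlmsvc_lookup_host-called-with-f_sema-held.patch \
   /usr/src/redhat/SOURCES/
# edit /usr/src/redhat/SPECS/kernel-2.6.spec by hand: add PatchNNNN and %patch lines
rpmbuild -bs /usr/src/redhat/SPECS/kernel-2.6.spec
# rebuild in a clean chroot
mock -r centos-5-x86_64 --rebuild /usr/src/redhat/SRPMS/kernel-*.src.rpm
```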
On Wed, Aug 13, 2008 at 9:27 AM, Matthew Kent matt@bravenet.com wrote:
This is the exact problem we were having here. Rebooting is the only solution.
And as already mentioned further down the thread it was attributed to this https://bugzilla.redhat.com/show_bug.cgi?id=453094
My solution was to extract the patch from the upstream kernel in http://people.redhat.com/dzickus/el5/103.el5/src/ called linux-2.6-fs-lockd-nlmsvc_lookup_host-called-with-f_sema-held.patch
and reroll the latest centosplus kernel srpm with it. Servers have been fine for 6 days running this kernel.
As much as I hate carrying custom kernel rpms, this is a showstopper for us, and it looks like it won't make it in until 5.3.
Personally, given the limited scope of the patch and the apparent unwillingness of Red Hat to include it in an update, I'd advocate CentOS carrying it as a custom patch.
Here's my srpm if anyone wants it: http://magoazul.com/tmp/kernel-2.6.18-92.1.10.1.el5.centos.plus.src.rpm
The only change is the patch for this issue. Everything builds cleanly via mock.
-- Matthew Kent \ SA \ bravenet.com
The CentOS developer Tru compiled a patched version of the regular kernel and is offering it at:
http://people.centos.org/tru/kernel+bz453094/
Also, the fix will be in the upcoming kernel-2.6.18-92.1.13.el5 according to the bugzilla referred to above.
Akemi
On Thu, Sep 4, 2008 at 7:35 AM, Akemi Yagi amyagi@gmail.com wrote:
CentOS developer, Tru, compiled a patched version of regular kernel and is offering it at:
http://people.centos.org/tru/kernel+bz453094/
Also, the fix will be in the upcoming kernel-2.6.18-92.1.13.el5 according to the bugzilla referred to above.
The bugzilla link is actually this one:
https://bugzilla.redhat.com/show_bug.cgi?id=459083
Akemi
On Thu, Sep 4, 2008 at 8:09 AM, Akemi Yagi amyagi@gmail.com wrote:
On Thu, Sep 4, 2008 at 7:35 AM, Akemi Yagi amyagi@gmail.com wrote:
CentOS developer, Tru, compiled a patched version of regular kernel and is offering it at:
http://people.centos.org/tru/kernel+bz453094/
Also, the fix will be in the upcoming kernel-2.6.18-92.1.13.el5 according to the bugzilla referred to above.
The bugzilla link is actually this one:
https://bugzilla.redhat.com/show_bug.cgi?id=459083
Akemi
kernel-2.6.18-92.1.13.el5 is out (upstream):
http://rhn.redhat.com/errata/RHSA-2008-0885.html
Akemi
On Wed, 2008-09-24 at 13:38 -0700, Akemi Yagi wrote:
On Thu, Sep 4, 2008 at 8:09 AM, Akemi Yagi amyagi@gmail.com wrote:
On Thu, Sep 4, 2008 at 7:35 AM, Akemi Yagi amyagi@gmail.com wrote:
CentOS developer, Tru, compiled a patched version of regular kernel and is offering it at:
http://people.centos.org/tru/kernel+bz453094/
Also, the fix will be in the upcoming kernel-2.6.18-92.1.13.el5 according to the bugzilla referred to above.
The bugzilla link is actually this one:
https://bugzilla.redhat.com/show_bug.cgi?id=459083
Akemi
kernel-2.6.18-92.1.13.el5 is out (upstream):
---- Yep, and I'm still running an old kernel to get around this. Got the notification from bugzilla today myself - hooray!
Craig