[CentOS] NFS4 issue

Mon Nov 23 01:25:41 UTC 2009
Philip Manuel <phil at zomojo.com>

We are running kernel 2.6.18-164.6.1.el5 and exporting 3 AoE-provided 
ext4 directories. For a couple of weeks we had a small number of users 
on the system with no issues. Today we added 7 users, the system 
crashed, and it has not performed correctly since.

Nov 23 10:20:03 sulphur rpc.idmapd[5199]: nfsdcb: id '-2' too big!
Nov 23 10:42:25 sulphur nfsd[27306]: nfssvc: Setting version failed: 
errno 16 (Device or resource busy)
Nov 23 10:42:25 sulphur nfsd[27306]: nfssvc: unable to bind UPD socket: 
errno 98 (Address already in use)
Nov 23 10:42:26 sulphur kernel: slab error in kmem_cache_destroy(): 
cache `nfsd4_files': Can't free all objects
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88645efd>] 
:nfsd:nfsd4_free_slab+0x11/0x4d
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88645f55>] 
:nfsd:nfsd4_free_slabs+0x1c/0x33
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88646ecb>] 
:nfsd:nfs4_state_shutdown+0x17e/0x18a
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88630570>] 
:nfsd:nfsd_last_thread+0x45/0x76
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88630856>] :nfsd:nfsd+0x2b5/0x2cb
Nov 23 10:42:26 sulphur kernel:  [<ffffffff886305a1>] :nfsd:nfsd+0x0/0x2cb
Nov 23 10:42:26 sulphur kernel:  [<ffffffff886305a1>] :nfsd:nfsd+0x0/0x2cb
Nov 23 10:42:26 sulphur kernel: BUG: warning at 
fs/nfsd/nfs4state.c:1016/nfsd4_free_slab() (Tainted: G     )
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88645f55>] 
:nfsd:nfsd4_free_slabs+0x1c/0x33
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88646ecb>] 
:nfsd:nfs4_state_shutdown+0x17e/0x18a
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88630570>] 
:nfsd:nfsd_last_thread+0x45/0x76
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88630856>] :nfsd:nfsd+0x2b5/0x2cb
Nov 23 10:42:26 sulphur kernel:  [<ffffffff886305a1>] :nfsd:nfsd+0x0/0x2cb
Nov 23 10:42:26 sulphur kernel:  [<ffffffff886305a1>] :nfsd:nfsd+0x0/0x2cb
Nov 23 10:42:26 sulphur kernel: slab error in kmem_cache_destroy(): 
cache `nfsd4_delegations': Can't free all objects
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88645efd>] 
:nfsd:nfsd4_free_slab+0x11/0x4d
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88646ecb>] 
:nfsd:nfs4_state_shutdown+0x17e/0x18a
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88630570>] 
:nfsd:nfsd_last_thread+0x45/0x76
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88630856>] :nfsd:nfsd+0x2b5/0x2cb
Nov 23 10:42:26 sulphur kernel:  [<ffffffff886305a1>] :nfsd:nfsd+0x0/0x2cb
Nov 23 10:42:26 sulphur kernel:  [<ffffffff886305a1>] 
:nfsd:nfsd+0x0/0x2cb                   
Nov 23 10:42:26 sulphur kernel: BUG: warning at 
fs/nfsd/nfs4state.c:1016/nfsd4_free_slab() (Tainted: G     )
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88646ecb>] 
:nfsd:nfs4_state_shutdown+0x17e/0x18a  
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88630570>] 
:nfsd:nfsd_last_thread+0x45/0x76       
Nov 23 10:42:26 sulphur kernel:  [<ffffffff88630856>] 
:nfsd:nfsd+0x2b5/0x2cb                 
Nov 23 10:42:26 sulphur kernel:  [<ffffffff886305a1>] 
:nfsd:nfsd+0x0/0x2cb                   
Nov 23 10:42:26 sulphur kernel:  [<ffffffff886305a1>] 
:nfsd:nfsd+0x0/0x2cb                   
Nov 23 10:42:26 sulphur kernel: nfsd: last server has 
exited                                 
Nov 23 10:42:26 sulphur kernel: nfsd: unexporting all 
filesystems                            
Nov 23 10:42:44 sulphur kernel: kmem_cache_create: duplicate cache 
nfsd4_files               
Nov 23 10:42:44 sulphur kernel:  [<ffffffff88646f29>] 
:nfsd:nfs4_state_start+0x52/0x18f      
Nov 23 10:42:44 sulphur kernel:  [<ffffffff886303ae>] 
:nfsd:nfsd_svc+0x6c/0x1e9              
Nov 23 10:42:44 sulphur kernel:  [<ffffffff88630f8e>] 
:nfsd:write_threads+0x0/0xa9           
Nov 23 10:42:44 sulphur kernel:  [<ffffffff88630ffd>] 
:nfsd:write_threads+0x6f/0xa9          
Nov 23 10:42:44 sulphur kernel:  [<ffffffff88630f8e>] 
:nfsd:write_threads+0x0/0xa9           
Nov 23 10:42:44 sulphur kernel:  [<ffffffff88630d59>] 
:nfsd:nfsctl_transaction_write+0x42/0x77
Nov 23 10:42:44 sulphur nfsd[27369]: nfssvc: Cannot allocate memory
Nov 23 10:43:55 sulphur nfsd[27495]: nfssvc: Setting version failed: 
errno 16 (Device or resource busy)

Nov 23 10:43:55 sulphur nfsd[27495]: nfssvc: unable to bind UPD socket: 
errno 98 (Address already in use)

The log above shows the original problem, followed by my attempts to 
restart nfsd; eventually I had to reboot the server.  Since then it has 
been behaving bizarrely: it runs for 5 minutes or so and then stops, and 
after a restart it runs for a while and then stops again.

Nov 23 11:04:46 sulphur kernel: NFSD: Using /var/lib/nfs/v4recovery as 
the NFSv4 state recovery directory
Nov 23 11:17:02 sulphur rpc.idmapd[8178]: nfsdcb: id '-2' too big!
Nov 23 11:29:01 sulphur kernel: nfsd: last server has exited
Nov 23 11:29:01 sulphur kernel: nfsd: unexporting all filesystems
Nov 23 11:29:08 sulphur kernel: NFSD: Using /var/lib/nfs/v4recovery as 
the NFSv4 state recovery directory
Nov 23 11:29:08 sulphur rpc.idmapd[8178]: nfsdcb: id '-2' too big!
Nov 23 11:32:03 sulphur kernel: nfsd: last server has exited
Nov 23 11:32:03 sulphur kernel: nfsd: unexporting all filesystems
Nov 23 11:32:34 sulphur kernel: NFSD: Using /var/lib/nfs/v4recovery as 
the NFSv4 state recovery directory
Nov 23 11:32:34 sulphur rpc.idmapd[8178]: nfsdcb: id '-2' too big!
Nov 23 11:41:58 sulphur kernel: nfsd: last server has exited
Nov 23 11:41:58 sulphur kernel: nfsd: unexporting all filesystems
Nov 23 11:42:03 sulphur kernel: NFSD: Using /var/lib/nfs/v4recovery as 
the NFSv4 state recovery directory
Nov 23 11:42:03 sulphur rpc.idmapd[8178]: nfsdcb: id '-2' too big!
Nov 23 11:47:20 sulphur kernel: nfsd: last server has exited
Nov 23 11:47:20 sulphur kernel: nfsd: unexporting all filesystems
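
To put numbers on the "runs for a while and then stops" pattern, here is a
rough Python sketch that pairs each nfsd start line with the following exit
line and reports how long the server stayed up. The function name
nfsd_uptimes is made up for illustration, the year 2009 is assumed from the
date of this post (syslog lines carry no year), and the start and stop
marker strings are simply copied from the log excerpt above. It also assumes
an English locale for parsing the month name.

```python
from datetime import datetime

# Marker strings copied from the log excerpt; treat them as assumptions
# about what a start and a stop look like on this particular server.
START = "NFSD: Using /var/lib/nfs/v4recovery"
STOP = "nfsd: last server has exited"

def nfsd_uptimes(lines, year=2009):
    """Return (start, stop, seconds) for each start/stop pair found."""
    runs = []
    started = None
    for line in lines:
        # A syslog timestamp is the first 15 characters: "Nov 23 11:04:46".
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}",
                                      "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # wrapped continuation line, not a new log entry
        if START in line:
            started = stamp
        elif STOP in line and started is not None:
            seconds = int((stamp - started).total_seconds())
            runs.append((started, stamp, seconds))
            started = None
    return runs

# Start and stop lines taken from the excerpt above (wrapped as posted).
LOG = """\
Nov 23 11:04:46 sulphur kernel: NFSD: Using /var/lib/nfs/v4recovery as
the NFSv4 state recovery directory
Nov 23 11:29:01 sulphur kernel: nfsd: last server has exited
Nov 23 11:29:08 sulphur kernel: NFSD: Using /var/lib/nfs/v4recovery as
the NFSv4 state recovery directory
Nov 23 11:32:03 sulphur kernel: nfsd: last server has exited
Nov 23 11:32:34 sulphur kernel: NFSD: Using /var/lib/nfs/v4recovery as
the NFSv4 state recovery directory
Nov 23 11:41:58 sulphur kernel: nfsd: last server has exited
Nov 23 11:42:03 sulphur kernel: NFSD: Using /var/lib/nfs/v4recovery as
the NFSv4 state recovery directory
Nov 23 11:47:20 sulphur kernel: nfsd: last server has exited
""".splitlines()

runs = nfsd_uptimes(LOG)
for start, stop, seconds in runs:
    print(f"{start:%H:%M:%S} -> {stop:%H:%M:%S}  up {seconds // 60}m{seconds % 60:02d}s")
```

On the lines above this gives four runs of roughly 24, 3, 9 and 5 minutes,
so the uptime is not a fixed interval.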

I haven't found any reports of issues relating to the "nfsdcb: id '-2' 
too big!" message, but equally I don't know what it means.

On the console we are seeing loads of these messages:

kernel: NFSD: preprocess_seqid_op: magic stateid!

Again, I don't know what this message means or what its implications are.

Any suggestions would be welcome.

At the moment we are up with two users migrated back to the old servers.

Thanks

Phil.