On Fri, Mar 16, 2012 at 01:33:54PM -0700, Aaron Blew wrote:
Hello all, I'm currently experiencing an issue with an NFS server I've built (a Dell R710 with a Dell PERC H800/LSI 2108 and four external disk trays). It's a backup target for Solaris 10, CentOS 5.5 and CentOS 6.2 servers that mount its data volume via NFS. It has two 10gig NICs set up in a layer2+3 bond for one network, and two more 10gig NICs set up the same way on another network. The host has a 99T XFS filesystem for the backups. RPCNFSDCOUNT is set to 256.
During backups from clients the system exhibits odd hangs that interfere with some of our sensitive systems' backup windows. On the NFS server side we see the following in dmesg. Originally I thought it was related to dirty writeback cache, but I adjusted dirty_writeback_centisecs and am still seeing the issue.
dmesg during the problem window:
Mar 16 07:01:21 *****store01 kernel: __ratelimit: 11 callbacks suppressed
Mar 16 07:01:21 *****store01 kernel: nfsd: page allocation failure.
<snip>
Has anyone else seen similar issues? I can provide additional details about the server/configuration if anybody needs them. The issue only seems to occur under high write load; we've restored some of these backups and didn't have any trouble reading the data.
The page allocation failure message made me wonder if your issue could be related to the issue I've run into here[1] on RHEL 6.2.
My issue seems to be related to NFS mounting, but it's possible the root cause could be the same?
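One rough way to compare the two cases is to tally which tasks are reporting the allocation failures in the kernel log. The following is only a sketch: it assumes the log lines follow the "kernel: <task>: page allocation failure" shape shown in the dmesg excerpt above, and the function name tally_alloc_failures is made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumed log shape: "... kernel: <task>: page allocation failure ..."
# The task name is taken as the non-space token right before the message.
PATTERN = re.compile(r"kernel: (?P<task>\S+?): page allocation failure")


def tally_alloc_failures(lines):
    """Count page allocation failure messages per reporting task."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("task")] += 1
    return counts


if __name__ == "__main__":
    # Reads log lines on stdin and prints one "task: count" line per task.
    for task, count in tally_alloc_failures(sys.stdin).most_common():
        print(f"{task}: {count}")
```

Saved to a file, it could be fed /var/log/messages on stdin. If nfsd is the only task failing on your server while mine shows something else, the root causes may differ after all.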
A few other links:
https://bugzilla.redhat.com/show_bug.cgi?id=593035
http://www.spinics.net/lists/linux-nfs/msg22248.html
Red Hat has provided me with a test kernel which purportedly will resolve the issue. I haven't had a chance to test it out yet.
Ray