[CentOS] NFS Hanging Under Heavy Load

Fri Mar 30 21:15:31 UTC 2012
Ray Van Dolson <rayvd at bludgeon.org>

Hope to get it installed this weekend.

Ray

On Fri, Mar 30, 2012 at 01:33:10PM -0700, Aaron Blew wrote:
> UPDATE
> 
> I rolled a new kernel that's identical to the stock CentOS 2.6.32-220.el6
> kernel with the exception of the new idmapper being enabled.  Unfortunately
> there's been no improvement.
> 
> Did you get a chance to try the RHEL kernel?
> 
> -Aaron
> 
> 
> On Fri, Mar 16, 2012 at 7:01 PM, Ray Van Dolson <rayvd at bludgeon.org> wrote:
> 
> > On Fri, Mar 16, 2012 at 01:33:54PM -0700, Aaron Blew wrote:
> > > Hello all,
> > > I'm currently experiencing an issue with an NFS server I've built (a Dell
> > > R710 with a Dell PERC H800/LSI 2108 and four external disk trays).  It's a
> > > backup target for Solaris 10, CentOS 5.5 and CentOS 6.2 servers that mount
> > > its data volume via NFS.  It has two 10gig NICs set up in a layer2+3 bond
> > > for one network, and two more 10gig NICs set up in the same way in another
> > > network.  The host has a 99T XFS filesystem for the backups.  RPCNFSDCOUNT
> > > is set to 256.
> > >
> > > During backups from clients the system exhibits odd hangs that interfere
> > > with some of our sensitive systems' backup windows.  On the NFS server side
> > > we see the following in dmesg.  Originally I thought it was related to
> > > dirty writeback cache, but I adjusted dirty_writeback_centisecs and am
> > > still seeing the issue.
> > >
> > > dmesg during the problem window:
> > > Mar 16 07:01:21 *****store01 kernel: __ratelimit: 11 callbacks suppressed
> > > Mar 16 07:01:21 *****store01 kernel: nfsd: page allocation failure.
> >
> > <snip>
> >
> > >
> > > Has anyone else seen similar issues?  I can provide additional details
> > > about the server/configuration if anybody needs anything else.  The issue
> > > only seems to occur under high write load: we've restored some of these
> > > backups and didn't have any trouble reading the data.
> >
> > The page allocation failure message made me wonder if your issue could
> > be related to the issue I've run into here[1] on RHEL 6.2.
> >
> > My issue seems to be related to NFS mounting, but it's possible the
> > root cause could be the same?
> >
> > A few other links:
> >
> >  https://bugzilla.redhat.com/show_bug.cgi?id=593035
> >  http://www.spinics.net/lists/linux-nfs/msg22248.html
> >
> > Red Hat has provided me with a test kernel which purportedly will
> > resolve the issue.  I haven't had a chance to test it out yet.
> >
> > Ray
> >
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=751992
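
For anyone trying to correlate the hangs with the "page allocation failure"
messages quoted above, here is a minimal sketch (not something used in this
thread) that tallies those messages per minute and per process from
syslog-style lines.  The function name count_failures and the regular
expression are my own assumptions, modelled only on the two log lines quoted
in the original report; other syslog formats will need a different pattern.

```python
import re
from collections import Counter

# Assumed line shape, taken from the quoted report:
#   Mar 16 07:01:21 <host> kernel: nfsd: page allocation failure.
FAILURE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<proc>[^:]+):\s+page allocation failure"
)


def count_failures(lines):
    """Count page allocation failures keyed by (minute, process name)."""
    per_minute = Counter()
    for line in lines:
        match = FAILURE_RE.match(line)
        if match:
            # Drop the ":SS" suffix so events are bucketed per minute.
            minute = match.group("stamp")[:-3]
            per_minute[(minute, match.group("proc"))] += 1
    return per_minute


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python count_failures.py < /var/log/messages
    for (minute, proc), n in sorted(count_failures(sys.stdin).items()):
        print(f"{minute}  {proc}: {n}")
```

Lining the per-minute counts up against the backup schedule may show whether
the failures cluster in the heavy-write windows.  Note that suppressed
callbacks (the __ratelimit lines) are not counted, so the totals are a lower
bound.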