Hello all, I'm currently experiencing an issue with an NFS server I've built (a Dell R710 with a Dell PERC H800/LSI 2108 and four external disk trays). It's a backup target for Solaris 10, CentOS 5.5 and CentOS 6.2 servers that mount its data volume via NFS. It has two 10gig NICs set up in a layer2+3 bond for one network, and two more 10gig NICs set up the same way on another network. The host has a 99T XFS filesystem for the backups. RPCNFSDCOUNT is set to 256.
During backups from clients the system exhibits odd hangs that interfere with the backup windows of some of our sensitive systems. On the NFS server side we see the following in dmesg. Originally I thought it was related to the dirty writeback cache, but I adjusted dirty_writeback_centisecs and am still seeing the issue.
dmesg during the problem window:
Mar 16 07:01:21 *****store01 kernel: __ratelimit: 11 callbacks suppressed
Mar 16 07:01:21 *****store01 kernel: nfsd: page allocation failure. order:2, mode:0x20
Mar 16 07:01:21 *****store01 kernel: Pid: 6041, comm: nfsd Not tainted 2.6.32-220.4.2.el6.x86_64 #1
Mar 16 07:01:21 *****store01 kernel: Call Trace:
Mar 16 07:01:21 *****store01 kernel: <IRQ> [<ffffffff81123daf>] ? __alloc_pages_nodemask+0x77f/0x940
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8115dc62>] ? kmem_getpages+0x62/0x170
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8115e87a>] ? fallback_alloc+0x1ba/0x270
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8115e2cf>] ? cache_grow+0x2cf/0x320
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8115e5f9>] ? ____cache_alloc_node+0x99/0x160
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8142186a>] ? __alloc_skb+0x7a/0x180
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8115f4bf>] ? kmem_cache_alloc_node_notrace+0x6f/0x130
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8115f6fb>] ? __kmalloc_node+0x7b/0x100
Mar 16 07:01:21 *****store01 kernel: [<ffffffff81461e65>] ? ip_rcv+0x275/0x350
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8142186a>] ? __alloc_skb+0x7a/0x180
Mar 16 07:01:21 *****store01 kernel: [<ffffffff814219e6>] ? __netdev_alloc_skb+0x36/0x60
Mar 16 07:01:21 *****store01 kernel: [<ffffffffa0188104>] ? ixgbe_alloc_rx_buffers+0x2c4/0x380 [ixgbe]
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8127f980>] ? swiotlb_map_page+0x0/0x100
Mar 16 07:01:21 *****store01 kernel: [<ffffffffa0189158>] ? ixgbe_clean_rx_irq+0x818/0x8b0 [ixgbe]
Mar 16 07:01:21 *****store01 kernel: [<ffffffffa01895ff>] ? ixgbe_clean_rxtx_many+0x10f/0x220 [ixgbe]
Mar 16 07:01:21 *****store01 kernel: [<ffffffff814307c3>] ? net_rx_action+0x103/0x2f0
Mar 16 07:01:21 *****store01 kernel: [<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
Mar 16 07:01:21 *****store01 kernel: [<ffffffff810d9390>] ? handle_IRQ_event+0x60/0x170
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8107205a>] ? __do_softirq+0x11a/0x1d0
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
Mar 16 07:01:21 *****store01 kernel: [<ffffffff81071de5>] ? irq_exit+0x85/0x90
Mar 16 07:01:21 *****store01 kernel: [<ffffffff814f4c85>] ? do_IRQ+0x75/0xf0
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
Mar 16 07:01:21 *****store01 kernel: <EOI> [<ffffffff8105673f>] ? finish_task_switch+0x4f/0xe0
Mar 16 07:01:21 *****store01 kernel: [<ffffffff814ec9ce>] ? thread_return+0x4e/0x760
Mar 16 07:01:21 *****store01 kernel: [<ffffffff81123741>] ? __alloc_pages_nodemask+0x111/0x940
Mar 16 07:01:21 *****store01 kernel: [<ffffffff814ed7b2>] ? schedule_timeout+0x192/0x2e0
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8107c0a0>] ? process_timeout+0x0/0x10
Mar 16 07:01:21 *****store01 kernel: [<ffffffffa0319415>] ? svc_recv+0x5a5/0x850 [sunrpc]
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
Mar 16 07:01:21 *****store01 kernel: [<ffffffffa03fcb45>] ? nfsd+0xa5/0x160 [nfsd]
Mar 16 07:01:21 *****store01 kernel: [<ffffffffa03fcaa0>] ? nfsd+0x0/0x160 [nfsd]
Mar 16 07:01:21 *****store01 kernel: [<ffffffff81090726>] ? kthread+0x96/0xa0
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
Mar 16 07:01:21 *****store01 kernel: [<ffffffff81090690>] ? kthread+0x0/0xa0
Mar 16 07:01:21 *****store01 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
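For anyone reading the trace, the "order:2, mode:0x20" part of the failure line can be decoded with a short script. This is only a rough sketch: the 4 KiB page size and the flag values are assumptions taken from the 2.6.32-era include/linux/gfp.h (where 0x20 is __GFP_HIGH, which is what GFP_ATOMIC expands to), and `parse_failure` is a made-up helper name, not part of any tool.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86_64 base page size

# Assumption: flag bits as defined in the 2.6.32-era include/linux/gfp.h.
GFP_FLAGS = {
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}

FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def parse_failure(line):
    """Decode one 'page allocation failure' log line, or return None."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    mode = int(match.group("mode"), 16)
    pages = 1 << order  # an order-N request asks for 2**N contiguous pages
    return {
        "comm": match.group("comm"),
        "order": order,
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
        "flags": [name for bit, name in sorted(GFP_FLAGS.items()) if mode & bit],
        "may_sleep": bool(mode & 0x10),  # __GFP_WAIT clear means atomic context
    }


if __name__ == "__main__":
    sample = ("Mar 16 07:01:21 *****store01 kernel: nfsd: page allocation "
              "failure. order:2, mode:0x20")
    print(parse_failure(sample))
```

Under those assumptions the request is for four physically contiguous pages (16 KiB) that cannot sleep or reclaim, which is consistent with the allocation being made from the ixgbe receive path inside the <IRQ> section of the trace.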
xfs_info:
# xfs_info /data
meta-data=/dev/sdb1              isize=256    agcount=99, agsize=268435200 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=26367491584, imaxpct=1
         =                       sunit=256    swidth=9216 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
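As a sanity check on those numbers, here is a small sketch that converts the xfs_info geometry into bytes. The constants are copied from the output above; the function name and the reading of swidth/sunit as a count of data-bearing stripe members are my own assumptions and only hold if the filesystem was created to match the RAID layout.

```python
# Values copied from the xfs_info output; sunit and swidth are reported
# there in filesystem blocks ("blks").
BLOCK_SIZE = 4096
DATA_BLOCKS = 26367491584
SUNIT_BLOCKS = 256
SWIDTH_BLOCKS = 9216
AG_COUNT = 99
AG_SIZE_BLOCKS = 268435200

TIB = 1 << 40
MIB = 1 << 20


def xfs_geometry():
    """Convert the block-based xfs_info figures into byte-based ones."""
    total_bytes = DATA_BLOCKS * BLOCK_SIZE
    return {
        "total_bytes": total_bytes,
        "total_tib": total_bytes / TIB,
        "sunit_mib": SUNIT_BLOCKS * BLOCK_SIZE / MIB,
        "swidth_mib": SWIDTH_BLOCKS * BLOCK_SIZE / MIB,
        # Assumption: stripe width / stripe unit = data-bearing members.
        "stripe_members": SWIDTH_BLOCKS // SUNIT_BLOCKS,
        # The last allocation group may be shorter than the others.
        "ag_capacity_blocks": AG_COUNT * AG_SIZE_BLOCKS,
    }


if __name__ == "__main__":
    for key, value in xfs_geometry().items():
        print(key, value)
```

The total works out to a little over 98 TiB, which matches the "99T" figure once rounding is taken into account.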
cat /etc/centos-release: CentOS release 6.2 (Final)
uname -a: Linux *****store01 2.6.32-220.4.2.el6.x86_64 #1 SMP Tue Feb 14 04:00:16 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
lspci output:
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 13)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 13)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13)
00:04.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 4 (rev 13)
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 13)
00:06.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 6 (rev 13)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 13)
00:09.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 13)
00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Registers (rev 13)
00:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 13)
00:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 13)
00:1a.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02)
00:1a.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)
00:1d.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
00:1d.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801IB (ICH9) 2 port SATA Controller [IDE mode] (rev 02)
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
03:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
04:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
04:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
05:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
06:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
08:03.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200eW WPCM450 (rev 0a)
/etc/sysctl.conf changes:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 2621440 16777216
net.ipv4.tcp_wmem = 4096 2621440 16777216
net.core.netdev_max_backlog = 250000
net.ipv4.route.flush = 1
net.ipv4.tcp_window_scaling = 1
vm.dirty_writeback_centisecs = 50
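To make the units in those settings easier to read, here is a minimal sketch that parses the same key/value lines and converts a few of them. The settings text is copied from above; `parse_sysctl` and `summarize` are hypothetical helper names, and the script does not read or change anything on a live system.

```python
SYSCTL_TEXT = """\
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 2621440 16777216
net.ipv4.tcp_wmem = 4096 2621440 16777216
net.core.netdev_max_backlog = 250000
net.ipv4.route.flush = 1
net.ipv4.tcp_window_scaling = 1
vm.dirty_writeback_centisecs = 50
"""

MIB = 1 << 20


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def summarize(settings):
    """Convert a few of the raw values into friendlier units."""
    # tcp_rmem holds three byte counts: minimum, default, maximum.
    rmem_min, rmem_default, rmem_max = (
        int(v) for v in settings["net.ipv4.tcp_rmem"].split()
    )
    return {
        # Centiseconds are hundredths of a second.
        "writeback_interval_s": int(settings["vm.dirty_writeback_centisecs"]) / 100,
        "tcp_rmem_default_mib": rmem_default / MIB,
        "tcp_rmem_max_mib": rmem_max / MIB,
        "rmem_max_mib": int(settings["net.core.rmem_max"]) / MIB,
    }


if __name__ == "__main__":
    print(summarize(parse_sysctl(SYSCTL_TEXT)))
```

So the flusher threads wake every half second, and each TCP socket starts with a 2.5 MiB receive buffer that can grow to 16 MiB.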
Has anyone else seen similar issues? I can provide additional details about the server/configuration if anybody needs anything else. The issue only seems to occur under high write load: we've restored some of these backups and didn't have any trouble reading the data.
Thanks all, -Aaron
On Fri, Mar 16, 2012 at 01:33:54PM -0700, Aaron Blew wrote:
<snip>
The page allocation failure message made me wonder if your issue could be related to the issue I've run into here[1] on RHEL 6.2.
My issue seems to be related to NFS mounting, but it's possible the root cause could be the same?
A few other links:
https://bugzilla.redhat.com/show_bug.cgi?id=593035
http://www.spinics.net/lists/linux-nfs/msg22248.html
Red Hat has provided me with a test kernel which purportedly will resolve the issue. I haven't had a chance to test it out yet.
Ray
UPDATE
I rolled a new kernel that's identical to the stock CentOS 2.6.32-220.el6 kernel with the exception of the new idmapper being enabled. Unfortunately there's been no improvement.
Did you get a chance to try the RHEL kernel?
-Aaron
On Fri, Mar 16, 2012 at 7:01 PM, Ray Van Dolson <rayvd@bludgeon.org> wrote:
<snip>
[1] https://bugzilla.redhat.com/show_bug.cgi?id=751992
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
Hope to get it installed this weekend.
Ray
On Fri, Mar 30, 2012 at 01:33:10PM -0700, Aaron Blew wrote:
<snip>
FYI, I have been running 2.6.32-251.el6.x86_64 all weekend and thus far my issues appear to be fixed.
Ray