[CentOS] NFS Server on Centos 7.6.1810 dying without reason

Fri Aug 30 21:25:40 UTC 2019
Erick Perez - Quadrian Enterprises <eperez at quadrianweb.com>

I managed to get NFS statistics just before server rebooted.
I can reproduce this every time.
Question: Am I saturating the SATA bus on the NFS server so violently
with the SSDs that the server hard reboots?
Perhaps due to the RAID 6 nature of operations?


Server info
AMD EPYC 7281 16CORE
32 GIG RAM
6xSATA3 SSD 1.9TB in mdadm RAID6
OS: Centos 7.6
SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA
Controller [AHCI mode] (rev 51)


NFS server parameters:
/etc/sysconfig/nfs  RPCNFSDCOUNT=32
all other parameters are as default in a Centos 7.6 install

>From NFS client, share is mounted as:
10.10.10.2:/mnt/lv_vdo0 on
/rhev/data-center/mnt/10.10.10.2:_mnt_lv__vdo0 type nfs4
(rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,clientaddr=10.10.10.5,local_lock=none,addr=10.10.10.2)

1- My test:
a- mount nfs share A1 from nfs server in nfs client
b- mount nfs share B1 from SAME nfs server in SAME nfs client
c- copy file1.iso from /mnt/A1/file1.iso to /mnt/B1/file1.iso

Notes:
if file is small, tried with 1gb iso images, then no problem happens.
if file is bigger tried with 3gb iso, nfs server hard reboots
before server hard reboots, iostats show 16 nfsd worker threads and
211.03 M/s READS and 223.68 M/s WRITES sustained

2- from NFS server these are the last logs...
At 14:26:52 it is the last line before the NFS server suddenly hard reboots.

Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 1
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 1641
slot_seqid 1640
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8b6a7be080
opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8b6a7be080
opcnt 3 #3: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot
ffff9c8bd7bbd000
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #3: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot
ffff9c8bd7bba000
Aug 30 14:16:12 hvmnas01 kernel: renewing client (clientid 5d694c2d/12bb1419)
Aug 30 14:16:12 hvmnas01 kernel: nfsd_dispatch: vers 4 proc 1
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Aug 30 14:16:12 hvmnas01 kernel: __find_in_sessionid_hashtbl:
1567181869:314250265:6:0
Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 0
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 12545
slot_seqid 12544
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsd_dispatch: vers 4 proc 1
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Aug 30 14:16:12 hvmnas01 kernel: __find_in_sessionid_hashtbl:
1567181869:314250265:6:0
Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 1
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 1642
slot_seqid 1641
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8b6a7be080
opcnt 3 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8b6a7be080
opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8b6a7be080
opcnt 3 #3: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot
ffff9c8bd7bba000
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #3: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot
ffff9c8bd7bbd000
Aug 30 14:16:12 hvmnas01 kernel: renewing client (clientid 5d694c2d/12bb1419)
Aug 30 14:16:12 hvmnas01 kernel: nfsd_dispatch: vers 4 proc 1
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #1/4: 53 (OP_SEQUENCE)
Aug 30 14:16:12 hvmnas01 kernel: __find_in_sessionid_hashtbl:
1567181869:314250264:5:0
Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 0
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 8036
slot_seqid 8035
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 4 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/4: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(32: 81060001 87b2cdfe
a744333d d3d8b5a0 d514631c 04000440)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 4 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/4: 3 (OP_ACCESS)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(32: 81060001 87b2cdfe
a744333d d3d8b5a0 d514631c 04000440)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 4 #3: 3: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #4/4: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(32: 81060001 87b2cdfe
a744333d d3d8b5a0 d514631c 04000440)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 4 #4: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot
ffff9c8b38ab0000
Aug 30 14:16:12 hvmnas01 kernel: renewing client (clientid 5d694c2d/12bb1418)
Aug 30 14:16:12 hvmnas01 kernel: nfsd_dispatch: vers 4 proc 1
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Aug 30 14:16:12 hvmnas01 kernel: __find_in_sessionid_hashtbl:
1567181869:314250265:6:0
Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 0
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 12546
slot_seqid 12545
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #3: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot
ffff9c8bd7bbd000
Aug 30 14:16:12 hvmnas01 kernel: renewing client (clientid 5d694c2d/12bb1419)
Aug 30 14:16:12 hvmnas01 kernel: nfsd_dispatch: vers 4 proc 1
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Aug 30 14:16:12 hvmnas01 kernel: __find_in_sessionid_hashtbl:
1567181869:314250265:6:0
Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 0
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 12547
slot_seqid 12546
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080
opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55
5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:26:52 hvmnas01 kernel: do_IRQ: 1.35 No irq handler for vector (irq -1)
Aug 30 14:26:52 hvmnas01 rsyslogd: imjournal: 233164 messages lost due
to rate-limiting
Aug 30 14:30:01 hvmnas01 systemd: Started Session 38 of user root.
Aug 30 14:38:29 hvmnas01 journal: Runtime journal is using 8.0M (max
allowed 1.5G, trying to leave 2.3G free of 15.5G available → current
limit 1.5G).
Aug 30 14:38:29 hvmnas01 kernel: Initializing cgroup subsys cpuset
Aug 30 14:38:29 hvmnas01 kernel: Initializing cgroup subsys cpu
Aug 30 14:38:29 hvmnas01 kernel: Initializing cgroup subsys cpuacct
Aug 30 14:38:29 hvmnas01 kernel: Linux version
3.10.0-957.27.2.el7.x86_64 (mockbuild at kbuilder.bsys.centos.org) (gcc
version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Mon Jul 29
17:46:05 UTC 2019
Aug 30 14:38:29 hvmnas01 kernel: Command line:
BOOT_IMAGE=/vmlinuz-3.10.0-957.27.2.el7.x86_64
root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root
rd.lvm.lv=centos/swap rhgb quiet LANG=en_US.UTF-8 skew_tick=1
Aug 30 14:38:29 hvmnas01 kernel: e820: BIOS-provided physical RAM map:

---------------------
Erick Perez
---------------------

On Fri, Aug 30, 2019 at 11:27 AM Erick Perez - Quadrian Enterprises
<eperez at quadrianweb.com> wrote:
>
> Good morning,
> in order to post proper documentation, what logs (or log level) do I
> need to troubleshoot a Centos 7.6.1810 3.10.0-957.27.2.el7.x86_64 tha
> tis running a NFS server on top LVM on top of XFS on top of VDO on top
> of MDAM on a 6 SSD disk RAID6 ?
>
> This physical NFS server is servign 2 NFS v4.2 shares to 2 physical
> KVM virtualization hosts.
>
> When remote NFS clients start doing intensive stuff (massive writes) ,
> the NFS server crashes and sometimes make the Linux server reboot.
>
> form KVM side this is all I get: hvm002 kernel: nfs: server 10.10.10.2
> not responding, timed out
>
> thanks
>
> ---------------------
> Erick
> ---------------------