[CentOS] NFS help

Fri Oct 21 20:43:57 UTC 2016
Matt Garman <matthew.garman at gmail.com>

On Fri, Oct 21, 2016 at 4:14 AM, Larry Martell <larry.martell at gmail.com> wrote:
> We have 1 system running CentOS 7 that is the NFS server. There are 50
> external machines that FTP files to this server fairly continuously.
>
> We have another system running CentOS 6 that mounts the partition the files
> are FTP-ed to using NFS.
>
> There is a python script running on the NFS client machine that is reading
> these files and moving them to a new dir on the same file system (a mv not
> a cp).

To be clear: the python script is moving files on the same NFS file
system?  E.g., something like

    mv /mnt/nfs-server/dir1/file /mnt/nfs-server/dir2/file

where /mnt/nfs-server is the mount point of the NFS server on the
client machine?


Or are you moving files from the CentOS 7 NFS server to the CentOS 6 NFS client?

If the former, i.e., you are moving files to and from the same system,
is it possible to completely eliminate the C6 client system, and just
set up a local script on the C7 server that does the file moves?  That
would cut out a lot of complexity, and also improve performance
dramatically.

Also, what is the size range of these files?  Are they fairly small
(e.g. 10s of MB or less), medium-ish (100s of MB) or large (>1GB)?


> Almost daily this script hangs while reading a file - sometimes it never
> comes back and cannot be killed, even with -9. Other times it hangs for 1/2
> hour then proceeds on.

Timeouts relating to NFS are the worst.


> Coinciding with the hanging I see this message on the NFS server host:
>
> nfsd: peername failed (error 107)
>
> And on the NFS client host I see this:
>
> nfs: V4 server returned a bad sequence-id
> nfs state manager - check lease failed on NFSv4 server with error 5

I've been wrangling with NFS for years, but unfortunately those
particular messages don't ring a bell.

The first thing that came to my mind is: how does the Python script
running on the C6 client know that the FTP upload to the C7 server is
complete?  In other words, if someone is uploading "fileA", and the
Python script starts to move "fileA" before the upload is complete,
then at best you're setting yourself up for all kinds of confusion,
and at worst file truncation and/or corruption.
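One common mitigation (my suggestion, not something described in the
thread) is to treat a file as complete only once its size stops
changing between two polls.  A minimal sketch; the directories and the
2-second window are placeholders for illustration:

```shell
# Sketch: move a file only after its size is stable across two polls.
# Paths and the 2-second window are stand-ins; a real deployment would
# poll at an interval longer than the slowest expected upload pause.
src=$(mktemp -d); dst=$(mktemp -d)
printf 'example payload' > "$src/fileA"   # stand-in for a finished FTP upload

for f in "$src"/*; do
    size1=$(stat -c %s "$f")
    sleep 2
    size2=$(stat -c %s "$f")
    if [ "$size1" -eq "$size2" ]; then
        mv "$f" "$dst/"                   # size stable: assume upload finished
    fi
done
```

A more robust convention, if you control the uploaders, is to upload
to a temporary name and rename into place once the transfer finishes,
since a rename within the same filesystem is atomic.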

Making a pure guess about those particular errors: is there any chance
there is a network issue between the C7 server and the C6 client?
What is the connection between those two servers?  Are they physically
adjacent to each other and on the same subnet?  Or are they on
opposite ends of the globe connected through the Internet?

Clearly two machines on the same subnet, separated only by one switch,
is the simplest case (i.e. the kind of simple LAN one might have at
home).  But once you start crossing subnets, then routing configs
come into play.  And maybe you're using hostnames rather than IP
addresses directly, so then name resolution comes into play (DNS or
/etc/hosts).  And each switch hop you add requires that not only your
server network config needs to be correct, but also your switch config
needs to be correct as well.  And if you're going over the Internet,
well... I'd probably try really hard to not use NFS in that case!  :)

Do you know if your NFS mount is using TCP or UDP?  On the client you
can do something like this:

    grep nfs /proc/mounts | less -S

And then look at what the "proto=XXX" says.  I expect it will be
either "tcp" or "udp".  If it's UDP, modify your /etc/fstab so that
the options for that mountpoint include "proto=tcp".  I *think* the
default is now TCP, so this may be a non-issue.  But the point is,
based purely on the conjecture that you might have an unreliable
network, TCP would be a better fit.
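To make that check concrete, here is a sketch of pulling the transport
protocol out of a /proc/mounts entry.  The sample line below is
fabricated for illustration; on a real client you'd feed in the output
of the grep command above instead:

```shell
# Sketch: extract the proto= option from a /proc/mounts-style entry.
# The sample line is fabricated; real entries come from /proc/mounts.
line='server:/export /mnt/nfs-server nfs4 rw,relatime,vers=4.0,proto=tcp,timeo=600 0 0'
proto=$(printf '%s\n' "$line" | grep -o 'proto=[a-z]*' | cut -d= -f2)
echo "proto is $proto"
```

If you do need to force TCP, the options field of the corresponding
/etc/fstab entry would include "proto=tcp" alongside whatever other
options you already use.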

I hate to simply say "RTFM", but NFS is complex, and I still go back
and re-read the NFS man page ("man nfs").  This document is long and
very dense, but it's worth at least being familiar with its content.


> The first client message is always at the same time as the hanging starts.
> The second client message comes 20 minutes later.
> The server message comes 4 minutes after that.
> Then 3 minutes later the script un-hangs (if it's going to).

In my experience, delays that happen on consistent time intervals that
are on the order of minutes tend to smell of some kind of timeout
scenario.  So the question is, what triggers the timeout state?

> Can anyone shed any light on to what could be happening here and/or what I
> could do to alleviate these issues and stop the script from hanging?
> Perhaps some NFS config settings? We do not have any, so we are using the
> defaults.

My general rule of thumb is "defaults are generally good enough; make
changes only if you understand their implications and you know you
need them (or temporarily as a diagnostic tool)".

But anyway, my hunch is that there might be a network issue.  So I'd
actually start with basic network troubleshooting.  Do an "ifconfig"
on both machines: do you see any drops or interface errors?  Do
"ethtool <interface>" on both machines to make sure both are linked up
at the correct speed and duplex.  Use a tool like netperf to check
bandwidth between both hosts.  Look at the actual detailed stats, do
you see huge outliers or timeouts?  Do the test with both TCP and UDP;
performance should be similar, typically with a slight edge for UDP.
Do you see drops with UDP?
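To make "look for drops" concrete: here's a sketch of pulling the
dropped-packet counter out of "ip -s link"-style statistics.  The
counters below are fabricated sample text; on a real host you'd run
"ip -s link show <interface>" (or eyeball ifconfig's RX/TX error
lines) on each machine:

```shell
# Sketch: parse the "dropped" column from fabricated "ip -s link" RX stats.
# On a real host, replace $sample with output from: ip -s link show <iface>
sample='    RX: bytes  packets  errors  dropped overrun mcast
    123456789  987654   0       12      0       0'
dropped=$(printf '%s\n' "$sample" | awk 'NR==2 {print $4}')
echo "dropped=$dropped"
```

Anything nonzero and growing in the errors/dropped columns is worth
chasing before tuning NFS itself.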

What's the state of the hardware?  Are they ancient machines cobbled
together from spare parts, or reasonably decent machines?  Do they
have adequate cooling and power?  Is there any chance they are
overheating (even briefly) or possibly being fed unclean power (e.g.
small voltage aberrations)?

Oh, also, look at the load on the two machines... are these
purpose-built servers, or are they used for other numerous tasks?
Perhaps one or both is overloaded.  top is the tool we use
instinctively, but also take a look at vmstat and iostat.  Oh, also
check "free", make sure neither machine is swapping (thrashing).  If
you're not already doing this, I would recommend setting up "sar"
(from the package "sysstat") and setting up more granular logging than
the default.  sar is kind of like a continuous combination of iostat,
free, top, vmstat, and other system-load tools rolled into one, and it
continually writes its measurements to log files.  So for example,
next time this thing happens, you can look at the sar logs to see if
any particular metric went significantly out-of-whack.
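On CentOS, sysstat drives its collection from a cron file, and the
stock entry samples only every 10 minutes.  A sketch of a more
granular setup (the sa1 path below is the usual CentOS 7 x86_64
location; verify it on your systems):

    # /etc/cron.d/sysstat -- sample once per minute instead of every 10
    * * * * * root /usr/lib64/sa/sa1 1 1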

That's all I can think of for now.  Best of luck.  You have my
sympathy... I've been administering Linux both as a hobby and
professionally for longer than I care to admit, and NFS still scares
me.  Just be thankful you're not using Kerberized NFS.  ;)