[CentOS] NFS help

Thu Oct 27 08:05:21 UTC 2016
Larry Martell <larry.martell at gmail.com>

On Thu, Oct 27, 2016 at 1:03 AM, Larry Martell <larry.martell at gmail.com> wrote:
> On Wed, Oct 26, 2016 at 9:35 AM, Matt Garman <matthew.garman at gmail.com> wrote:
>> On Tue, Oct 25, 2016 at 7:22 PM, Larry Martell <larry.martell at gmail.com> wrote:
>>> Again, no machine on the internal network that my 2 CentOS hosts are
>>> on is connected to the internet. I have no way to download anything.
>>> There is an onerous and protracted process to get files into the
>>> internal network and I will see if I can get netperf in.
>>
>> Right, but do you have physical access to those machines?  Do you have
>> physical access to the machine on which you use PuTTY to connect
>> to those machines?  If yes to either question, then you can use
>> another system (that does have Internet access) to download the files
>> you want, put them on a USB drive (or burn to a CD, etc), and bring
>> the USB/CD to the C6/C7/PuTTY machines.
>
> This site is locked down like no other I have ever seen. You cannot
> bring anything into the site - no computers, no media, no phone. You
> have to empty your pockets and go through an airport type naked body
> scan.
>
>> There's almost always a technical way to get files on to (or out of) a
>> system.  :)  Now, your company might have *policies* that forbid
>> skirting around the technical measures that are in place.
>
> This is my client's client, and even if I could circumvent their
> policy I would not do that. They have a zero tolerance policy and if
> you are caught violating it you are banned for life from the company.
> And that would not make my client happy.
>
>> Here's another way you might be able to test network connectivity
>> between C6 and C7 without installing new tools: see if both machines
>> have "nc" (netcat) installed.  I've seen this tool referred to as "the
>> swiss army knife of network testing tools", and that is indeed an apt
>> description.  So if you have that installed, you can hit up the web
>> for various examples of its use.  It's designed to be easily scripted,
>> so you can write your own tests, and in theory implement something
>> similar to netperf.
>>
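
The "write your own test" idea above can be sketched without netcat, assuming only that a Python interpreter is present on both hosts (nothing in this thread confirms that). It is written to avoid features newer than Python 2.6. The function names, the 64 KiB chunk size, and the port used in the usage note are hypothetical choices for illustration. The sketch measures raw TCP send rate only, so it says nothing about NFS itself.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes per send/recv call (arbitrary choice)


def open_listener(port, bind_addr="0.0.0.0"):
    """Return a TCP socket listening on bind_addr:port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((bind_addr, port))
    listener.listen(1)
    return listener


def serve_once(listener):
    """Accept one connection, discard its data, return the byte count."""
    conn, _ = listener.accept()
    total = 0
    try:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # peer closed the connection
                break
            total += len(data)
    finally:
        conn.close()
    return total


def send_bytes(host, port, nbytes):
    """Send nbytes of zeros to host:port; return (bytes_sent, seconds).

    The clock stops when the last chunk is handed to the kernel, so the
    figure is only meaningful for transfers much larger than the socket
    buffers.
    """
    payload = b"\0" * CHUNK
    sock = socket.create_connection((host, port))
    sent = 0
    start = time.time()
    try:
        while sent < nbytes:
            n = min(CHUNK, nbytes - sent)
            sock.sendall(payload[:n])
            sent += n
    finally:
        sock.close()
    return sent, time.time() - start


def loopback_selftest(nbytes=8 * 1024 * 1024):
    """Run sender and receiver over 127.0.0.1 as a baseline check."""
    listener = open_listener(0, "127.0.0.1")  # port 0: let the OS pick
    port = listener.getsockname()[1]
    result = {}
    receiver = threading.Thread(
        target=lambda: result.update(received=serve_once(listener)))
    receiver.start()
    sent, secs = send_bytes("127.0.0.1", port, nbytes)
    receiver.join()
    listener.close()
    return sent, result["received"], secs
```

One possible use: on the receiving host run serve_once(open_listener(5001)), and on the sending host run send_bytes("<IP of other host>", 5001, 1024 ** 3), then repeat in the other direction. Port 5001 is an assumption and would have to be permitted by any firewall between the hosts.
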
>> OK, I just thought of another "poor man's" way to at least do some
>> sanity testing between C6 and C7: scp.  First generate a huge file.
>> General rule of thumb is at least 2x the amount of RAM in the C7 host.
>> You could create a tarball of /usr, for example (e.g. "tar czvf
>> /tmp/bigfile.tar.gz /usr" assuming your /tmp partition is big enough
>> to hold this).  Then, first do this: "time scp /tmp/bigfile.tar.gz
>> localhost:/tmp/bigfile_copy.tar.gz".  This will literally make a copy
>> of that big file, but will route through most of the network stack.
>> Make a note of how long it took.  And also be sure your /tmp partition
>> is big enough for two copies of that big file.
>>
>> Now, repeat that, but instead of copying to localhost, copy to the C6
>> box.  Something like: "time scp /tmp/bigfile.tar.gz <IP address of C6
>> host>:/tmp/".  Does the time reported differ greatly from when you
>> copied to localhost?  I would expect them to be reasonably close.
>> (And this is another reason why you want a fairly large file, so the
>> transfer time is dominated by actual file transfer, rather than the
>> overhead.)
>>
>> Lastly, do the reverse test: log in to the C6 box, and copy the file
>> back to C7, e.g. "time scp /tmp/bigfile.tar.gz <IP of C7
>> host>:/tmp/bigfile_copy2.tar.gz".  Again, the time should be
>> approximately the same for all three transfers.  If either or both of
>> the latter two copies take dramatically longer than the first, then
>> there's a good chance something is askew with the network config
>> between C6 and C7.
>>
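
The three-way comparison described above (localhost, C7 to C6, C6 to C7) could be recorded with a small helper like the one below. This is a sketch under assumptions: a Python interpreter is available, scp uses key-based authentication so that no time is spent at a password prompt, and "dramatically longer" is taken to mean more than 3x the localhost time. That threshold and all function names are hypothetical.

```python
import subprocess
import time


def time_command(argv):
    """Run argv to completion; return (exit_status, elapsed_seconds)."""
    start = time.time()
    status = subprocess.call(argv)
    return status, time.time() - start


def slowdown(baseline_secs, other_secs):
    """Return how many times slower other_secs is than baseline_secs."""
    if baseline_secs <= 0:
        raise ValueError("baseline must be positive")
    return float(other_secs) / float(baseline_secs)


def flag_slow(timings, baseline_label, threshold=3.0):
    """Return the labels whose time exceeds threshold x the baseline.

    timings maps a label to elapsed seconds; the 3.0 default is an
    arbitrary stand-in for "dramatically longer".
    """
    base = timings[baseline_label]
    return sorted(label for label, secs in timings.items()
                  if label != baseline_label
                  and slowdown(base, secs) > threshold)
```

For example, time_command(["scp", "/tmp/bigfile.tar.gz", "localhost:/tmp/bigfile_copy.tar.gz"]) gives the baseline, and the elapsed times from all three runs can then be passed to flag_slow() as a dict keyed by whatever labels are convenient.
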
>> Oh... all this time I've been jumping to fancy tests.  Have you tried
>> the simplest form of testing, that is, doing by hand what your scripts
>> do automatically?  In other words, simply try copying files between C6
>> and C7 using the existing NFS config?  Can you manually trigger the
>> errors/timeouts you initially posted?  Is it when copying lots of
>> small files?  Or when you copy a single huge file?  Any kind of file
>> copying "profile" you can determine that consistently triggers the
>> error?  That could be another clue.
>
> These are all good debugging techniques, and I have tried some of
> them, but I think the issue is load related. There are 50 external
> machines ftp-ing to the C7 server, 24/7, thousands of files a day. And
> on the C6 client the script that processes them is running
> continuously. It will sometimes run for 7 hours then hang, but it has
> run for as long as 3 days before hanging. I have never been able to
> reproduce the errors/hanging situation manually.
>
> And again, this is only at this site. We have the same software
> deployed at 10 different sites all doing the same thing, and it all
> works fine at all of those.

Well, I spoke too soon. The importer (the one that was initially
hanging that I came here to fix) hung after running for 20 hours. There
were no NFS errors or messages on either the client or the server.
When I restarted it, it hung after 1 minute. I restarted it again and it
hung after 20 seconds. After that, when I restarted it, it hung
immediately. Still no NFS errors or messages. I tried running the
process on the server and it worked fine. So I have to believe this is
related to nobarrier. Tomorrow I will try removing that setting, but I
am no closer to solving this and I have to leave Japan Saturday :-(

The bad disk still has not been replaced - that is supposed to happen
tomorrow, but I won't have enough time after that to draw any
conclusions.
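
Since the hangs leave no NFS errors or messages on either side, one way to get a timestamp for when the mount stops responding is a small probe run on the client next to the importer. The following is only a sketch, assuming a Python interpreter is available on the C6 client. The function names, the 10 second timeout, and the mount path in the usage note are hypothetical. The stat runs in a child process so that a stuck mount blocks the child rather than the probe. On a hard mount the child may not die when killed, so this sketch does not wait for it.

```python
from __future__ import print_function

import os
import subprocess
import sys
import time

# Code run by the child process: stat the path given as its argument.
STAT_SNIPPET = "import os, sys; os.stat(sys.argv[1])"


def probe_path(path, timeout=10.0, poll_interval=0.05):
    """stat path in a child process.

    Returns (status, seconds), where status is "ok", "error" (the stat
    failed), or "timeout" (no answer within timeout seconds).
    """
    start = time.time()
    devnull = open(os.devnull, "w")
    try:
        child = subprocess.Popen(
            [sys.executable, "-c", STAT_SNIPPET, path], stderr=devnull)
        while child.poll() is None:
            if time.time() - start > timeout:
                child.kill()  # may have no effect on a stuck hard mount
                return "timeout", time.time() - start
            time.sleep(poll_interval)
        status = "ok" if child.returncode == 0 else "error"
        return status, time.time() - start
    finally:
        devnull.close()


def watch(path, interval=30.0, rounds=None):
    """Print one timestamped probe result per interval.

    Runs forever when rounds is None.
    """
    n = 0
    while rounds is None or n < rounds:
        status, secs = probe_path(path)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print("{0} {1} {2:.3f}s".format(stamp, status, secs))
        sys.stdout.flush()
        n += 1
        time.sleep(interval)
```

Calling watch("/mnt/nfs") (a placeholder for the real mount point) with output redirected to a local file would give a record of probe latency that can be compared with the time the importer stops making progress.
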