On Thu, Oct 27, 2016 at 4:23 PM, Matt Garman <matthew.garman at gmail.com> wrote:
> On Thu, Oct 27, 2016 at 12:03 AM, Larry Martell <larry.martell at gmail.com> wrote:
>> This site is locked down like no other I have ever seen. You cannot
>> bring anything into the site - no computers, no media, no phone. You
>> ...
>> This is my client's client, and even if I could circumvent their
>> policy I would not do that. They have a zero tolerance policy and if
>> ...
>
> OK, no internet for real. :) Sorry I kept pushing this. I made an
> unflattering assumption that maybe it just hadn't occurred to you how
> to get files in or out. Sometimes there are "soft" barriers to
> bringing files in or out: they don't want it to be trivial, but want
> it to be doable if necessary. But then there are times when they
> really mean it. I thought maybe the former applied to you, but
> clearly it's the latter. Apologies.
>
>> These are all good debugging techniques, and I have tried some of
>> them, but I think the issue is load related. There are 50 external
>> machines ftp-ing to the C7 server, 24/7, thousands of files a day.
>> And on the C6 client the script that processes them is running
>> continuously. It will sometimes run for 7 hours and then hang, but
>> it has run for as long as 3 days before hanging. I have never been
>> able to reproduce the errors/hanging situation manually.
>
> If it truly is load related, I'd think you'd see something askew in
> the sar logs. But if the load tends to spike, rather than be
> continuous, the sar sampling rate may be too coarse to pick it up.
>
>> And again, this is only at this site. We have the same software
>> deployed at 10 different sites, all doing the same thing, and it all
>> works fine at all of those.
>
> Flaky hardware can also cause weird intermittent issues. I know you
> mentioned before that your hardware is fairly new/decent spec, but
> that doesn't make it immune to manufacturing defects. For example,
> imagine one voltage regulator that's ever-so-slightly out of spec. It
> happens. Bad memory is not uncommon and certainly causes all kinds of
> mysterious issues (though in my experience that tends to result in
> spontaneous reboots or hard lockups, but truly anything could happen).
>
> Ideally, you could take the system offline and run hardware
> diagnostics, but I suspect that's impossible given your restrictions
> on taking things in/out of the datacenter.
>
> On Thu, Oct 27, 2016 at 3:05 AM, Larry Martell <larry.martell at gmail.com> wrote:
>> Well I spoke too soon. The importer (the one that was initially
>> hanging that I came here to fix) hung up after running for 20 hours.
>> There were no NFS errors or messages on either the client or the
>> server. When I restarted it, it hung after 1 minute. I restarted it
>> again and it hung after 20 seconds. After that, when I restarted it,
>> it hung immediately. Still no NFS errors or messages. I tried
>> running the process on the server and it worked fine. So I have to
>> believe this is related to nobarrier. Tomorrow I will try removing
>> that setting, but I am no closer to solving this and I have to leave
>> Japan Saturday :-(
>>
>> The bad disk still has not been replaced - that is supposed to
>> happen tomorrow, but I won't have enough time after that to draw any
>> conclusions.
>
> I've seen behavior like that with disks that are on their way out...
> basically the system wants to read a block of data, and the disk
> doesn't read it successfully, so it keeps trying.
> The kind of disk, what kind of controller it's behind, raid level,
> and various other settings can all impact this phenomenon, and also
> how much detail you can see about it. You already know you have one
> bad disk, so that's kind of an open wound that may or may not be
> contributing to your bigger, unsolved problem.

The bad disk was just replaced, but I am leaving tomorrow, so it was
decided that we will run the process on the C7 server, at least for
now. I will probably have to come back here early next year and
revisit this. We are thinking of building a new system back in NY,
shipping it here, and swapping them out.

> So that makes me think, you can also do some basic disk benchmarking.
> iozone and bonnie++ are nice, but I'm guessing they're not installed
> and you don't have a means to install them. But you can use "dd" to
> do some basic benchmarking, and that's all but guaranteed to be
> installed. Similar to network benchmarking, you can do something
> like:
>
> time dd if=/dev/zero of=/tmp/testfile.dat bs=1G count=256
>
> That will generate a 256 GB file. Adjust "bs" and "count" to whatever
> makes sense. The general rule of thumb is that you want the target
> file to be at least 2x the amount of RAM in the system, to avoid
> cache effects from skewing your results. Bigger is even better if you
> have the space, as it increases the odds of hitting the "bad" part of
> the disk (if indeed that's the source of your problem).
>
> Do that on C6 and C7, and ideally, if you can, on a similar machine
> as a "control" box. Again, we're looking for outliers, hang-ups,
> timeouts, etc.
>
> +1 to Gordon's suggestion to sanity check MTU sizes.
>
> Another random possibility... By somewhat funny coincidence, we have
> some servers in Japan as well, and were recently banging our heads
> against the wall with some weird networking issues. The remote hands
> we had helping us (none of our staff was on site) claimed one or more
> fiber cables were dusty, enough that it was affecting light levels.
> They cleaned the cables and the problems went away. Anyway, if you
> have access to the switches, you should be able to check that light
> levels are within spec.

No switches - it's an internal, virtual network between the server and
the virtualized client.

> If you have the ability to take these systems offline temporarily,
> you can also run "fsck" (file system check) on the C6 and C7 file
> systems. IIRC, ext4 can do a very basic kind of check on a mounted
> filesystem, but a deeper/more comprehensive scan requires the FS to
> be unmounted. Not sure what the rules are for xfs. But C6 uses ext4
> by default, so you could probably at least run the basic check on
> that without taking the system offline.

The systems were rebooted 2 days ago and fsck was run at boot time,
and it came back clean.

Thanks for all your help with trying to solve this. It was a very
frustrating time for me - I was here for 10 days and did not really
discover anything about the problem. Hopefully running the process on
the server will keep it going and keep the customer happy. I will
update this thread when/if I revisit it.
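
P.S. For the archives, here is a rough sketch of the dd write/read
test Matt describes, in case I (or someone else) gets a chance to run
it on this box later. The file path and count below are only
placeholders - point it at the filesystem you actually want to
exercise, and adjust count so the file is at least 2x the machine's
RAM:

    # write test; direct I/O keeps the page cache from hiding a slow disk
    time dd if=/dev/zero of=/data/ddtest.dat bs=1M count=262144 oflag=direct

    # read the same file back, again bypassing the cache
    time dd if=/data/ddtest.dat of=/dev/null bs=1M iflag=direct

    rm /data/ddtest.dat

The point is to compare the times (and watch for stalls) across the C6
client, the C7 server, and a known-good control box, not the absolute
numbers.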
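
And for the MTU sanity check Gordon suggested, something along these
lines should work from the client (eth0 and the 9000-byte numbers are
just examples - use the real interface name and whatever MTU is
actually configured):

    ip link show eth0        # the mtu value should match on both ends

    # send a few full-size, non-fragmentable pings; for a 9000 MTU the
    # payload is 8972 (9000 minus 20 bytes IP header, 8 bytes ICMP header)
    ping -M do -s 8972 -c 3 <server-ip>

If those fail while smaller ones (-s 1472 for a 1500 MTU) get through,
the MTUs along the path don't agree.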