On Thu, Oct 27, 2016 at 4:23 PM, Matt Garman <matthew.garman at gmail.com> wrote:
> On Thu, Oct 27, 2016 at 12:03 AM, Larry Martell <larry.martell at gmail.com> wrote:
>> This site is locked down like no other I have ever seen. You cannot
>> bring anything into the site - no computers, no media, no phone. You
>> ...
>> This is my client's client, and even if I could circumvent their
>> policy I would not do that. They have a zero tolerance policy and if
>> ...
>
> OK, no internet for real. :) Sorry I kept pushing this. I made an
> unflattering assumption that maybe it just hadn't occurred to you how
> to get files in or out. Sometimes there are "soft" barriers to
> bringing files in or out: they don't want it to be trivial, but want
> it to be doable if necessary. But then there are times when they
> really mean it. I thought maybe the former applied to you, but
> clearly it's the latter. Apologies.
>
>> These are all good debugging techniques, and I have tried some of
>> them, but I think the issue is load related. There are 50 external
>> machines ftp-ing to the C7 server, 24/7, thousands of files a day.
>> And on the C6 client the script that processes them is running
>> continuously. It will sometimes run for 7 hours and then hang, but
>> it has run for as long as 3 days before hanging. I have never been
>> able to reproduce the errors/hanging situation manually.
>
> If it truly is load related, I'd think you'd see something askew in
> the sar logs. But if the load tends to spike, rather than be
> continuous, the sar sampling rate may be too coarse to pick it up.
>
>> And again, this is only at this site. We have the same software
>> deployed at 10 different sites, all doing the same thing, and it all
>> works fine at all of those.
>
> Flaky hardware can also cause weird intermittent issues. I know you
> mentioned before that your hardware is fairly new/decent spec, but
> that doesn't make it immune to manufacturing defects. For example,
> imagine one voltage regulator that's ever-so-slightly out of spec. It
> happens. Bad memory is not uncommon and certainly causes all kinds of
> mysterious issues (though in my experience that tends to result in
> spontaneous reboots or hard lockups, but truly anything could happen).
>
> Ideally, you could take the system offline and run hardware
> diagnostics, but I suspect that's impossible given your restrictions
> on taking things in/out of the datacenter.
>
> On Thu, Oct 27, 2016 at 3:05 AM, Larry Martell <larry.martell at gmail.com> wrote:
>> Well I spoke too soon. The importer (the one that was initially
>> hanging that I came here to fix) hung up after running for 20 hours.
>> There were no NFS errors or messages on either the client or the
>> server. When I restarted it, it hung after 1 minute. I restarted it
>> again and it hung after 20 seconds. After that, when I restarted it,
>> it hung immediately. Still no NFS errors or messages. I tried
>> running the process on the server and it worked fine. So I have to
>> believe this is related to nobarrier. Tomorrow I will try removing
>> that setting, but I am no closer to solving this and I have to leave
>> Japan Saturday :-(
>>
>> The bad disk still has not been replaced - that is supposed to
>> happen tomorrow, but I won't have enough time after that to draw any
>> conclusions.
>
> I've seen behavior like that with disks that are on their way out...
> basically the system wants to read a block of data, and the disk
> doesn't read it successfully, so it keeps trying.
> The kind of disk, what kind of controller it's behind, raid level,
> and various other settings can all impact this phenomenon, and also
> how much detail you can see about it. You already know you have one
> bad disk, so that's kind of an open wound that may or may not be
> contributing to your bigger, unsolved problem.

The bad disk was just replaced, but I am leaving tomorrow, so it was
decided that we will run the process on the C7 server, at least for
now. I will probably have to come back here early next year and
revisit this. We are thinking of building a new system back in NY,
shipping it here, and swapping them out.

> So that makes me think, you can also do some basic disk benchmarking.
> iozone and bonnie++ are nice, but I'm guessing they're not installed
> and you don't have a means to install them. But you can use "dd" to
> do some basic benchmarking, and that's all but guaranteed to be
> installed. Similar to network benchmarking, you can do something
> like:
>
> time dd if=/dev/zero of=/tmp/testfile.dat bs=1G count=256
>
> That will generate a 256 GB file. Adjust "bs" and "count" to whatever
> makes sense. The general rule of thumb is that you want the target
> file to be at least 2x the amount of RAM in the system, to avoid
> cache effects from skewing your results. Bigger is even better if you
> have the space, as it increases the odds of hitting the "bad" part of
> the disk (if indeed that's the source of your problem).
>
> Do that on C6 and C7, and ideally, if you can, on a similar machine
> as a "control" box. Again, we're looking for outliers, hang-ups,
> timeouts, etc.
>
> +1 to Gordon's suggestion to sanity check MTU sizes.
>
> Another random possibility... By somewhat funny coincidence, we have
> some servers in Japan as well, and were recently banging our heads
> against the wall with some weird networking issues. The remote hands
> we had helping us (none of our staff was on site) claimed one or more
> fiber cables were dusty, enough that it was affecting light levels.
> They cleaned the cables and the problems went away. Anyway, if you
> have access to the switches, you should be able to check that light
> levels are within spec.

No switches - it's an internal, virtual network between the server and
the virtualized client.

> If you have the ability to take these systems offline temporarily,
> you can also run "fsck" (file system check) on the C6 and C7 file
> systems. IIRC, ext4 can do a very basic kind of check on a mounted
> filesystem, but a deeper/more comprehensive scan requires the FS to
> be unmounted. Not sure what the rules are for xfs. But C6 uses ext4
> by default, so you could probably at least run the basic check on
> that without taking the system offline.

The systems were rebooted 2 days ago and fsck was run at boot time,
and it came back clean.

Thanks for all your help with trying to solve this. It was a very
frustrating time for me - I was here for 10 days and did not really
discover anything about the problem. Hopefully running the process on
the server will keep it going and keep the customer happy. I will
update this thread when/if I revisit it.
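
P.S. For the archives, here is a rough sketch of the dd write/read
test Matt describes, in case I (or someone else) gets a chance to run
it on this box later. The file path and count below are only
placeholders - point it at the filesystem you actually want to
exercise, and adjust count so the file is at least 2x the machine's
RAM:

    # write test; direct I/O keeps the page cache from hiding a slow disk
    time dd if=/dev/zero of=/data/ddtest.dat bs=1M count=262144 oflag=direct

    # read the same file back, again bypassing the cache
    time dd if=/data/ddtest.dat of=/dev/null bs=1M iflag=direct

    rm /data/ddtest.dat

The point is to compare the times (and watch for stalls) across the C6
client, the C7 server, and a known-good control box, not the absolute
numbers.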
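
And for the MTU sanity check Gordon suggested, something along these
lines should work from the client (eth0 and the 9000-byte numbers are
just examples - use the real interface name and whatever MTU is
actually configured):

    ip link show eth0        # the mtu value should match on both ends

    # send a few full-size, non-fragmentable pings; for a 9000 MTU the
    # payload is 8972 (9000 minus 20 bytes IP header, 8 bytes ICMP header)
    ping -M do -s 8972 -c 3 <server-ip>

If those fail while smaller ones (-s 1472 for a 1500 MTU) get through,
the MTUs along the path don't agree.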