On Mon, Oct 24, 2016 at 7:51 AM, mark m.roth@5-cent.us wrote:
On 10/24/16 03:52, Larry Martell wrote:
On Fri, Oct 21, 2016 at 11:42 AM, m.roth@5-cent.us wrote:
Larry Martell wrote:
On Fri, Oct 21, 2016 at 11:21 AM, m.roth@5-cent.us wrote:
Larry Martell wrote:
We have 1 system running Centos7 that is the NFS server. There are 50 external machines that FTP files to this server fairly continuously.
We have another system running Centos6 that mounts the partition the files are FTP-ed to using NFS.
<snip>
>>>> What filesystem?
<snip>
>> cat /etc/fstab on the systems, and see what they are. If either is xfs,
>> and assuming that the systems are on UPSes, then the fstab which controls
>> drive mounting on a system should have, instead of "defaults",
>> nobarrier,inode64.
>
> The server is xfs (the client is nfs). The server does have inode64
> specified, but not nobarrier.
>
>> Note that the inode64 is relevant if the filesystem is > 2TB.
>
> The file system is 51TB.
>
>> The reason I say this is that when we started rolling out CentOS 7, we
>> tried to put one of our user's home directory on one, and it was a
>> disaster. 100% repeatedly, untarring a 100M tarfile onto an nfs-mounted
>> drive took seven minutes, where before, it had taken 30 seconds. Timed.
>> It took us months to discover that NFS 4 tries to make transactions
>> atomic, which is fine if you're worrying about losing power or
>> connectivity. If you're on a UPS, and hardwired, adding the nobarrier
>> immediately brought it down to 40 seconds or so.
>
> We are not seeing a performance issue - do you think nobarrier would
> help with our lock up issue? I wanted to try it but my client did not
> want me to make any changes until we got the bad disk replaced.
> Unfortunately that will not happen until Wednesday.
Absolutely add nobarrier, and see what happens.
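For anyone following along who wants to check their own fstab for this, here is a rough Python sketch that lists xfs mounts lacking a given option. It is only an illustration: the function name and the sample device paths in the usage note are made up, and it assumes the usual whitespace-separated fstab layout (device, mount point, type, options).

```python
def xfs_mounts_missing_option(fstab_text, option="nobarrier"):
    """Return mount points of xfs entries whose option list lacks `option`."""
    missing = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # A usable entry needs at least: device, mount point, type, options.
        if len(fields) < 4:
            continue
        mountpoint, fstype, opts = fields[1], fields[2], fields[3]
        if fstype == "xfs" and option not in opts.split(","):
            missing.append(mountpoint)
    return missing
```

Calling it on text containing a hypothetical line such as `/dev/sdb1 /data xfs defaults,inode64 0 0` would report `/data`, which matches the situation described above (inode64 present, nobarrier not).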
Finally got to add nobarrier (I'll skip why it took so long), and it looks like this just caused the problem to morph a bit.
On the C7 NFS server, besides having 50 external machines ftp-ing files to it, we run 2 jobs: 1 that moves files around (called image_mover) and one that changes perms on some files (called chmod_job).
And on the C6 NFS client, besides the job that was hanging (called the importer), we also run another job (called ftp_job) that ftps files to the C6 machine. The ftp_job had never hung before, but now the importer that used to hang has not (yet) hung, and the ftp_job that had not hung before is now hanging.
But the system messages are different.
On the C7 server there is a series of messages of the form 'task blocked for >120 seconds' with a stack trace. There is one for each of the following:
nfsd, chmod_job, kworker, pure_ftpd, image_mover
In each of the stack traces they are blocked on either nfs_write or nfs_flush.
And on the C6 client there is a similar blocked message for the ftp job, blocked on nfs_flush, then the bad sequence number message I had seen before, and at that point the ftp_job hung.
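In case it helps to compare the two machines, below is a small Python sketch for tallying which tasks show up in the blocked-task messages. It assumes the kernel's usual wording, "INFO: task NAME:PID blocked for more than N seconds", which is not quoted verbatim above; the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed message shape, e.g.:
#   "INFO: task nfsd:1234 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def count_blocked_tasks(lines):
    """Count hung-task reports per task name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

Passing it the lines of the system log from each machine (for example, an open file object) gives a per-task count, which makes it easier to see whether the same set of jobs is blocked each time the hang occurs.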