[CentOS] kerberized-nfs - any experts out there?

Wed Mar 22 21:25:27 UTC 2017
James A. Peltier <jpeltier at sfu.ca>

Feel free to contact me offline if you wish.  I'll just go on record as saying that it's a bear.

----- On 22 Mar, 2017, at 12:26, Matt Garman matthew.garman at gmail.com wrote:

| Is anyone on the list using kerberized-nfs on any kind of scale?
| 
| I've been fighting with this for years.  In general, when we have
| issues with this system, they are random and/or not repeatable.  I've
| had very little luck with community support.  I hope I don't offend by
| saying that!  Rather, my belief is that these problems are very
| niche/esoteric, and so beyond the scope of typical community support.
| But I'd be delighted to be proven wrong!
| 
| So this is more of a "meta" question: anyone out there have any
| general recommendations for how to get support on what I presume are
| niche problems specific to our environment?  How is paid upstream
| support?
| 
| Just to give a little insight into our issues: we have an
| in-house-developed compute job dispatching system.  Say a user has
| 100s of analysis jobs he wants to run: he submits them to a central
| master process, which in turn dispatches them to a "farm" of >100
| compute nodes.  All these nodes have two different krb5p NFS mounts,
| to which the jobs will read and write.  So while the users can
| technically log in directly to the compute nodes, in practice they
| never do.  The only logins are the "implicit" ones that happen when the
| job dispatching system does a behind-the-scenes ssh to kick off these
| processes.
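| 
| To make that concrete, here is roughly what the dispatcher does per job
| (just a sketch; the host names, user name and paths below are made up,
| not our real ones):
| 
|   import subprocess
| 
|   def dispatch(node, user, job_cmd):
|       # Non-interactive ssh; this is the only "login" the user ever gets
|       # on the compute node.
|       return subprocess.Popen(
|           ["ssh", "-o", "BatchMode=yes", "%s@%s" % (user, node), job_cmd])
| 
|   # e.g. fan 500 jobs out across a (hypothetical) node001..node100 farm,
|   # each job reading and writing under the two krb5p NFS mounts
|   procs = [dispatch("node%03d" % (i % 100 + 1), "analyst",
|                     "/nfs/apps/run_analysis --out /nfs/results/job%d" % i)
|            for i in range(500)]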
| 
| Just to give some "flavor" to the kinds of issues we're facing, what
| tends to crop up is one of three things:
| 
|    (1) Random crashes.  These are full-on kernel trace dumps followed
| by an automatic reboot.  This was really bad under CentOS 5.  A random
| kernel upgrade magically fixed it.  It almost never happens under
| CentOS 6, but fairly frequently under CentOS 7.  (We're
| completely off CentOS 5 now, BTW.)
| 
|    (2) Permission denied issues.  I have user Kerberos tickets
| configured for 70 days.  But there is clearly some kind of
| undocumented kernel caching going on.  Looking at the Kerberos server
| logs, it looks like it "could" be a performance issue, as I see 100s
| of ticket requests within the same second when someone tries to launch
| a lot of jobs.  Many of these will fail with "permission denied" but
| if they immediately re-try, it works.  Related to this, I have been
| unable to figure out what creates and deletes the per-user
| /tmp/krb5cc_<uid>_<random> cache files.
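| 
| For what it's worth, when we chase these, I poke at the caches with
| something like the sketch below (not our production tooling; the
| /tmp/krb5cc_* pattern is just the default cache location on our boxes):
| 
|   import glob, os, subprocess
| 
|   # Walk the per-user credential caches in /tmp and ask klist whether
|   # each one still holds non-expired tickets (klist -s prints nothing
|   # and reports validity via its exit status).
|   for cache in sorted(glob.glob("/tmp/krb5cc_*")):
|       uid = os.stat(cache).st_uid
|       valid = subprocess.run(["klist", "-s", cache]).returncode == 0
|       print("%s uid=%d valid=%s" % (cache, uid, valid))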
| 
|    (3) Kerberized NFS shares getting "stuck" for one or more users.
| We have another monitoring app (in-house developed) that, among other
| things, makes periodic checks of these NFS mounts.  It does so by
| forking and doing a simple "ls" command.  This is to ensure that these
| mounts are alive and well.  Sometimes, the "ls" command gets stuck to
| the point where it can't even be killed via "kill -9".  Only a reboot
| fixes it.  But the mount is only stuck for the user running the
| monitoring app.  Or sometimes the monitoring app is fine, but an
| actual user's processes will get stuck in "D" state (uninterruptible
| sleep, i.e. waiting on I/O), but everyone else's jobs (and access to
| the kerberized NFS shares) are OK.
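| 
| The check itself is nothing fancy; it amounts to something like the
| sketch below (mount points are made up, and the timeout is added here
| for illustration; the real monitor just forks a plain "ls"):
| 
|   import subprocess
| 
|   MOUNTS = ["/mnt/krb5p_data", "/mnt/krb5p_results"]  # made-up paths
| 
|   for m in MOUNTS:
|       try:
|           # "ls" the mount with a timeout so the monitor notices a hang;
|           # note that if the child is wedged in "D" state the kill after
|           # the timeout may not take effect either.
|           subprocess.run(["ls", m], stdout=subprocess.DEVNULL,
|                          stderr=subprocess.DEVNULL, timeout=30, check=True)
|           print("%s: OK" % m)
|       except subprocess.TimeoutExpired:
|           print("%s: HUNG (ls did not return within 30s)" % m)
|       except subprocess.CalledProcessError:
|           print("%s: ls failed (permission denied?)" % m)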
| 
| This is actually blocking us from upgrading to CentOS 7.  But my
| colleagues and I are at a loss how to solve this.  So this post is
| really more of a semi-desperate plea for any kind of advice.  What
| other resources might we consider?  Paid support is not out of the
| question (within reason).  Are there any "super specialist"
| consultants out there who deal in Kerberized NFS?
| 
| Thanks!
| Matt
| _______________________________________________
| CentOS mailing list
| CentOS at centos.org
| https://lists.centos.org/mailman/listinfo/centos

-- 
James A. Peltier
IT Services - Research Computing Group
Simon Fraser University - Burnaby Campus
Phone   : 604-365-6432
Fax     : 778-782-3045
E-Mail  : jpeltier at sfu.ca
Website : http://www.sfu.ca/itservices
Twitter : @sfu_rcg
Powering Engagement Through Technology