[CentOS] kerberized-nfs - any experts out there?

Wed Mar 22 19:26:53 UTC 2017
Matt Garman <matthew.garman at gmail.com>

Is anyone on the list using kerberized-nfs on any kind of scale?

I've been fighting with this for years.  In general, when we have
issues with this system, they are random and/or not repeatable.  I've
had very little luck with community support.  I hope I don't offend by
saying that!  Rather, my belief is that these problems are very
niche/esoteric, and so beyond the scope of typical community support.
But I'd be delighted to be proven wrong!

So this is more of a "meta" question: anyone out there have any
general recommendations for how to get support on what I presume are
niche problems specific to our environment?  How is paid upstream

Just to give a little insight into our issues: we have an
in-house-developed compute job dispatching system.  Say a user has
100s of analysis jobs he wants to run; he submits them to a central
master process, which in turn dispatches them to a "farm" of >100
compute nodes.  All these nodes have two different krb5p NFS mounts,
to which the jobs will read and write.  So while the users can
technically log in directly to the compute nodes, in practice they
never do.  The logins are only "implicit" when the job dispatching
system does a behind-the-scenes ssh to kick off these processes.

Just to give some "flavor" to the kinds of issues we're facing, what
tends to crop up is one of three things:

    (1) Random crashes.  These are full-on kernel trace dumps followed
by an automatic reboot.  This was really bad under CentOS 5.  A random
kernel upgrade magically fixed it.  It almost never happens under
CentOS 6, but it happens fairly frequently under CentOS 7.  (We're
completely off CentOS 5 now, BTW.)

    (2) Permission denied issues.  I have user Kerberos tickets
configured for 70 days.  But there is clearly some kind of
undocumented kernel caching going on.  Looking at the Kerberos server
logs, it looks like it "could" be a performance issue, as I see 100s
of ticket requests within the same second when someone tries to launch
a lot of jobs.  Many of these will fail with "permission denied", but
if they immediately retry, it works.  Related to this, I have been
unable to figure out what creates and deletes the
/tmp/krb5cc_uid_random files.
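
Since an immediate retry usually succeeds in (2), one possible stopgap
is to wrap file access in the jobs with a retry that waits a short,
randomized time, so that 100s of jobs do not all hit the Kerberos
server again in the same instant.  The following is only a sketch
under that assumption: the helper name retry_on_permission_denied and
the attempt/delay values are hypothetical, not taken from the
dispatching system described above, and it masks the symptom rather
than explaining it.

```python
import random
import time


def retry_on_permission_denied(operation, attempts=5, base_delay=0.1,
                               sleep=time.sleep):
    """Call operation(); retry with jittered backoff on PermissionError.

    Hypothetical helper: the name and default values are illustrative
    assumptions, not part of any existing tool.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except PermissionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # Random delay of up to base_delay * 2**attempt seconds, so
            # that many simultaneous jobs do not all retry at once.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

A job would then wrap its first access to the mount, for example
retry_on_permission_denied(lambda: open(path, "rb").read()), where
path is a file on one of the krb5p mounts.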

    (3) Kerberized NFS shares getting "stuck" for one or more users.
We have another monitoring app (in-house developed) that, among other
things, makes periodic checks of these NFS mounts.  It does so by
forking and doing a simple "ls" command.  This is to ensure that these
mounts are alive and well.  Sometimes, the "ls" command gets stuck to
the point where it can't even be killed via "kill -9".  Only a reboot
fixes it.  But the mount is only stuck for the user running the
monitoring app.  Or sometimes the monitoring app is fine, but an
actual user's processes will get stuck in "D" state (which in top
means waiting on I/O), while everyone else's jobs (and access to the
Kerberized NFS shares) are OK.
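
The kind of check described in (3) can be sketched so that a hung
"ls" does not take the monitor down with it: run the command as a
child with a timeout, and if it times out, send a kill but do not wait
for the child to exit.  This is a minimal sketch, not the actual
monitoring app; the function name check_mount, the result strings, and
the default timeout are assumptions made for illustration.

```python
import subprocess


def check_mount(argv, timeout_s=10.0):
    """Run argv (e.g. ["ls", "/some/mount"]) and classify the result.

    Returns "ok", "failed", or "stuck".  Hypothetical helper.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do NOT wait for the child: a process blocked
        # in uninterruptible sleep ("D" state) will not exit until the
        # I/O returns, and waiting here would hang the monitor too.
        proc.kill()
        return "stuck"
    return "ok" if returncode == 0 else "failed"
```

Calling check_mount(["ls", mount_point]) for each mount would then
report "stuck" instead of blocking forever.  Note that "stuck" only
means the timeout expired, so a merely slow server can trigger it as
well.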

This is actually blocking us from upgrading to CentOS 7.  But my
colleagues and I are at a loss as to how to solve this.  So this post is
really more of a semi-desperate plea for any kind of advice.  What
other resources might we consider?  Paid support is not out of the
question (within reason).  Are there any "super specialist"
consultants out there who deal in Kerberized NFS?