[CentOS] kerberized-nfs - any experts out there?

Wed Mar 22 20:19:21 UTC 2017
m.roth at 5-cent.us <m.roth at 5-cent.us>

Matt Garman wrote:
> Is anyone on the list using kerberized-nfs on any kind of scale?

We use it here. I don't think I'm an expert - my manager is - but let me
think about your issues.
> Just to give a little insight into our issues: we have an
> in-house-developed compute job dispatching system.  Say a user has
> 100s of analysis jobs he wants to run, he submits them to a central
> master process, which in turn dispatches them to a "farm" of >100
> compute nodes.  All these nodes have two different krb5p NFS mounts,
> to which the jobs will read and write.  So while the users can
> technically log in directly to the compute nodes, in practice they
> never do.  The logins are only "implicit" when the job dispatching
> system does a behind-the-scenes ssh to kick off these processes.

I would strongly recommend that you look into Slurm. It's used here at
both large and small scale, and is designed explicitly for that purpose.

> Just to give some "flavor" to the kinds of issues we're facing, what
> tends to crop up is one of three things:
>     (1) Random crashes.  These are full-on kernel trace dumps followed
> by an automatic reboot.  This was really bad under CentOS 5.  A random
> kernel upgrade magically fixed it.  It happens almost never under
> CentOS 6.  But happens fairly frequently under CentOS 7.  (We're
> completely off CentOS 5 now, BTW.)

This may be a separate issue.

>     (2) Permission denied issues.  I have user Kerberos tickets
> configured for 70 days.  But there is clearly some kind of
> undocumented kernel caching going on.  Looking at the Kerberos server
> logs, it looks like it "could" be a performance issue, as I see 100s
> of ticket requests within the same second when someone tries to launch
> a lot of jobs.  Many of these will fail with "permission denied" but
> if they immediately re-try, it works.  Related to this, I have been
> unable to figure out what creates and deletes the
> /tmp/krb5cc_uid_random files.

Are they asking for *new* credentials each time? They should only be doing
one kinit.
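
To make the "one kinit" point concrete, here is a minimal Python sketch of a
dispatcher-side guard that only asks for credentials when the cache has no
valid ticket. It relies on `klist -s`, which exits 0 when the default
credential cache holds an unexpired ticket. The function names and the
`acquire` callback are hypothetical; I don't know how your dispatcher
actually obtains tickets, so treat this as an illustration rather than a fix.

```python
import subprocess


def has_valid_ticket(klist_cmd=("klist", "-s")):
    """Return True if the check command exits 0.

    With the default command, `klist -s` exits 0 only when the default
    credential cache is readable and holds an unexpired ticket.
    """
    try:
        result = subprocess.run(
            list(klist_cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # klist is not installed, so we cannot confirm a ticket.
        return False
    return result.returncode == 0


def ensure_ticket(acquire, check=has_valid_ticket):
    """Call acquire() only when check() reports no valid ticket.

    `acquire` is a hypothetical callback (for example, something that runs
    kinit against a keytab). Returns True if acquire() was called.
    """
    if check():
        return False
    acquire()
    return True
```

Calling `ensure_ticket(...)` once per user session, rather than once per job,
would at least show whether the burst of requests in the KDC log comes from
the dispatcher or from somewhere else.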
>     (3) Kerberized NFS shares getting "stuck" for one or more users.
> We have another monitoring app (in-house developed) that, among other
> things, makes periodic checks of these NFS mounts.  It does so by
> forking and doing a simple "ls" command.  This is to ensure that these
> mounts are alive and well.  Sometimes, the "ls" command gets stuck to
> the point where it can't even be killed via "kill -9".  Only a reboot
> fixes it.  But the mount is only stuck for the user running the
> monitoring app.  Or sometimes the monitoring app is fine, but an
> actual user's processes will get stuck in "D" state (which in top means
> waiting on I/O), but everyone else's jobs (and access to the kerberized
> NFS shares) are OK.

And there's nothing in the logs, correct? Have you tried attaching strace
to one of those to see if you can get a clue as to what's happening?
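
As a complement to strace, a small Python sketch like the following could
list the processes that are in "D" state and the kernel function each one is
sleeping in, read from /proc/<pid>/status and /proc/<pid>/wchan. The function
names are my own, and it assumes a Linux /proc; wchan may just read "0" on
kernels that do not expose the symbol.

```python
import os


def process_state(status_text):
    """Extract the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            fields = line.split()
            return fields[1] if len(fields) > 1 else None
    return None


def find_uninterruptible(proc_root="/proc"):
    """Return sorted (pid, wchan) pairs for every process in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "status")) as f:
                state = process_state(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = ""  # wchan unreadable; report the pid anyway
        stuck.append((int(entry), wchan))
    return sorted(stuck)
```

Running `find_uninterruptible()` from the monitoring app when the "ls" check
times out, and logging the result, might show whether the stuck processes are
all waiting in the same place.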