Matt Garman wrote:
Is anyone on the list using kerberized-nfs on any kind of scale?
We use it here. I don't think I'm an expert - my manager is - but let me think about your issues. <snip>
Just to give a little insight into our issues: we have an in-house-developed compute job dispatching system. Say a user has hundreds of analysis jobs to run: he submits them to a central master process, which in turn dispatches them to a "farm" of more than 100 compute nodes. All of these nodes have two different krb5p NFS mounts, which the jobs read from and write to. So while users can technically log in directly to the compute nodes, in practice they never do. The logins are only "implicit", happening when the job dispatching system does a behind-the-scenes ssh to kick off these processes.
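Roughly speaking, the dispatch step boils down to something like the sketch below; the node names, share path, and job command are simplified stand-ins, not our real code:

    #!/bin/bash
    # Heavily simplified stand-in for the dispatcher; names and paths
    # are placeholders.
    JOB_CMD="/opt/analysis/run_job --input /mnt/krb5p_share/input/case42"

    for node in node001 node002 node003; do
        # Non-interactive ssh: no prompt, no real login session. Whatever
        # credentials the job uses to touch the krb5p mounts have to come
        # from this implicit login (delegated by ssh or already present
        # on $node); nothing in this path runs kinit by hand.
        ssh -o BatchMode=yes "$node" "$JOB_CMD" &
    done
    wait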
I would strongly recommend that you look into Slurm. It's used here at both large and small scale, and it's designed for exactly that purpose.
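A user's hundreds of jobs become a single job array, and Slurm's daemons start the tasks on the nodes for you, so there's no hand-rolled ssh step. A minimal batch script looks roughly like this (the partition name, paths, and job command are placeholders, not anything from your setup):

    #!/bin/bash
    #SBATCH --job-name=analysis
    #SBATCH --array=1-100              # the user's 100 jobs as one array
    #SBATCH --partition=compute        # placeholder partition name
    #SBATCH --output=logs/%A_%a.out    # per-task stdout/stderr

    # Each task still reads and writes the krb5p mounts, so the same
    # ticket questions apply; the path below is only an example.
    /opt/analysis/run_job --input /mnt/krb5p_share/input/case_${SLURM_ARRAY_TASK_ID}

You submit it with "sbatch analysis.sbatch", and squeue/sacct give you per-task state instead of a home-grown master process.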
Just to give some "flavor" to the kinds of issues we're facing: what tends to crop up is one of three things:
(1) Random crashes. These are full-on kernel trace dumps followed by an automatic reboot. This was really bad under CentOS 5, until a kernel upgrade magically fixed it. It almost never happens under CentOS 6, but it happens fairly frequently under CentOS 7. (We're completely off CentOS 5 now, BTW.)
This may be a separate issue.
(2) Permission denied issues. I have user Kerberos tickets configured for 70 days, but there is clearly some kind of undocumented kernel caching going on. Looking at the Kerberos server logs, it looks like it "could" be a performance issue: I see hundreds of ticket requests within the same second when someone launches a lot of jobs. Many of these fail with "permission denied", but an immediate retry works. Related to this, I have been unable to figure out what creates and deletes the /tmp/krb5cc_uid_random files.
Are they asking for *new* credentials each time? They should only be doing one kinit.
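Those /tmp/krb5cc_<uid>_<random> files are usually created per login session, either by the PAM Kerberos stack (pam_krb5 or sssd) or by sshd storing delegated GSSAPI credentials, so hundreds of behind-the-scenes ssh logins in one second producing hundreds of caches and a burst of KDC traffic would be consistent with what you're seeing. A quick way to check is to look at what's actually in those caches; this is just a sketch assuming MIT krb5 client tools, and the cache name below is an example (the random suffix varies):

    # List the per-user caches in /tmp and their timestamps
    ls -l /tmp/krb5cc_*

    # Principal, issue time, expiry, and flags for one specific cache
    # (substitute a real name from the ls above)
    klist -f -c /tmp/krb5cc_1000_AbCdEf

    # On a compute node, watch rpc.gssd build GSS contexts as jobs land;
    # stop the normally-running instance first, then run it in the
    # foreground with extra verbosity:
    rpc.gssd -f -vvv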
(3) Kerberized NFS shares getting "stuck" for one or more users.
We have another in-house-developed monitoring app that, among other things, makes periodic checks of these NFS mounts. It does so by forking and running a simple "ls" command, to ensure that the mounts are alive and well. Sometimes the "ls" gets stuck to the point where it can't even be killed via "kill -9"; only a reboot fixes it, yet the mount is stuck only for the user running the monitoring app. Other times the monitoring app is fine, but an actual user's processes get stuck in "D" state (uninterruptible IO wait, as shown in top) while everyone else's jobs, and their access to the Kerberized NFS shares, are OK.
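Stripped down, the check is essentially just the following; the path is an example, and the real app forks and runs this once per mount:

    #!/bin/bash
    # Hypothetical, stripped-down version of the monitoring probe.
    MOUNT=/mnt/krb5p_share    # example path

    if ls "$MOUNT" >/dev/null 2>&1; then
        echo "OK: $MOUNT"
    else
        echo "FAIL: $MOUNT" >&2
    fi
    # When the mount wedges, the ls never returns: it sits in "D"
    # (uninterruptible sleep) inside the NFS/RPC code, so even kill -9
    # isn't acted on until the I/O completes or the box is rebooted.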
And there's nothing in the logs, correct? Have you tried attaching strace to one of those processes to see if you can get a clue as to what's happening? <snip>
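Something along these lines, run as root on the affected node, where <pid> is the stuck ls or user process:

    # Attach and follow children; -tt adds timestamps to each syscall
    strace -f -tt -p <pid> -o /tmp/stuck.strace

    # If the process is truly in D state, strace may hang on attach too;
    # the kernel-side stack is usually more telling:
    cat /proc/<pid>/stack
    grep State /proc/<pid>/status

    # Dump every blocked (D state) task's stack to dmesg/syslog
    # (needs sysrq enabled, e.g. kernel.sysrq=1):
    echo w > /proc/sysrq-trigger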
mark