Matt Garman wrote:
Is anyone on the list using kerberized-nfs on any kind of scale?
We use it here. I don't think I'm an expert - my manager is - but let me think about your issues. <snip>
Just to give a little insight into our issues: we have an in-house-developed compute job dispatching system. Say a user has hundreds of analysis jobs to run: he submits them to a central master process, which in turn dispatches them to a "farm" of more than 100 compute nodes. All of these nodes have two different krb5p NFS mounts, which the jobs read from and write to. So while users can technically log in directly to the compute nodes, in practice they never do. The logins are only "implicit", happening when the job dispatching system does a behind-the-scenes ssh to kick off these processes.
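Roughly speaking, the dispatch step boils down to something like the sketch below; the node names, share path, and job command are simplified stand-ins, not our real code:

    #!/bin/bash
    # Heavily simplified stand-in for the dispatcher; names and paths
    # are placeholders.
    JOB_CMD="/opt/analysis/run_job --input /mnt/krb5p_share/input/case42"

    for node in node001 node002 node003; do
        # Non-interactive ssh: no prompt, no real login session. Whatever
        # credentials the job uses to touch the krb5p mounts have to come
        # from this implicit login (delegated by ssh or already present
        # on $node); nothing in this path runs kinit by hand.
        ssh -o BatchMode=yes "$node" "$JOB_CMD" &
    done
    wait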
I would strongly recommend that you look into Slurm. It's used here at both large and small scale, and it's designed for exactly that purpose.
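A user's hundreds of jobs become a single job array, and Slurm's daemons start the tasks on the nodes for you, so there's no hand-rolled ssh step. A minimal batch script looks roughly like this (the partition name, paths, and job command are placeholders, not anything from your setup):

    #!/bin/bash
    #SBATCH --job-name=analysis
    #SBATCH --array=1-100              # the user's 100 jobs as one array
    #SBATCH --partition=compute        # placeholder partition name
    #SBATCH --output=logs/%A_%a.out    # per-task stdout/stderr

    # Each task still reads and writes the krb5p mounts, so the same
    # ticket questions apply; the path below is only an example.
    /opt/analysis/run_job --input /mnt/krb5p_share/input/case_${SLURM_ARRAY_TASK_ID}

You submit it with "sbatch analysis.sbatch", and squeue/sacct give you per-task state instead of a home-grown master process.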
Just to give some "flavor" to the kinds of issues we're facing: what tends to crop up is one of three things:
(1) Random crashes. These are full-on kernel trace dumps followed by an automatic reboot. This was really bad under CentOS 5, until a kernel upgrade magically fixed it. It almost never happens under CentOS 6, but it happens fairly frequently under CentOS 7. (We're completely off CentOS 5 now, BTW.)
This may be a separate issue.
(2) Permission denied issues. I have user Kerberos tickets configured for 70 days, but there is clearly some kind of undocumented kernel caching going on. Looking at the Kerberos server logs, it looks like it "could" be a performance issue: I see hundreds of ticket requests within the same second when someone launches a lot of jobs. Many of these fail with "permission denied", but an immediate retry works. Related to this, I have been unable to figure out what creates and deletes the /tmp/krb5cc_uid_random files.
Are they asking for *new* credentials each time? They should only be doing one kinit.
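Those /tmp/krb5cc_<uid>_<random> files are usually created per login session, either by the PAM Kerberos stack (pam_krb5 or sssd) or by sshd storing delegated GSSAPI credentials, so hundreds of behind-the-scenes ssh logins in one second producing hundreds of caches and a burst of KDC traffic would be consistent with what you're seeing. A quick way to check is to look at what's actually in those caches; this is just a sketch assuming MIT krb5 client tools, and the cache name below is an example (the random suffix varies):

    # List the per-user caches in /tmp and their timestamps
    ls -l /tmp/krb5cc_*

    # Principal, issue time, expiry, and flags for one specific cache
    # (substitute a real name from the ls above)
    klist -f -c /tmp/krb5cc_1000_AbCdEf

    # On a compute node, watch rpc.gssd build GSS contexts as jobs land;
    # stop the normally-running instance first, then run it in the
    # foreground with extra verbosity:
    rpc.gssd -f -vvv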
(3) Kerberized NFS shares getting "stuck" for one or more users.
We have another in-house-developed monitoring app that, among other things, makes periodic checks of these NFS mounts. It does so by forking and running a simple "ls" command, to ensure that the mounts are alive and well. Sometimes the "ls" gets stuck to the point where it can't even be killed via "kill -9"; only a reboot fixes it, yet the mount is stuck only for the user running the monitoring app. Other times the monitoring app is fine, but an actual user's processes get stuck in "D" state (uninterruptible IO wait, as shown in top) while everyone else's jobs, and their access to the Kerberized NFS shares, are OK.
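Stripped down, the check is essentially just the following; the path is an example, and the real app forks and runs this once per mount:

    #!/bin/bash
    # Hypothetical, stripped-down version of the monitoring probe.
    MOUNT=/mnt/krb5p_share    # example path

    if ls "$MOUNT" >/dev/null 2>&1; then
        echo "OK: $MOUNT"
    else
        echo "FAIL: $MOUNT" >&2
    fi
    # When the mount wedges, the ls never returns: it sits in "D"
    # (uninterruptible sleep) inside the NFS/RPC code, so even kill -9
    # isn't acted on until the I/O completes or the box is rebooted.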
And there's nothing in the logs, correct? Have you tried attaching strace to one of those processes to see if you can get a clue as to what's happening? <snip>
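Something along these lines, run as root on the affected node, where <pid> is the stuck ls or user process:

    # Attach and follow children; -tt adds timestamps to each syscall
    strace -f -tt -p <pid> -o /tmp/stuck.strace

    # If the process is truly in D state, strace may hang on attach too;
    # the kernel-side stack is usually more telling:
    cat /proc/<pid>/stack
    grep State /proc/<pid>/status

    # Dump every blocked (D state) task's stack to dmesg/syslog
    # (needs sysrq enabled, e.g. kernel.sysrq=1):
    echo w > /proc/sysrq-trigger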
mark