Is anyone on the list using kerberized-nfs on any kind of scale?
I've been fighting with this for years. In general, when we have issues with this system, they are random and/or not repeatable. I've had very little luck with community support. I hope I don't offend by saying that! Rather, my belief is that these problems are very niche/esoteric, and so beyond the scope of typical community support. But I'd be delighted to be proven wrong!
So this is more of a "meta" question: anyone out there have any general recommendations for how to get support on what I presume are niche problems specific to our environment? How is paid upstream support?
Just to give a little insight into our issues: we have an in-house-developed compute job dispatching system. Say a user has 100s of analysis jobs he wants to run: he submits them to a central master process, which in turn dispatches them to a "farm" of >100 compute nodes. All these nodes have two different krb5p NFS mounts, which the jobs read from and write to. So while the users can technically log in directly to the compute nodes, in practice they never do. The logins are only "implicit" when the job dispatching system does a behind-the-scenes ssh to kick off these processes.
Just to give some "flavor" to the kinds of issues we're facing, what tends to crop up are one of three things:
(1) Random crashes. These are full-on kernel trace dumps followed by an automatic reboot. This was really bad under CentOS 5. A random kernel upgrade magically fixed it. It happens almost never under CentOS 6, but fairly frequently under CentOS 7. (We're completely off CentOS 5 now, BTW.)
(2) Permission denied issues. I have user Kerberos tickets configured for 70 days. But there is clearly some kind of undocumented kernel caching going on. Looking at the Kerberos server logs, it looks like it "could" be a performance issue, as I see 100s of ticket requests within the same second when someone tries to launch a lot of jobs. Many of these will fail with "permission denied" but if they immediately re-try, it works. Related to this, I have been unable to figure out what creates and deletes the /tmp/krb5cc_uid_random files.
(3) Kerberized NFS shares getting "stuck" for one or more users. We have another monitoring app (in-house developed) that, among other things, makes periodic checks of these NFS mounts. It does so by forking and doing a simple "ls" command. This is to ensure that these mounts are alive and well. Sometimes, the "ls" command gets stuck to the point where it can't even be killed via "kill -9". Only a reboot fixes it. But the mount is only stuck for the user running the monitoring app. Or sometimes the monitoring app is fine, but an actual user's processes will get stuck in "D" state (in top, that means waiting on IO), while everyone else's jobs (and access to the kerberized NFS shares) are OK.
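A rough sketch of the kind of poking around that seems relevant when one of these hangs shows up (the PID is obviously a placeholder):

    # list processes stuck in uninterruptible sleep ("D" state)
    ps -eo pid,stat,user,wchan:30,cmd | awk '$2 ~ /D/'

    # for a stuck PID, see where it is blocked in the kernel (needs root;
    # if it really is an NFS hang you'd expect nfs/rpc/gss functions here)
    cat /proc/<pid>/stack

    # per-mount NFS op counters, to see whether anything is still moving
    cat /proc/self/mountstats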
This is actually blocking us from upgrading to CentOS 7. But my colleagues and I are at a loss how to solve this. So this post is really more of a semi-desperate plea for any kind of advice. What other resources might we consider? Paid support is not out of the question (within reason). Are there any "super specialist" consultants out there who deal in Kerberized NFS?
Thanks! Matt
Matt Garman wrote:
Is anyone on the list using kerberized-nfs on any kind of scale?
We use it here. I don't think I'm an expert - my manager is - but let me think about your issues. <snip>
Just to give a little insight into our issues: we have an in-house-developed compute job dispatching system. Say a user has 100s of analysis jobs he wants to run: he submits them to a central master process, which in turn dispatches them to a "farm" of >100 compute nodes. All these nodes have two different krb5p NFS mounts, which the jobs read from and write to. So while the users can technically log in directly to the compute nodes, in practice they never do. The logins are only "implicit" when the job dispatching system does a behind-the-scenes ssh to kick off these processes.
I would strongly recommend that you look into Slurm. It's being used here at both large and small scale, and it exists explicitly for that purpose.
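Very roughly, the "100s of analysis jobs" case becomes a single job array there; something like this (untested sketch, and the script name, chunk scheme and time limit are made up):

    #!/bin/bash
    #SBATCH --job-name=analysis
    #SBATCH --array=1-200              # one array instead of 100s of ssh's
    #SBATCH --output=logs/%A_%a.out    # %A = array job id, %a = task id
    #SBATCH --time=02:00:00

    # each array task works on its own chunk
    ./run_analysis --chunk "${SLURM_ARRAY_TASK_ID}"

You submit it once with "sbatch analysis.sbatch" and Slurm handles the dispatching and the per-node bookkeeping.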
Just to give some "flavor" to the kinds of issues we're facing, what tends to crop up are one of three things:
(1) Random crashes. These are full-on kernel trace dumps followed
by an automatic reboot. This was really bad under CentOS 5. A random kernel upgrade magically fixed it. It happens almost never under CentOS 6, but fairly frequently under CentOS 7. (We're completely off CentOS 5 now, BTW.)
This may well be a separate issue.
(2) Permission denied issues. I have user Kerberos tickets
configured for 70 days. But there is clearly some kind of undocumented kernel caching going on. Looking at the Kerberos server logs, it looks like it "could" be a performance issue, as I see 100s of ticket requests within the same second when someone tries to launch a lot of jobs. Many of these will fail with "permission denied" but if they immediately re-try, it works. Related to this, I have been unable to figure out what creates and deletes the /tmp/krb5cc_uid_random files.
Are they asking for *new* credentials each time? They should only be doing one kinit.
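In other words, somewhere in the dispatch path I'd expect at most something like this (just a sketch; the keytab path is invented):

    # reuse an existing, still-valid cache; only get tickets if there isn't one
    if ! klist -s; then
        kinit -k -t /path/to/user.keytab "${USER}"
    fi
    klist   # shows which cache file is in use and when the TGT expires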
(3) Kerberized NFS shares getting "stuck" for one or more users.
We have another monitoring app (in-house developed) that, among other things, makes periodic checks of these NFS mounts. It does so by forking and doing a simple "ls" command. This is to ensure that these mounts are alive and well. Sometimes, the "ls" command gets stuck to the point where it can't even be killed via "kill -9". Only a reboot fixes it. But the mount is only stuck for the user running the monitoring app. Or sometimes the monitoring app is fine, but an actual user's processes will get stuck in "D" state (in top, that means waiting on IO), while everyone else's jobs (and access to the kerberized NFS shares) are OK.
And there's nothing in the logs, correct? Have you tried attaching strace to one of those, and see if you can get a clue as to what's happening? <snip>
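Something along these lines is where I'd start (the PIDs are placeholders):

    # follow the stuck ls and any children, with timestamps, to see
    # which syscall never returns
    strace -f -tt -T -p <pid-of-stuck-ls> -o /tmp/stuck-ls.trace

    # watching rpc.gssd at the same time can show whether the kernel's
    # upcall ever gets answered
    strace -f -tt -p "$(pidof rpc.gssd)" -o /tmp/gssd.trace

Caveat: if the process really is wedged in D state, strace will just show it parked in one syscall (or hang while attaching); cat /proc/<pid>/stack is often more informative in that case.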
mark
On Wed, Mar 22, 2017 at 3:19 PM, m.roth@5-cent.us wrote:
Matt Garman wrote:
(2) Permission denied issues. I have user Kerberos tickets
configured for 70 days. But there is clearly some kind of undocumented kernel caching going on. Looking at the Kerberos server logs, it looks like it "could" be a performance issue, as I see 100s of ticket requests within the same second when someone tries to launch a lot of jobs. Many of these will fail with "permission denied" but if they immediately re-try, it works. Related to this, I have been unable to figure out what creates and deletes the /tmp/krb5cc_uid_random files.
Are they asking for *new* credentials each time? They should only be doing one kinit.
Well, that's what I don't understand. In practice, I don't believe a user should ever have to explicitly do kinit, as their credentials/tickets are implicitly created (and forwarded) via ssh. Despite that, I see the /tmp/krb5cc_uid files accumulating over time. But I've tried testing this, and I haven't been able to determine exactly what creates those files. And I don't understand why new krb5cc_uid files are created when there is an existing, valid file already. Clearly some programs ignore existing files, and some create new ones.
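One thing on my to-try list is an audit watch on /tmp, so I can at least see which executable is creating and deleting the caches; roughly this (untested):

    # watch writes/attribute changes under /tmp, tagged with a key
    # (noisy on a busy /tmp, so probably only for a short test window)
    auditctl -w /tmp -p wa -k krb5cc_watch

    # ...reproduce a job launch, then look for the exe= field on events
    # whose PATH record is /tmp/krb5cc_...
    ausearch -k krb5cc_watch -i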
And there's nothing in the logs, correct? Have you tried attaching strace to one of those, and see if you can get a clue as to what's happening?
Actually, I get this in the log:
Mar 22 13:25:09 daemon.err lnxdev108 rpc.gssd[19329]: WARNING: handle_gssd_upcall: failed to find uid in upcall string 'mech=krb5'
Thanks, Matt
Feel free to contact me offline if you wish. I'll just go on record as saying that it's a bear.
----- On 22 Mar, 2017, at 12:26, Matt Garman matthew.garman@gmail.com wrote:
<snip>
On 03/22/2017 03:26 PM, Matt Garman wrote:
Is anyone on the list using kerberized-nfs on any kind of scale?
Not for a good many years.
Are you using v3 or v4 NFS?
Also, you can probably stuff the rpc.gss* and idmapd services into verbose mode, which may give you a better idea as to what's going on.
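On CentOS 7 that's roughly the following (from memory, so double-check the file and variable names):

    # /etc/sysconfig/nfs
    RPCGSSDARGS="-vvv"
    RPCIDMAPDARGS="-vvv"

    # then pick up the new arguments:
    systemctl restart nfs-config rpc-gssd nfs-idmapd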
And yes, the kernel does some kerberos caching. I think 10 to 15 minutes.
On Wed, Mar 22, 2017 at 6:11 PM, John Jasen jjasen@realityfailure.org wrote:
On 03/22/2017 03:26 PM, Matt Garman wrote:
Is anyone on the list using kerberized-nfs on any kind of scale?
Not for a good many years.
Are you using v3 or v4 NFS?
v4. I think you can only do kerberized NFS with v4.
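(For reference, the mounts look roughly like this; the server and export names are placeholders:)

    # fstab entry
    nfsserver:/export/data  /mnt/data  nfs4  sec=krb5p,hard  0 0

    # or equivalently by hand:
    mount -t nfs4 -o sec=krb5p,hard nfsserver:/export/data /mnt/data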
Also, you can probably stuff the rpc.gss* and idmapd services into verbose mode, which may give you a better idea as to what's going on.
I do that. The logs are verbose, but generally too cryptic for me to make sense of. Web searches on the errors yield results at best 50% of the time, and the hits almost never have a solution.
And yes, the kernel does some kerberos caching. I think 10 to 15 minutes.
To me it looks like it's more on the order of an hour. For example, a simple test I've done is to do a "fresh" login on a server. The server has just been rebooted, and with the reboot, all the /tmp/krb5cc* files were deleted.
I log in via ssh, which implicitly establishes my Kerberos tickets. I deliberately do a "kdestroy". Then I have a simple shell loop like this:
while [ 1 ] ; do date ; ls ; sleep 30s ; done
Which is just doing an ls on my home directory, which is a kerberized NFS mount. Despite having done a kdestroy, this works, presumably from cached credentials. And it continues to work for *about* an hour, and then I start getting permission denied. I emphasized "about" because it's not precisely one hour, but seems to range from maybe 55 to 65 minutes.
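One knob that looks related (I haven't tested it, and may be misreading the man page) is rpc.gssd's kernel-context timeout:

    # /etc/sysconfig/nfs -- ask rpc.gssd to renegotiate kernel GSS
    # contexts every 30 minutes rather than letting them age out on
    # their own schedule
    RPCGSSDARGS="-vvv -t 1800"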
But that's a super-simple, controlled test. What happens when you add screen multiplexers (tmux, GNU screen) into the mix? What if you log in "fresh" via password versus having your GSS (Kerberos) credentials forwarded? What if you're logged in multiple times on the same machine via different methods?
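About the only comparison I know how to make there is to check, in each session (ssh login, tmux pane, screen window, etc.), which cache it is actually using:

    echo "${KRB5CCNAME:-unset}"      # which cache this session points at
    klist                            # is it valid, and when does it expire?
    ls -l /tmp/krb5cc_"$(id -u)"*    # all cache files owned by this uid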