Feel free to contact me offline if you wish. I'll just go on record as saying that it's a bear.
----- On 22 Mar, 2017, at 12:26, Matt Garman matthew.garman@gmail.com wrote:
| Is anyone on the list using kerberized-nfs on any kind of scale?
|
| I've been fighting with this for years. In general, when we have
| issues with this system, they are random and/or not repeatable. I've
| had very little luck with community support. I hope I don't offend by
| saying that! Rather, my belief is that these problems are very
| niche/esoteric, and so beyond the scope of typical community support.
| But I'd be delighted to be proven wrong!
|
| So this is more of a "meta" question: anyone out there have any
| general recommendations for how to get support on what I presume are
| niche problems specific to our environment? How is paid upstream
| support?
|
| Just to give a little insight into our issues: we have an
| in-house-developed compute job dispatching system. Say a user has
| 100s of analysis jobs he wants to run, he submits them to a central
| master process, which in turn dispatches them to a "farm" of >100
| compute nodes. All these nodes have two different krb5p NFS mounts,
| to which the jobs will read and write. So while the users can
| technically log in directly to the compute nodes, in practice they
| never do. The logins are only "implicit" when the job dispatching
| system does a behind-the-scenes ssh to kick off these processes.
|
| Just to give some "flavor" to the kinds of issues we're facing, what
| tends to crop up is one of three things:
|
| (1) Random crashes. These are full-on kernel trace dumps followed
| by an automatic reboot. This was really bad under CentOS 5. A random
| kernel upgrade magically fixed it. It happens almost never under
| CentOS 6. But happens fairly frequently under CentOS 7. (We're
| completely off CentOS 5 now, BTW.)
|
| (2) Permission denied issues. I have user Kerberos tickets
| configured for 70 days. But there is clearly some kind of
| undocumented kernel caching going on. Looking at the Kerberos server
| logs, it looks like it "could" be a performance issue, as I see 100s
| of ticket requests within the same second when someone tries to launch
| a lot of jobs. Many of these will fail with "permission denied" but
| if they immediately re-try, it works. Related to this, I have been
| unable to figure out what creates and deletes the
| /tmp/krb5cc_uid_random files.
|
| (3) Kerberized NFS shares getting "stuck" for one or more users.
| We have another monitoring app (in-house developed) that, among other
| things, makes periodic checks of these NFS mounts. It does so by
| forking and doing a simple "ls" command. This is to ensure that these
| mounts are alive and well. Sometimes, the "ls" command gets stuck to
| the point where it can't even be killed via "kill -9". Only a reboot
| fixes it. But the mount is only stuck for the user running the
| monitoring app. Or sometimes the monitoring app is fine, but an
| actual user's processes will get stuck in "D" state (in top, means
| waiting on IO), but everyone else's jobs (and access to the kerberized
| NFS shares) are OK.
|
| This is actually blocking us from upgrading to CentOS 7. But my
| colleagues and I are at a loss how to solve this. So this post is
| really more of a semi-desperate plea for any kind of advice. What
| other resources might we consider? Paid support is not out of the
| question (within reason). Are there any "super specialist"
| consultants out there who deal in Kerberized NFS?
|
| Thanks!
| Matt
| _______________________________________________
| CentOS mailing list
| CentOS@centos.org
| https://lists.centos.org/mailman/listinfo/centos
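
For item (3), one thing that might help narrow down which users are affected is to list the processes currently in "D" state along with their owning UID, rather than waiting for a stuck "ls" to reveal it. What follows is only a rough sketch, assuming a Linux /proc filesystem; the function names parse_stat and d_state_processes are made up for illustration and are not part of any existing tool.

```python
import os


def parse_stat(stat_text):
    """Parse the start of a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' rather than on spaces.
    """
    head, _, tail = stat_text.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state


def d_state_processes(proc_root="/proc"):
    """Return (pid, uid, comm) for each process in uninterruptible sleep."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        proc_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(proc_dir, "stat"), errors="replace") as f:
                pid, comm, state = parse_stat(f.read())
            uid = os.stat(proc_dir).st_uid  # owner of the process directory
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat line was unreadable
        if state == "D":
            stuck.append((pid, uid, comm))
    return stuck


if __name__ == "__main__":
    for pid, uid, comm in d_state_processes():
        print(f"pid={pid} uid={uid} comm={comm}")
```

Reading /proc/<pid>/stat should not touch the NFS mount itself, so a check like this should not get stuck the way the "ls" probe does. Grouping the output by UID would show whether the hang is limited to one user's credentials.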