[CentOS] kerberized-nfs - any experts out there?
James A. Peltier
jpeltier at sfu.ca
Wed Mar 22 21:25:27 UTC 2017
Feel free to contact me offline if you wish. I'll just go on record as saying that it's a bear.
----- On 22 Mar, 2017, at 12:26, Matt Garman matthew.garman at gmail.com wrote:
| Is anyone on the list using kerberized-nfs on any kind of scale?
|
| I've been fighting with this for years. In general, when we have
| issues with this system, they are random and/or not repeatable. I've
| had very little luck with community support. I hope I don't offend by
| saying that! Rather, my belief is that these problems are very
| niche/esoteric, and so beyond the scope of typical community support.
| But I'd be delighted to be proven wrong!
|
| So this is more of a "meta" question: anyone out there have any
| general recommendations for how to get support on what I presume are
| niche problems specific to our environment? How is paid upstream
| support?
|
| Just to give a little insight into our issues: we have an
| in-house-developed compute job dispatching system. Say a user has
| 100s of analysis jobs to run. He submits them to a central
| master process, which in turn dispatches them to a "farm" of >100
| compute nodes. All these nodes have two different krb5p NFS mounts,
| to which the jobs will read and write. So while the users can
| technically log in directly to the compute nodes, in practice they
| never do. The logins are only "implicit" when the job dispatching
| system does a behind-the-scenes ssh to kick off these processes.
|
| Just to give some "flavor" to the kinds of issues we're facing, what
| tends to crop up is one of three things:
|
| (1) Random crashes. These are full-on kernel trace dumps followed
| by an automatic reboot. This was really bad under CentOS 5. A random
| kernel upgrade magically fixed it. It almost never happens under
| CentOS 6, but it happens fairly frequently under CentOS 7. (We're
| completely off CentOS 5 now, BTW.)
|
| (2) Permission denied issues. I have user Kerberos tickets
| configured for 70 days. But there is clearly some kind of
| undocumented kernel caching going on. Looking at the Kerberos server
| logs, it looks like it "could" be a performance issue, as I see 100s
| of ticket requests within the same second when someone tries to launch
| a lot of jobs. Many of these will fail with "permission denied" but
| if they immediately re-try, it works. Related to this, I have been
| unable to figure out what creates and deletes the
| /tmp/krb5cc_uid_random files.
|
| (3) Kerberized NFS shares getting "stuck" for one or more users.
| We have another monitoring app (in-house developed) that, among other
| things, makes periodic checks of these NFS mounts. It does so by
| forking and doing a simple "ls" command. This is to ensure that these
| mounts are alive and well. Sometimes, the "ls" command gets stuck to
| the point where it can't even be killed via "kill -9". Only a reboot
| fixes it. But the mount is only stuck for the user running the
| monitoring app. Or sometimes the monitoring app is fine, but an
| actual user's processes will get stuck in "D" state (uninterruptible
| sleep in top, usually waiting on I/O), but everyone else's jobs (and
| access to the kerberized NFS shares) are OK.
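
A minimal sketch of the kind of check described in (3), assuming Python 3 is available on the nodes. The function name probe_mount, the 10-second timeout, and the three result strings are hypothetical and are not taken from the in-house monitoring app. The point it illustrates is that the monitor should never do a blocking wait on the child: a process in "D" state ignores SIGKILL, so a wait without a timeout would hang the monitor too.

```python
import subprocess


def probe_mount(path, timeout_s=10.0, cmd=("ls",)):
    """Run one listing of `path` and return "ok", "error", or "stuck".

    Hypothetical helper: the name, the default timeout, and the result
    strings are illustrative assumptions.
    """
    proc = subprocess.Popen(
        [*cmd, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # keep the child out of the monitor's session
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: SIGKILL has no effect while the child is in "D"
        # state. Deliberately do NOT wait() again here, or the monitor
        # would block on the stuck child.
        proc.kill()
        return "stuck"
    return "ok" if rc == 0 else "error"
```

A caller would invoke probe_mount once per mount point and raise an alert on "stuck". An abandoned child stays in the process table until the mount recovers or the node is rebooted, so this only keeps the monitor itself responsive; it does not unstick the mount.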
|
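
Item (2) says that a job which fails with "permission denied" succeeds if it immediately retries. A minimal sketch of that workaround follows, assuming the failing step is a Python callable that raises PermissionError (EACCES). The helper name and the default values are hypothetical. The random jitter is there so that hundreds of jobs launched in the same second do not all retry at the same instant. This only masks the symptom and says nothing about the underlying cause.

```python
import random
import time


def retry_on_permission_denied(action, attempts=5, base_delay_s=0.5,
                               sleep=time.sleep):
    """Call action(); on PermissionError, retry with jittered backoff.

    Hypothetical helper: name and defaults are illustrative assumptions.
    Re-raises the last PermissionError once `attempts` calls have failed.
    """
    for attempt in range(attempts):
        try:
            return action()
        except PermissionError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base, 2*base, 4*base, ... plus random
            # jitter of up to the same amount again.
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

For example, a job wrapper could call retry_on_permission_denied(lambda: open(path, "rb").read()), where path is a file on one of the krb5p mounts.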
| This is actually blocking us from upgrading to CentOS 7. But my
| colleagues and I are at a loss as to how to solve this. So this post is
| really more of a semi-desperate plea for any kind of advice. What
| other resources might we consider? Paid support is not out of the
| question (within reason). Are there any "super specialist"
| consultants out there who deal in Kerberized NFS?
|
| Thanks!
| Matt
| _______________________________________________
| CentOS mailing list
| CentOS at centos.org
| https://lists.centos.org/mailman/listinfo/centos
--
James A. Peltier
IT Services - Research Computing Group
Simon Fraser University - Burnaby Campus
Phone : 604-365-6432
Fax : 778-782-3045
E-Mail : jpeltier at sfu.ca
Website : http://www.sfu.ca/itservices
Twitter : @sfu_rcg
Powering Engagement Through Technology