On Tue, July 18, 2017 8:01 am, Jonathan Billings wrote:
> On Sun, Jul 16, 2017 at 06:02:15PM +0100, Pete Biggs wrote:
>> >
>> > The physicists and mathematicians who compute there need long uptimes.
>>
>> Yes. I too run HPC clusters and I have had uptimes of over 1000 days -
>> clusters that are turned on when they are delivered and turned off when
>> they are obsolete. It is crucial for long-running calculations that you
>> have a stable OS - you have never seen wrath like a computational
>> scientist whose 200-day calculation has just failed because you needed
>> to reboot the node it was running on.
>
> I too was an HPC admin, and I knew people who believed the above, and
> their clusters were compromised. You're running a service where the
> weakest links are the researchers who use your cluster -- they're able
> to run code on your nodes, so local exploits are possible. They often
> have poor security practices (sharing passwords, reusing them across
> multiple accounts).
>
> Also, if your researchers can't write code that performs checkpoints,
> they're going to be awfully unhappy when a bug in their code makes it
> segfault 199 days into a 200-day run.
>
> Scheduled downtime and rolling cluster upgrades are a necessity of
> HPC cluster administration. I do wish that the ksplice/kpatch stuff
> were available in CentOS.

Thanks, Jonathan! Before your reply I had a bad feeling that I was the
only one in this world who still respects security considerations...

The only thing is: I still shy away from ksplice/kpatch, and reboot
machines instead of patching the running kernel on the fly.

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++
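
Checkpoint/restart, as Jonathan describes it, can be as simple as periodically
serializing job state so that a reboot (or a segfault) costs only the work done
since the last save. A minimal Python sketch of the idea; the file name, state
layout, and step counts here are illustrative assumptions, not anything from
the thread:

    # Minimal checkpoint/restart sketch (illustrative only; the checkpoint
    # file name and state layout are assumptions, not from the thread).
    import os
    import pickle

    CHECKPOINT = "simulation.ckpt"   # hypothetical checkpoint file

    def load_state():
        """Resume from the last checkpoint if one exists, else start fresh."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "accumulator": 0.0}

    def save_state(state):
        """Write the checkpoint to a temp file, then rename into place."""
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT)  # atomic rename: old checkpoint survives a crash mid-write

    def run(total_steps=200, checkpoint_every=10):
        state = load_state()
        for step in range(state["step"], total_steps):
            state["accumulator"] += step * 0.5   # stand-in for one unit of real work
            state["step"] = step + 1
            if state["step"] % checkpoint_every == 0:
                save_state(state)                # a reboot now loses at most checkpoint_every steps
        save_state(state)
        return state

    if __name__ == "__main__":
        print(run())

The write-to-temp-then-rename pattern matters in practice: os.replace is an
atomic rename on POSIX filesystems, so a node going down in the middle of a
checkpoint leaves the previous checkpoint intact rather than a half-written
file.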