On Sun, Jul 16, 2017 at 06:02:15PM +0100, Pete Biggs wrote:
>> The physicists and mathematicians who do their computing there need long uptimes.
> Yes. I too run HPC clusters and I have had uptimes of over 1000 days - clusters that are turned on when they are delivered and turned off when they are obsolete. It is crucial for long-running calculations that you have a stable OS - you have never seen wrath like that of a computational scientist whose 200-day calculation has just failed because you needed to reboot the node it was running on.
I too was an HPC admin, and I knew people who believed the above, and their clusters were compromised. You're running a service where the weakest link is the researchers who use your cluster: they can run code on your nodes, so local exploits are a real possibility, and they often have poor security practices (sharing passwords, reusing them across multiple accounts).
Also, if your researchers can't write code that checkpoints its state, they're going to be awfully unhappy when a bug in their code makes it segfault 199 days into a 200-day run.
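To make that concrete, here is a minimal checkpoint/restart sketch in Python. The file name, checkpoint interval, step count, and the advance() function are placeholders for whatever the real calculation does; the point is just that the state gets written out periodically, so a crash or a node reboot only costs you the work since the last checkpoint:

import os
import pickle

CHECKPOINT = "state.pkl"     # hypothetical checkpoint file
SAVE_EVERY = 1000            # iterations between checkpoints
TOTAL_STEPS = 1_000_000

def advance(x):
    # Stand-in for one step of the real calculation.
    return x + 1

def load_state():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "value": 0}

def save_state(state):
    # Write-then-rename so a crash mid-write can't corrupt the last
    # good checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

state = load_state()
while state["step"] < TOTAL_STEPS:
    state["value"] = advance(state["value"])
    state["step"] += 1
    if state["step"] % SAVE_EVERY == 0:
        save_state(state)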
Scheduled downtime and rolling cluster upgrades are a necessity of HPC cluster administration. I do wish the ksplice/kpatch stuff were available in CentOS.
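For the rolling upgrades, something along these lines is what I have in mind, assuming a Slurm-managed cluster with passwordless SSH and sudo to the nodes. The node names, wait times, and the patch command are placeholders, not a recommendation of a particular procedure: drain a node, let its jobs finish, patch and reboot it, put it back in service, move on to the next one.

import subprocess
import time

NODES = ["node01", "node02", "node03"]   # hypothetical node names

def scontrol(*args):
    subprocess.run(["scontrol", "update", *args], check=True)

def jobs_on(node):
    # squeue -w filters to jobs running on that node; empty output
    # means the node has drained.
    out = subprocess.run(["squeue", "-h", "-w", node],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

for node in NODES:
    # Stop new jobs landing on the node; running jobs finish normally.
    scontrol(f"NodeName={node}", "State=DRAIN", "Reason=rolling kernel update")

    while jobs_on(node):
        time.sleep(300)            # wait for the last job to finish

    # Patch, then reboot (the reboot drops the ssh session, so no check here).
    subprocess.run(["ssh", node, "sudo yum -y update"], check=True)
    subprocess.run(["ssh", node, "sudo reboot"])

    time.sleep(600)                # crude wait for the node to come back up
    scontrol(f"NodeName={node}", "State=RESUME")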