On Tue, 18 Jul 2017 09:01:07 -0400 Jonathan Billings <billings@negate.org> wrote:
On Sun, Jul 16, 2017 at 06:02:15PM +0100, Pete Biggs wrote:
The physicists and mathematicians who compute there need long uptimes.
Yes. I too run HPC clusters and I have had uptimes of over 1000 days - clusters that are turned on when they are delivered and turned off when they are obsolete. It is crucial for long-running calculations that you have a stable OS - you have never seen wrath like that of a computational scientist whose 200-day calculation has just failed because you needed to reboot the node it was running on.
I too was an HPC admin, and I knew people who believed the above, and their clusters were compromised. You're running a service where the weakest link is the researchers who use your cluster -- they're able to run code on your nodes, so local exploits are possible. They often have poor security practices (sharing passwords, reusing them across multiple accounts).
I work at a quite large HPC site and fully agree.
HPC resources possibly need smarter and more active security work than your average server.
With 1000+ users who can compile and run jobs, get their credentials misplaced, etc., we typically move even faster than CentOS updates to fix, half-patch, or mitigate security vulnerabilities.
/Peter