[CentOS] Thanks to every one

Tue Jul 18 14:20:15 UTC 2017
Valeri Galtsev <galtsev at kicp.uchicago.edu>

On Tue, July 18, 2017 8:01 am, Jonathan Billings wrote:
> On Sun, Jul 16, 2017 at 06:02:15PM +0100, Pete Biggs wrote:
>> >
>> > The physicists and mathematicians who run calculations there need long run times.
>>
>> Yes. I too run HPC clusters and I have had uptimes of over 1000 days -
>> clusters that are turned on when they are delivered and turned off when
>> they are obsolete. It is crucial for long running calculations that you
>> have a stable OS - you have never seen wrath like a computational
>> scientist whose 200 day calculation has just failed because you needed
>> to reboot the node it was running on.
>
> I too was an HPC admin, and I knew people who believed the above, and
> their clusters were compromised.  You're running a service where the
> weakest link is the researchers who use your cluster -- they're able
> to run code on your nodes, so local exploits are possible.  They often
> have poor security practices (share passwords, use them for multiple
> accounts).
>
> Also, if your researchers can't write code that performs checkpoints,
> they're going to be awfully unhappy when a bug in their code makes it
> segfault 199 days into a 200 day run.
>
> Scheduled downtime and rolling cluster upgrades are a necessity of
> HPC cluster administration.  I do wish that the ksplice/kpatch stuff
> was available in CentOS.

Thanks, Jonathan! Before your reply I had a bad feeling that I'm the only
one in this world who still respects security considerations... The only
thing is: I still shy away from ksplice/kpatch, and do reboot machines
instead of patching the running kernel on the fly.

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++