[CentOS] Thanks to every one

Tue Jul 18 14:20:15 UTC 2017
Valeri Galtsev <galtsev at kicp.uchicago.edu>

On Tue, July 18, 2017 8:01 am, Jonathan Billings wrote:
> On Sun, Jul 16, 2017 at 06:02:15PM +0100, Pete Biggs wrote:
>> >
>> > The physicists and mathematicians who count there need high durations.
>> Yes. I too run HPC clusters and I have had uptimes of over 1000 days -
>> clusters that are turned on when they are delivered and turned off when
>> they are obsolete. It is crucial for long running calculations that you
>> have a stable OS - you have never seen wrath like a computational
>> scientist whose 200 day calculation has just failed because you needed
>> to reboot the node it was running on.
> I too was an HPC admin, and I knew people who believed the above, and
> their clusters were compromised.  You're running a service where the
> weakest link is the researchers who use your cluster -- they're able
> to run code on your nodes, so local exploits are possible.  They often
> have poor security practices (share passwords, use them for multiple
> accounts).
> Also, if your researchers can't write code that performs checkpoints,
> they're going to be awfully unhappy when a bug in their code makes it
> segfault 199 days into a 200 day run.
> Scheduled downtime and rolling cluster upgrades are a necessity of
> HPC cluster administration.  I do wish that the ksplice/kpatch stuff
> was available in CentOS.

Thanks, Jonathan! Before your reply I had a bad feeling that I'm the only
one in this world who still respects security considerations... The only
thing is: I still shy away from ksplice/kpatch, and do reboot machines
instead of patching the running kernel on the fly.


Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247