[CentOS-devel] HPC SIG: IT4Innovations Ostrava - Supercomputer visit notes

Mon Jun 19 22:23:23 UTC 2017
Mark Hahn <hahn at mcmaster.ca>

forgive me if this thread is considered OT - I think it contains some 
cross-cultural insights that may be valuable.

> Supercomputer parameter info is available at [1] - a mixture of Xeon, Xeon 
> Phi and graphics-card cores. Two clusters - Anselm runs RHEL 6, Salomon runs 
> CentOS 6.

It's quite common to find a mixture of node configurations - 
sometimes within a single cluster, sometimes partitioned.
Running a commercial Linux is relatively uncommon, though, since
most centres prefer CentOS or perhaps Scientific Linux.  I haven't 
found vendor expertise particularly helpful in HPC situations,
especially considering that a non-small centre really must maintain 
its own skilled personnel.

> Each node runs on a specific piece of a hardware. If a project is built on a 
> builder of a different hardware, it is built with different flags and 
> configuration and the resulting binaries are not properly optimized. Some

This matters sometimes.  There is a continuum of codes, from those that 
don't really care about optimization (by which I mean machine-specific 
tuning) to those where it makes a lot of difference.  A convenient middle 
ground is generic apps that select optimized matrix libraries.  But for an 
app that is not vector-friendly, those things don't matter much.

> project needs to be built with a proprietary compiler that is not publicly 
> available. Or given each project needs different version of system libraries

The phenomenon is really that various tools/libraries/pipelines come from 
a variety of development organizations, which vary in how aggressively
they pursue recent versions.  The worst are sloppy coders who don't follow
standards and so require extremely specific versions of many component 
packages ("only foo-3.14 is known to work").  That's the main motive
for containerization (and, to some extent, solutions like environment
modules and Nix).  There really is no such thing as "system libraries"
when you view software this way...
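As a sketch of how environment modules handle that pinning: each version
gets its own modulefile that just rewrites the relevant search paths (the
foo/3.14 name and install prefix here are hypothetical):

```tcl
#%Module1.0
## Hypothetical modulefile for the pinned foo-3.14 mentioned above.
## Loading it prepends foo's own directories, so the "system" paths
## never matter to the application.
conflict        foo
prepend-path    PATH             /apps/foo/3.14/bin
prepend-path    LD_LIBRARY_PATH  /apps/foo/3.14/lib
prepend-path    MANPATH          /apps/foo/3.14/share/man
```

A user then runs "module load foo/3.14"; switching versions is just
loading a different modulefile, with no change to the OS image.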

> architecture so it can use all the available instructions not to slow down 
> the computation.

Besides vector length (ie, SSE vs AVX vs AVX-512), there have not been many 
public, solid demonstrations that compiler- or flag-tweaking matters much.
The main scaling dimension is "more nodes", and that may be driven by memory 
footprint as much as by CPU speed, so few people sweat the flags to deliver 
some 7.34% improvement in single-core performance (in my experience.)
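For what it's worth, the flag-tweaking in question usually amounts to
something like the following (a Makefile sketch; the target names and the
choice of gcc flags are illustrative, not a recommendation):

```makefile
CC = gcc
# Baseline: runs on any x86-64 node in a mixed cluster.
CFLAGS_GENERIC = -O2
# Tuned: emits AVX/AVX-512 etc. for the build host's CPU only;
# the binary may trap with an illegal instruction on older nodes.
CFLAGS_NATIVE  = -O2 -march=native

generic: app.c
	$(CC) $(CFLAGS_GENERIC) -o app-generic app.c

native: app.c
	$(CC) $(CFLAGS_NATIVE) -o app-native app.c
```

The hardware-specific build matters mainly for vector-friendly inner
loops; for everything else the generic binary is usually fine.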

> - Puppet and Ansible used to deploy clusters (Ansible for disk-free nodes, 
> Puppet for nodes with disks). Still, each deployment is unique (e.g.

I claim that HPC clusters do not benefit much from these tools: their 
main value is in handling widely diverse environments that need to change
frequently, whereas most HPC clusters are just rack after rack of the same
nodes that need to behave the same this year as last year.  (My organization 
uses oneSIS, which permits stateless nodes running a read-only NFS root;
to upgrade a package, you just "chroot /var/lib/oneSIS/image yum update foo".)

> - Experiments with containers as well via Singularity [3] (Docker is not 
> fully supported on CentOS 6, needs privileged user account)

The key point is that Docker is simply inappropriate for HPC, where 
the norm is a large, shared cluster, and jobs are not anything like 
Docker's raison d'etre (webservers, redis, etc).  In a sense, the normal
cluster scheduler already automates resource management (memory, cores,
GPUs), and jobs don't need, e.g., exposed IP addresses.  Your storage is a 
quota in a many-PB /project filesystem, not dynamically provisioned S3
buckets.
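To make the contrast concrete, a typical batch job already declares its
resources to the scheduler; a minimal Slurm sketch (the project path,
binary, and image names are hypothetical), which can wrap a Singularity
container just as easily as a bare binary:

```bash
#!/bin/bash
#SBATCH --job-name=sim
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --mem=64G            # scheduler enforces the memory limit
#SBATCH --gres=gpu:2         # and hands out GPUs - no Docker needed
#SBATCH --time=12:00:00

# Storage is a quota'd shared filesystem, not provisioned per job.
cd /project/mygroup/run42

# Either a bare MPI binary...
srun ./app-native
# ...or the same thing inside a Singularity image:
# srun singularity exec app.sif ./app-native
```

Everything Docker would orchestrate (placement, limits, devices) is
already the scheduler's job here.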

> - HPC community is unfortunately security free,  security fix deployment can

That's a bit of an overstatement.  Most published fixes are simply
irrelevant, since very little desktop-ish user space is even present
on compute nodes (firefox, for instance).  Kernel updates may be delayed 
by constraints in drivers or out-of-tree filesystems.  Mitigation
(like simply disabling SCTP) is common.  And the concern for repeatability
makes it unattractive to apply even, for instance, glibc updates.

> -  Each assigned set of nodes is expected to be vanilla new. Given it  takes

This varies between centres - we don't reboot compute nodes voluntarily,
and a single node may be shared by jobs from multiple users.  You trade 
increased utilization/efficiency against some exposure to security and 
performance interference.  It's really a question of how diverse your 
user base is...

regards, mark hahn.