<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div id="magicdomid4" class=""><span
class="author-g-312iv59qbixk1uuz122z">Sharing notes from a visit
to the IT4Innovations center in Ostrava.<br>
<br>
Supercomputer hardware parameters are available at [1]: a mixture of
Xeon, Xeon Phi and graphics card (GPU) cores. There are two clusters: Anselm
runs RHEL 6, Salomon runs CentOS 6.</span></div>
<div id="magicdomid5" class=""><br>
</div>
<div id="magicdomid164" class="ace-line"><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">Each node runs on a
specific piece of hardware. If a project is built on a builder
with different hardware, it is built with different flags and
configuration, and the resulting binaries are not properly
optimized. Some projects need special compiler features in
order to run efficiently, and some of those features are available
only in the latest compiler versions. By the time the latest
versions reach the builders, it is usually too late. Alternatively,
a project may need to be built with a proprietary compiler that is not
publicly available. Finally, since each project generally needs a
different version of system libraries, the HPC infrastructure
needs to offer various versions of the same library. </span></div>
<div id="magicdomid165" class="ace-line"><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">The software needs to
be available in many flavors: the same software has to be
built with multiple compilers and in multiple versions. Different
versions have different properties and features, and different compilers
emphasize or provide different optimizations. The result is a matrix of
software builds to offer to the users. In the end, the software needs
to be built for the end-system architecture so that it can use all
the available instructions and does not slow down the computation.</span></div>
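<div class="ace-line"><span>To make the size of that matrix concrete,
here is a minimal Python sketch that enumerates the builds. Every name in it
(the package, the compiler labels, the architecture labels and the /apps
install prefix) is a hypothetical placeholder, not something used at
IT4Innovations, and the path layout is only one possible convention.</span></div>
<pre>
```python
from itertools import product

# Hypothetical inputs; none of these names or versions come from the notes.
SOFTWARE_VERSIONS = {"examplelib": ["1.0", "2.0"]}
COMPILERS = ["gcc-9", "vendor-cc-21"]
TARGET_ARCHS = ["cpu-avx2", "many-core"]


def build_matrix(software_versions, compilers, target_archs):
    """Return every (name, version, compiler, arch) combination to build."""
    matrix = []
    for name, versions in sorted(software_versions.items()):
        for version, compiler, arch in product(versions, compilers, target_archs):
            matrix.append((name, version, compiler, arch))
    return matrix


def install_prefix(name, version, compiler, arch):
    """Build a hypothetical per-flavor install path so flavors can coexist."""
    return "/".join(["/apps", arch, name, version + "-" + compiler])


if __name__ == "__main__":
    for entry in build_matrix(SOFTWARE_VERSIONS, COMPILERS, TARGET_ARCHS):
        print(install_prefix(*entry))
```
</pre>
<div class="ace-line"><span>Even this toy input (one package, two versions,
two compilers, two architectures) yields eight separate builds, which
illustrates why the number of flavors grows quickly.</span></div>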
<div id="magicdomid167" class="ace-line"><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">For that reason (and
many others) all the projects need to be built locally inside
the cluster. That makes most of the binary packages in the CentOS
distribution unusable for HPC use cases.</span><span
class="author-g-312iv59qbixk1uuz122z"> RPM packages are useful only </span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">a</span><span
class="author-g-312iv59qbixk1uuz122z">s proofs of concept on
development laptops, not for final deployments.</span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc"> Currently, the
EasyBuild project [2] is used as a replacement for RPM-based
spec files.</span><span class="author-g-312iv59qbixk1uuz122z"> </span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">Operators </span><span
class="author-g-312iv59qbixk1uuz122z">use the Fedora upstream
monitoring tool to track the latest and greatest software
(currently 400+ projects). </span></div>
<div id="magicdomid8" class=""><span
class="author-g-312iv59qbixk1uuz122z">InfiniBand is use</span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">d</span><span
class="author-g-312iv59qbixk1uuz122z"> to connect the
nodes; there are sometimes issues with third-party software drivers
(Bull/Atos and/or HPE).</span></div>
<div id="magicdomid9" class=""><br>
</div>
<div id="magicdomid10" class=""><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">Other notes:</span></div>
<div id="magicdomid11" class=""><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">- A lot of service
providers are still running on CentOS 6, which blocks the upgrade to
CentOS 7</span><span class="author-g-312iv59qbixk1uuz122z">;
a shutdown of the cluster is not possible, and the cluster
infrastructure changed between RHEL 6 and RHEL 7</span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">.</span></div>
<div id="magicdomid12" class=""><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">- Puppet and Ansible are
used to deploy clusters (Ansible for diskless nodes, Puppe</span><span
class="author-g-312iv59qbixk1uuz122z">t</span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc"> for nodes with
disks). Still, each deployment is unique (e.g. unexpected
situations) and thus not fully automated.</span><span
class="author-g-312iv59qbixk1uuz122z"> For the Ansible</span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc"> part</span><span
class="author-g-312iv59qbixk1uuz122z">,</span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc"> no role is</span><span
class="author-g-312iv59qbixk1uuz122z"> use</span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">d</span><span
class="author-g-312iv59qbixk1uuz122z"> from</span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc"> Ansible</span><span
class="author-g-312iv59qbixk1uuz122z"> Galaxy, only core
modules and custom roles and playbooks.</span></div>
<div id="magicdomid68" class="ace-line"><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">- Experiments with
containers are under way as well, via Singularity [3] (Docker is not fully
supported on CentOS 6 and needs a privileged user account).</span></div>
<div id="magicdomid14" class=""><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">- There is demand for packaging
and providing tooling for HPC rather than the libraries themselves.
If possible, provide a full HPC stack that is supported and maintained
by both upstream and the distribution (including full-stack
upgrades). CI/CD should be supported as well.</span></div>
<div id="magicdomid15" class=""><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">- The HPC community is
unfortunately security-free</span><span
class="author-g-312iv59qbixk1uuz122z">: deploying a security fix
can take several months, and there are dependencies on specific minor
releases or kernel versions. Third-party driver vendors should be
advised to use the kernel kABI whitelist to avoid hard version
dependencies.</span></div>
<div id="magicdomid16" class=""><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">- Each assigned set
of nodes is expected to be in a pristine state. Since rebooting a
node takes some time (on the order of minutes), the
tooling running inside a node must clean up everything a user task
left behind, which minimizes the number of times a node actually
has to be rebooted.</span></div>
<div id="magicdomid18" class=""><br>
</div>
<div id="magicdomid19" class=""><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">[1] </span><span
class="author-g-312iv59qbixk1uuz122z url"><a
href="https://docs.it4i.cz/salomon/hardware-overview/">https://docs.it4i.cz/salomon/hardware-overview/</a></span></div>
<div id="magicdomid20" class=""><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">[2] </span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc url"><a
href="https://github.com/hpcugent/easybuild">https://github.com/hpcugent/easybuild</a></span></div>
<div id="magicdomid21" class=""><span
class="author-g-eeoa7mbyz122z3pz122z5kuc">[3] </span><span
class="author-g-eeoa7mbyz122z3pz122z5kuc url"><a
href="http://singularity.lbl.gov/">http://singularity.lbl.gov/</a></span></div>
</body>
</html>