Sharing notes from a visit to the IT4Innovations center in Ostrava.
Supercomputer hardware parameters are available at [1] - a mixture of Xeon, Xeon Phi and GPU cores. There are two clusters - Anselm runs RHEL 6, Salomon runs CentOS 6.
Each node runs on a specific piece of hardware. If a project is built on a builder with different hardware, it is built with different flags and configuration, and the resulting binaries are not properly optimized. Some projects need special compiler features in order to run efficiently, and some of those features are available only in the latest compiler versions - by the time the latest versions reach the builders, it is usually too late. Or a project needs to be built with a proprietary compiler that is not publicly available. Or, since each project in general needs different versions of system libraries, the HPC infrastructure has to offer several versions of the same library.

The software therefore needs to be available in many flavors, i.e. the same software built with multiple compilers and in multiple versions. Different versions have different properties/features, and different compilers emphasize different optimizations - the result is a matrix of software to offer the users. In the end the software needs to be built for the end-system architecture so it can use all the available instructions and not slow down the computation. For that reason (and many others) all the projects need to be built locally inside the cluster, which makes most of the binary packages in the CentOS distribution unusable for HPC use cases. The only remaining use case for RPM is proof-of-concept work on development laptops, not final deployments.

Currently, the EasyBuild project [2] is used as a replacement for RPM-based spec files. Operators use the Fedora upstream monitoring tool to track the latest and greatest software (at the moment ~400+ projects). InfiniBand is used to connect the nodes; there are sometimes issues with 3rd-party software drivers (Bull/Atos and/or HPE).
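To make the "matrix of software" point concrete, here is a rough sketch of the EasyBuild-plus-environment-modules workflow; the package name, versions and toolchains are illustrative and not taken from the IT4I installation:

    # build the same software with two different toolchains (two cells of the matrix)
    eb zlib-1.2.11-foss-2017a.eb --robot     # GCC-based toolchain, --robot resolves dependencies
    eb zlib-1.2.11-intel-2017a.eb --robot    # Intel toolchain

    # users then pick the flavor they need through environment modules
    module avail zlib
    module load zlib/1.2.11-intel-2017a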
Other notes:
- A lot of service providers are still running on CentOS 6, which blocks the upgrade to CentOS 7 - a shutdown of the cluster is not possible, and the infrastructure for the clusters changed between RHEL 6 and RHEL 7.
- Puppet and Ansible are used to deploy the clusters (Ansible for disk-free nodes, Puppet for nodes with disks). Still, each deployment is unique (e.g. unexpected situations) and thus not fully automated. On the Ansible side no roles from Ansible Galaxy are used - just Core modules and custom roles and playbooks.
- Experiments with containers as well, via Singularity [3] (Docker is not fully supported on CentOS 6, needs a privileged user account).
- The demand is for packaging and tooling for HPC rather than for the libraries themselves. If possible, provide a full HPC stack that is upstream and distribution supported/maintained (including full-stack upgrades), with CI/CD supported as well.
- The HPC community is unfortunately security-free - security fix deployment can take several months, with dependencies on specific minor releases or kernel versions. Third-party driver vendors should be advised to follow the kernel KABI whitelist to prevent hard version dependencies.
- Each assigned set of nodes is expected to be vanilla new. Given it takes some time to reboot a node (on the order of minutes), all the tooling running inside a node must clean up everything a user task left behind - thus minimizing the number of times a node really has to be rebooted (a sketch of such a cleanup hook follows after this list).
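Roughly the kind of per-job cleanup hook a scheduler could run between jobs instead of rebooting - only a sketch; the script, the assumption that the job owner arrives as the first argument, and the cleaned paths are illustrative, not the IT4I implementation:

    #!/bin/bash
    # Hypothetical per-job epilog: return the node to a clean state without a reboot.
    # Assumes $1 is the account that owned the finished job.
    user="$1"

    pkill -KILL -u "$user" 2>/dev/null                                # kill anything the job left running
    find /tmp /dev/shm -user "$user" -exec rm -rf {} + 2>/dev/null    # drop leftover scratch files and shm segments
    exit 0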
[1] https://docs.it4i.cz/salomon/hardware-overview/
[2] https://github.com/hpcugent/easybuild
[3] http://singularity.lbl.gov/
On Mon, Jun 19, 2017 at 1:32 PM, Jan Chaloupka jchaloup@redhat.com wrote:
- Experiments with containers as well via Singularity [3] (Docker is not
fully supported on CentOS 6, needs privileged user account)
Even if Docker were supported, the kernel on the compute nodes of a cluster will stay fixed and old (due to e.g. built-in InfiniBand support). Won't that break containers if someone creates a Docker image that assumes access to a very recent kernel on the Docker host? https://forums.docker.com/t/libc-incompatibilities-when-will-they-emerge/9895/4
Marcin
Hi Marcin,
With Singularity you can run a CentOS/RHEL 7.x container on a CentOS/RHEL 6.x OS easily and smoothly.
Regards, DH
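A minimal sketch of what DH describes, using Singularity 2.x-era commands (the resulting image file name is illustrative):

    # on a CentOS 6 host: pull a CentOS 7 userspace as a Singularity image
    singularity pull docker://centos:7
    # run a command from the CentOS 7 userspace on top of the old host kernel
    singularity exec centos-7.img cat /etc/redhat-release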
Forgive me if this thread is considered OT - I think it contains some cross-cultural insights that may be valuable.
Supercomputer hardware parameters are available at [1] - a mixture of Xeon, Xeon Phi and GPU cores. There are two clusters - Anselm runs RHEL 6, Salomon runs CentOS 6.
It's quite common to find a mixture of node configurations - sometimes within a single cluster, sometimes partitioned. Running a commercial Linux is relatively uncommon, though, since most centres prefer CentOS or perhaps Scientific Linux. I haven't found vendor expertise particularly helpful in HPC situations, especially considering that a non-small centre really must maintain its own skilled personnel.
Each node runs on a specific piece of hardware. If a project is built on a builder with different hardware, it is built with different flags and configuration, and the resulting binaries are not properly optimized. Some
This matters sometimes. There is a continuum of codes, from those that don't really care about (machine-specific) optimization to those where it makes a lot of difference. A convenient middle ground is generic apps that select optimized matrix libraries. But for an app that is not vector-friendly, those things don't matter much.
project needs to be built with a proprietary compiler that is not publicly available. Or, since each project in general needs different versions of system libraries
The phenomenon is really that the various tools/libraries/pipelines come from a variety of different development organizations, which vary in how aggressively they pursue recent versions. The worst are sloppy coders who don't follow standards and so require extremely specific versions of many component packages ("only foo-3.14 is known to work"). That's the main motive for containerization (and, to some extent, for solutions like environment modules and Nix). There really is no such thing as "system libraries" when you view software this way...
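As a concrete illustration of the version pinning that environment modules give you (reusing the hypothetical "foo-3.14" from above; "bar" and the run script are likewise made up):

    # pin exactly the versions a picky pipeline insists on
    module purge                    # start from a clean environment
    module load foo/3.14 bar/2.7    # only these versions are "known to work"
    ./run_pipeline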
architecture so it can use all the available instructions and not slow down the computation.
Besides vector length (i.e. SSE vs AVX vs AVX-512), there have not been many public, solid demonstrations that compiler- or flag-tweaking matters much. The main scaling dimension is "more nodes", and may be driven by memory footprint as much as CPU speed, so few people sweat the flags to deliver some 7.34% improvement in single-core performance (in my experience).
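For concreteness, the kind of flag tweaking being discussed looks like this (GCC shown; the source and binary names are placeholders):

    gcc -O2 -o app_generic app.c                # portable baseline
    gcc -O3 -march=native -o app_tuned app.c    # tuned for the build host, e.g. enables AVX/AVX-512 if available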
- Puppet and Ansible are used to deploy the clusters (Ansible for disk-free nodes,
Puppet for nodes with disks). Still, each deployment is unique (e.g.
I claim that HPC clusters do not benefit much from these tools: their main value is in handling widely diverse environments that need to change frequently, whereas most HPC clusters are just rack after rack of the same nodes that need to behave the same this year as last year. (My organization uses oneSIS, which permits stateless nodes that run a read-only NFS root; to upgrade a package, you just "chroot /var/lib/oneSIS/image yum update foo".)
- Experiments with containers as well via Singularity [3] (Docker is not
fully supported on CentOS 6, needs privileged user account)
The key point is that Docker is simply inappropriate for HPC, where the norm is a large, shared cluster, and jobs are nothing like Docker's raison d'être (web servers, Redis, etc.). In a sense, the normal cluster scheduler is already automating resource management (memory, cores, GPUs), and jobs don't need e.g. exposed IP addresses. Your storage is a quota in a many-PB /project filesystem, not dynamically provisioned S3 buckets.
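To illustrate "the normal cluster scheduler is already automating resource management": a minimal batch job, assuming Slurm (the thread does not say which scheduler is in use; the binary name is a placeholder):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=32G
    #SBATCH --gres=gpu:1        # the scheduler hands out the GPU, memory and cores
    #SBATCH --time=02:00:00
    srun ./simulate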
- The HPC community is unfortunately security-free, security fix deployment can
That's a bit of an overstatement. Most published fixes are simply irrelevant, since very little desktop-ish user space is even available on compute nodes (Firefox, for instance). Kernel updates may be delayed if there are constraints from drivers or out-of-tree filesystems. Mitigation (like simply disabling SCTP) is common. But the concern for repeatability also makes it unattractive to apply, for instance, glibc updates.
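As an example of the kind of mitigation mark mentions, SCTP can be disabled on a RHEL/CentOS node without touching any package (the .conf file name is arbitrary):

    # prevent the sctp kernel module from ever being loaded
    echo "install sctp /bin/true" > /etc/modprobe.d/disable-sctp.conf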
- Each assigned set of nodes is expected to be vanilla new. Given it takes
This varies between centres - we don't reboot compute nodes voluntarily, and a single node may be shared by jobs from multiple users. You trade increased utilization/efficiency against some exposure to security and performance interference. It's really a question of how diverse your user base is...
regards, mark hahn.