[CentOS-devel] HPC SIG: IT4Innovations Ostrava - Supercomputer visit notes

Mon Jun 19 12:12:32 UTC 2017
Marcin Dulak <marcin.dulak at gmail.com>

On Mon, Jun 19, 2017 at 1:32 PM, Jan Chaloupka <jchaloup at redhat.com> wrote:

> Sharing notes from a visit to the IT4Innovations center in Ostrava.
>
> Supercomputer parameters are available at [1]: a mixture of Xeon, Xeon
> Phi and graphics card cores. There are two clusters: Anselm runs RHEL 6,
> Salomon runs CentOS 6.
>
> Each node runs on a specific piece of hardware. If a project is built on
> a builder with different hardware, it is built with different flags and
> configuration, and the resulting binaries are not properly optimized. Some
> projects need special compiler features in order to run efficiently.
> Some of these features are available only in the latest compiler
> versions. By the time the latest versions get into builders, it is usually
> too late. Or a project needs to be built with a proprietary compiler that is
> not publicly available. And since each project generally needs different
> versions of system libraries, the HPC infrastructure needs to offer various
> versions of the same library.
> The software needs to be available in many flavors: the same software
> has to be built with multiple compilers and in multiple versions.
> Different versions have different properties/features. Different compilers
> emphasize/provide different optimizations. The result is a matrix of
> software to provide to the users. In the end, software needs to be built
> for the end-system architecture so that it can use all the available
> instructions and does not slow down the computation.
> For that reason (and many others) all the projects need to be built
> locally inside the cluster. That makes most of the binary packages in the
> CentOS distribution unusable for HPC use cases. The only useful use case
> for rpm is proof-of-concept work on devel laptops, not final deployments.
> Currently, the EasyBuild project [2] is used as a replacement for rpm-based
> spec files. Operators use the Fedora upstream monitoring tool to track the
> latest and greatest software (atm. ~400+ projects).
> InfiniBand is used to connect the nodes; there are sometimes issues with
> 3rd-party SW drivers (Bull/Atos and/or HPE).
>
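
To make the "matrix of software" above concrete, here is a minimal Python
sketch that enumerates such a matrix. All package names, compiler versions
and target architectures in it are hypothetical placeholders, not the
actual IT4Innovations configuration, and it only lists the combinations;
the builds themselves would still be done by a tool such as EasyBuild.

```python
from itertools import product

# Hypothetical inputs, for illustration only.
software = {"examplelib": ["1.0", "2.0"]}
compilers = {"gcc": ["6.3.0"], "intel": ["2017.1"]}
targets = ["haswell", "knl"]


def build_matrix(software, compilers, targets):
    """Return every (name, version, compiler, compiler_version, target)."""
    combos = []
    for name, versions in sorted(software.items()):
        for comp, comp_versions in sorted(compilers.items()):
            for ver, cver, tgt in product(versions, comp_versions, targets):
                combos.append((name, ver, comp, cver, tgt))
    return combos


matrix = build_matrix(software, compilers, targets)
for row in matrix:
    # One line per build that would have to be provided to users.
    print("%s-%s-%s-%s-%s" % row)
```

Even this toy input (one package, two versions, two compilers, two
targets) already yields eight builds, which illustrates why prebuilt
distribution binaries cover so little of the matrix.
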
> Other notes:
> - A lot of service providers are still running on CentOS 6, which blocks
> the upgrade to CentOS 7: shutting down the cluster is not possible, and
> the infra for clusters changed between RHEL 6 and RHEL 7.
> - Puppet and Ansible are used to deploy clusters (Ansible for diskless
> nodes, Puppet for nodes with disks). Still, each deployment is unique
> (e.g. unexpected situations) and thus not fully automated. For the Ansible
> part, no roles from Ansible Galaxy are used, just core modules and
> custom roles and playbooks.
> - Experiments with containers as well, via Singularity [3] (Docker is not
> fully supported on CentOS 6 and needs a privileged user account).
>

Even if Docker were supported, the kernel on the compute nodes of a cluster
will stay fixed and old (due to e.g. InfiniBand support being built in).
Won't that break containers if someone creates a Docker image that assumes
access to a very recent kernel on the Docker host?
https://forums.docker.com/t/libc-incompatibilities-when-will-they-emerge/9895/4
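
As a rough illustration of the concern, here is a minimal Python sketch of
a check that a container entrypoint could run before starting the real
workload. The idea that an image declares a minimum kernel version is my
assumption, not an existing Docker or Singularity feature, and the release
string in the usage note is just an example of a CentOS 6 style kernel.

```python
import platform
import re


def parse_kernel(release):
    """Turn a release string like '2.6.32-696.el6.x86_64' into (2, 6, 32)."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    # A missing third component (e.g. '3.10') is treated as 0.
    return tuple(int(g) if g is not None else 0 for g in m.groups())


def host_kernel_ok(required, release=None):
    """Return True if the host kernel is at least the required version.

    'required' is a hypothetical value the image author would declare.
    Without an explicit 'release', the running kernel is inspected.
    """
    if release is None:
        release = platform.release()
    return parse_kernel(release) >= parse_kernel(required)
```

For example, host_kernel_ok("3.10", "2.6.32-696.el6.x86_64") returns
False. A plain version comparison is coarse, since enterprise kernels
carry backported features, so this can only flag a likely mismatch.
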

Marcin


> - The demand is for packaging and tooling for HPC rather than for the
> libraries themselves. If possible, provide a full HPC stack that is
> upstream and distribution supported/maintained (including full stack
> upgrades), with CI/CD supported as well.
> - The HPC community is unfortunately security-free: deploying a security
> fix can take several months, and there are dependencies on specific minor
> releases or kernel versions. The kernel kABI whitelist should be
> recommended to 3rd-party driver vendors to prevent hard version deps.
> - Each assigned set of nodes is expected to be in a fresh, vanilla state.
> Since rebooting a node takes some time (on the order of minutes), the
> tooling running inside a node must clean up everything a user task left
> behind, thus minimizing the number of times a node is actually rebooted.
>
> [1] https://docs.it4i.cz/salomon/hardware-overview/
> [2] https://github.com/hpcugent/easybuild
> [3] http://singularity.lbl.gov/