On Mon, Jun 19, 2017 at 1:32 PM, Jan Chaloupka <jchaloup at redhat.com> wrote:

> Sharing notes from a visit to the IT4Innovations center in Ostrava.
>
> Supercomputer parameters are available at [1] - a mixture of Xeon, Xeon
> Phi and graphics card cores. Two clusters - Anselm runs RHEL 6, Salomon
> runs CentOS 6.
>
> Each node runs on a specific piece of hardware. If a project is built on
> a builder with different hardware, it is built with different flags and
> configuration and the resulting binaries are not properly optimized. Some
> projects need special compiler features in order to run efficiently.
> Some of these features are available only in the latest compiler
> versions. By the time the latest versions get into builders, it is
> usually too late. Or a project needs to be built with a proprietary
> compiler that is not publicly available. Or, given that each project
> generally needs a different version of system libraries, the HPC
> infrastructure needs to offer various versions of the same library.
> The software needs to be available in many flavors. This means the same
> software is built with multiple compilers and in multiple versions.
> Different versions have different properties/features. Different
> compilers emphasize/provide different optimizations. The result is a
> matrix of software builds to provide to users. In the end, software needs
> to be built for the end-system architecture so it can use all the
> available instructions and not slow down the computation.
> For that reason (and many others) all the projects need to be built
> locally inside the cluster. That makes most of the binary packages in the
> CentOS distribution unusable for HPC use cases. rpm is useful only for
> proof-of-concept work on development laptops, not for final deployments.
> Currently, the EasyBuild project [2] is used as a replacement for
> rpm-based spec files. Operators use the Fedora upstream monitoring tool
> to track the latest and greatest software (currently ~400+ projects).
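
The "matrix of software" described in the notes above can be sketched in a few lines of Python. This is only an illustration of how quickly the number of builds grows: the version strings, compiler names and CPU targets are made-up placeholders, not the toolchains actually installed on Anselm or Salomon, and `build_matrix` is a hypothetical helper, not part of EasyBuild.

```python
from itertools import product

# Hypothetical axes of the build matrix; all values are illustrative
# placeholders, not the real IT4Innovations configuration.
SOFTWARE_VERSIONS = ["1.0", "1.1"]
COMPILERS = ["gcc-6", "intel-2017"]
CPU_TARGETS = ["haswell", "knc"]


def build_matrix(versions, compilers, targets):
    """Return every (version, compiler, target) combination to build."""
    return [
        {"version": v, "compiler": c, "target": t}
        for v, c, t in product(versions, compilers, targets)
    ]


if __name__ == "__main__":
    for build in build_matrix(SOFTWARE_VERSIONS, COMPILERS, CPU_TARGETS):
        print("{version} / {compiler} / {target}".format(**build))
```

Even with two values per axis a single project already needs eight builds, which is why a local, scripted build system is used here instead of one binary package per project.
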
> Infiniband is used to connect the nodes; there are sometimes issues with
> 3rd party SW drivers (Bull/Atos and/or HPE).
>
> Other notes:
> - A lot of service providers are still running on CentOS 6, which blocks
> the upgrade to CentOS 7 - shutting down a cluster is not possible, and
> the infra for clusters changed between RHEL 6 and RHEL 7.
> - Puppet and Ansible are used to deploy clusters (Ansible for disk-free
> nodes, Puppet for nodes with disks). Still, each deployment is unique
> (e.g. unexpected situations) and thus not fully automated. For the
> Ansible part, no role from Ansible Galaxy is used - just core modules
> and custom roles and playbooks.
> - Experiments with containers as well, via Singularity [3] (Docker is not
> fully supported on CentOS 6, needs privileged user account)
>

Even if Docker were supported, the kernel on the compute nodes of a cluster
will stay fixed and old (due to e.g. built-in Infiniband support). Won't
that break containers if someone creates a Docker image that assumes access
to a very recent kernel on the Docker host?

https://forums.docker.com/t/libc-incompatibilities-when-will-they-emerge/9895/4

Marcin

> - Demand for packaging and tooling for HPC rather than the libraries
> themselves. If possible, provide a full HPC stack that is upstream and
> distribution supported/maintained (including full stack upgrades).
> CI/CD supported as well.
> - The HPC community is unfortunately security-free: security fix
> deployment can take several months, and there are dependencies on
> specific minor releases or kernel versions. The kernel kABI whitelist
> should be recommended to 3rd party driver vendors to prevent hard
> version deps.
> - Each assigned set of nodes is expected to be in a pristine state.
> Because rebooting a node takes time (on the order of minutes), the
> tooling running inside a node must clean up everything a user task left
> behind, minimizing the number of times a node is actually rebooted.
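
On the kernel question raised in the reply above: one way to make such a mismatch visible before a job starts is to compare the node's kernel release with the minimum an image claims to need. This is a minimal sketch, assuming the image author publishes a minimum kernel version. The `required` value and the function names are hypothetical, not part of Docker or Singularity.

```python
import platform
import re


def parse_kernel(release):
    """Extract the leading numeric part of a kernel release as a tuple.

    For example, "2.6.32-696.el6.x86_64" yields (2, 6, 32).
    """
    match = re.match(r"(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.group(1).split("."))


def kernel_satisfies(host_release, required):
    """Return True if the host kernel is at least the required version."""
    host, need = parse_kernel(host_release), parse_kernel(required)
    # Pad the shorter tuple with zeros so "3.10" and "3.10.0" compare equal.
    width = max(len(host), len(need))
    host += (0,) * (width - len(host))
    need += (0,) * (width - len(need))
    return host >= need


if __name__ == "__main__":
    # Hypothetical minimum kernel declared by an image author.
    required = "3.10"
    host = platform.release()
    print(host, "ok" if kernel_satisfies(host, required) else "too old")
```

Distribution kernels backport features, so a plain version comparison is only a rough signal, not proof that a given image will or will not run.
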
>
> [1] https://docs.it4i.cz/salomon/hardware-overview/
> [2] https://github.com/hpcugent/easybuild
> [3] http://singularity.lbl.gov/
>
> _______________________________________________
> CentOS-devel mailing list
> CentOS-devel at centos.org
> https://lists.centos.org/mailman/listinfo/centos-devel