Hi all,
We seem to have an issue with an IBM Power 8 machine used within CI, so
we have to re-balance the CI nodes that you can request through the
Duffy API for ppc64/ppc64le.
My question is more of a survey: I think that most (if not all) CI
projects that are still building (and testing in CI) target only the
ppc64le architecture (Little Endian), not the ppc64 (Big Endian) one.
We'd like to hear from you. Depending on the needs, we could then drop
the ppc64 architecture for CI tests and so have more (re-balanced)
ppc64le resources.
Opinions?
--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
Hi!
To speed up some of the testing we do on bare-metal machines provisioned
through Duffy, I would like to pull pre-built images from the OpenShift
registry. The images are built through a BuildConfig and placed in an
ImageStream.
Now, it seems that the Duffy-provisioned bare-metal systems cannot pull
from the internal OpenShift registry:
[root@n46 ~]# podman pull image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000/ceph-csi/ceph-csi:test
Trying to pull image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000/ceph-csi/ceph-csi:test...
Get https://image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:…: dial tcp 172.19.0.254:5000: connect: no route to host
Error: error pulling image "image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000/ceph-csi/ceph-csi:test": unable to pull image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000/ceph-csi/ceph-csi:test: unable to pull image: Error initializing source docker://image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000/ceph-csi/ceph-csi:test: error pinging docker registry image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000: Get https://image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:…: dial tcp 172.19.0.254:5000: connect: no route to host
I wonder if this is intentional, or if this is a little too strict? If
this cannot be allowed through the firewall, what is the recommended
way to use these images? Maybe we should deploy our own registry and
push the images there...
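To tell a firewall or routing block apart from a registry-side problem,
a plain TCP check from the Duffy node can help. The following is only a
minimal sketch, not an official diagnostic: the helper name
`can_connect` and the 5-second default timeout are my own choices.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration. It only tests reachability
    and does not speak the registry protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "no route to host", refused connections, timeouts
        # and name resolution failures.
        return False
```

Called with the registry host name and port 5000 from the podman output
above, a False result would point at the network path rather than at
the image or the credentials.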
Thanks!
Niels
Yesterday (Saturday) evening we got Zabbix notifications that some nodes
in the CI environment were unreachable. After a quick look, I discovered
that an embedded network switch in a chassis hosting multiple nodes
(including but not limited to the Jenkins node behind ci.centos.org)
went nuts.
I tried a remote "hardware reset" and the nodes were back online after
~10 min.
But this morning (Sunday), I saw through Zabbix that the same issue had
happened again. Within the hour I did the "hardware reset" again, but
this time even that doesn't work anymore.
So that means we have a network switch that no longer works.
As that chassis (like almost *all* equipment in CI) *isn't* under
warranty, we'll see on Monday what can be done and how to prioritize
dispatching services elsewhere (which probably means powering down
other services, depending on the priority given), but it's easy to
understand that we can't give any ETA at this point.
Thanks for your understanding,
--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
We had a kernel panic on the storage box used as the NFS server for
OpenShift (both OKD and OCP), and the machine doesn't come back online
because an md device refuses to start.
The machine is now in single-user mode so we can analyze the situation
and try to fix it.
We'll send more details and progress when possible.
--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
Due to hardware maintenance that needs to take place on the NFS
storage node used by OpenShift (both the "legacy" cluster and the
current one, OCP), we'll have to shut down the OpenShift cluster and
then proceed with the hardware maintenance on the NFS server (which
itself needs to be powered down; there is no way to do that "online").
The maintenance is scheduled for "Wednesday September 30th, 12:00 pm UTC
time".
You can convert to local time with $(date -d '2020-09-30 12:00 UTC')
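For anyone without GNU date at hand, the same conversion can be
sketched in Python. This is only an illustration: the names `START_UTC`
and `maintenance_start` are hypothetical, and the timestamp is the one
announced above.

```python
from datetime import datetime, timedelta, timezone

# Start of the maintenance window as announced: 2020-09-30 12:00 UTC.
START_UTC = datetime(2020, 9, 30, 12, 0, tzinfo=timezone.utc)


def maintenance_start(tz=None):
    """Return the announced start time expressed in the given tzinfo.

    With tz=None, astimezone() uses the system's local time zone.
    """
    return START_UTC.astimezone(tz)


if __name__ == "__main__":
    # Start time in the local time zone of the machine running this.
    print(maintenance_start().strftime("%Y-%m-%d %H:%M %Z"))
    # Same instant at a fixed UTC+2 offset, chosen only as an example.
    print(maintenance_start(timezone(timedelta(hours=2))).isoformat())
```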
The expected "downtime" is estimated at ~60 minutes: the time needed to
shut down the machine, install new disks, restart the machine and also
do some updates and tuning on the setup.
For more information about this, here are some relevant tickets that
were created for the perf issue in OpenShift and NFS:
https://pagure.io/centos-infra/issue/53
https://pagure.io/centos-infra/issue/105
https://pagure.io/centos-infra/issue/85
https://pagure.io/centos-infra/issue/26
<subliminal message>
PS: worth noting that while we'll investigate reports on the new OCP
cluster, we'll probably not spend time investigating the old/legacy
one. Projects are supposed to migrate away from it soon, as the legacy
OpenShift setup will disappear soon (see
https://pagure.io/centos-infra/issue/16)
</subliminal message>
Thanks for your understanding and patience.
On behalf of the CI Infra team,
--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab