Hi folks,
We are planning to update all compatible plugins installed on the
ci.centos.org Jenkins instance.
I will do it tomorrow morning (Dec 18th) at 9am UTC. I will start
preparing for shutdown then, and all jobs triggered after that will
wait in the queue until the instance has updated its plugins and
restarted.
I would also like to put a question to all of you who are using the
OCP4 cluster: how would you like to manage Jenkins updates for your
namespace? There are several options: automatic updates whenever we
update the cluster (the default we use today), timed or
self-triggered updates, or updates on a change in tags.
I am interested in hearing your thoughts.
Thank you
--
Vipul Siddharth
He/His/Him
Fedora | CentOS CI Infrastructure Team
Due to an upgrade of some network switches in the DC hosting some
community projects (including but not limited to CentOS), a large
majority of our infra will not be reachable.
Migration is scheduled for """"Tuesday November 10th, 2:00 pm UTC time"""".
You can convert to local time with $(date -d '2020-11-10 14:00 UTC')
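If you want to see the same UTC timestamp in several time zones at once, the `date -d` one-liner above can be wrapped in a small helper. This is only a sketch: it assumes GNU date, the function name `utc_to_zone` is made up for illustration, and the zone names in the loop are arbitrary examples that need the tz database to be installed.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU date (the -d option is not available in BSD date).

# Print a UTC timestamp as wall-clock time in the given time zone.
#   $1: timestamp without zone, e.g. '2020-11-10 14:00'
#   $2: tz database name (Europe/Brussels) or POSIX TZ string (CET-1)
utc_to_zone() {
    local utc_time="$1"
    local zone="$2"
    # TZ only affects the output; the input is explicitly marked as UTC.
    TZ="$zone" date -d "${utc_time} UTC" '+%Y-%m-%d %H:%M %Z'
}

# Example zones, chosen arbitrarily for illustration.
for zone in UTC Europe/Brussels America/New_York Asia/Kolkata; do
    printf '%-20s %s\n' "$zone" "$(utc_to_zone '2020-11-10 14:00' "$zone")"
done
```

Running it prints one line per zone, which can be easier to paste into a team channel than asking everybody to run the conversion themselves.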
We unfortunately can't announce/give you any expected downtime as it can
last for several hours (info I received through invite) but we'll try to
restore all services/connectivity as soon as possible.
Impacted services in that DC :
- *all*
Non impacted services (easier to just mention short list of things not
in that DC, so items not listed below *will* be down) :
- https://www.centos.org
- https://forums.centos.org
- https://lists.centos.org
- mirrorlist.centos.org
--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
Hi folks,
Something to be aware of : https://pagure.io/centos-infra/issue/146
"'
There will be a planned outage to try and update the switches in the
CentOS main cage. The outage window will be for 4 hours from 14:00 UTC
until 18:00 UTC but should only be short bursts of switch outages.
Affected services:
CentOS CI
CentOS build systems
Other central services
If the outage has issues which can not be fixed within the 4 hour
window, the backup window for an outage is 2020-12-01 at the same time
area.
"'
Thank you. If you have any questions, please comment on the ticket.
--
Vipul Siddharth
He/His/Him
Fedora | CentOS Infrastructure Team
Hi!
To speed up some of the testing we do on bare-metal machines provisioned
through Duffy, I would like to pull pre-built images from the OpenShift
registry. The images are built through a BuildConfig and placed in an
ImageStream.
Now, it seems that the Duffy-provisioned bare-metal systems cannot pull
from the internal OpenShift registry:
[root@n46 ~]# podman pull image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000/ceph-csi/ceph-csi:test
Trying to pull image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000/ceph-csi/ceph-csi:test...
Get https://image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:…: dial tcp 172.19.0.254:5000: connect: no route to host
Error: error pulling image "image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000/ceph-csi/ceph-csi:test": unable to pull image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000/ceph-csi/ceph-csi:test: unable to pull image: Error initializing source docker://image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000/ceph-csi/ceph-csi:test: error pinging docker registry image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:5000: Get https://image-registry.openshift-image-registry.svc.apps.ocp.ci.centos.org:…: dial tcp 172.19.0.254:5000: connect: no route to host
I wonder if this is intentional, or if this is a little too strict? If
this cannot be allowed through the firewall, what is the recommended
way to use these images? Maybe we should deploy our own registry and
push the images there...
Thanks!
Niels
Yesterday (Saturday) evening we got Zabbix notifications that some nodes
in the CI environment were unreachable. After a quick look, I discovered
that an embedded network switch in a chassis hosting multiple nodes
(including but not limited to the Jenkins node behind ci.centos.org)
had gone nuts.
I tried a remote "hardware reset" and the nodes were back online after
~10min.
But this morning (Sunday), I saw through Zabbix that the same issue had
happened again. Within the hour I did another "hardware reset", but
this time even that doesn't work anymore.
So that means we have a network switch that is no longer working.
As that chassis (like almost *all* equipment in CI) *isn't* under
warranty, we'll see on Monday what can be done and how to prioritize
moving services elsewhere (which probably means powering down other
services, depending on the priority they are given), but it's easy to
understand that we can't give any ETA at this point.
Thanks for your understanding,
--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
We had a kernel panic on the storage box used as NFS server for
OpenShift (both OKD and OCP), and the machine doesn't come back online
because the md device refuses to start.
The machine is now in single-user mode so that we can analyze the
situation and try to fix it.
We'll send more details and progress updates when possible.
--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
Due to hardware maintenance that needs to take place on the NFS
storage node used by OpenShift (both the "legacy" cluster and the
current one, OCP), we'll have to shut down the OpenShift cluster and
then proceed with hardware maintenance on the NFS server (which itself
needs to be powered down; there is no way to do that "online").
Migration is scheduled for """"Wednesday September 30th, 12:00 pm UTC
time"""".
You can convert to local time with $(date -d '2020-09-30 12:00 UTC')
The expected "downtime" is estimated at ~60 minutes: the time needed to
shut down the machine, install new disks, restart the machine and also
do some updates and tuning on the setup.
For more information about this, here are some relevant tickets that
were created for the perf issue in OpenShift and NFS:
https://pagure.io/centos-infra/issue/53
https://pagure.io/centos-infra/issue/105
https://pagure.io/centos-infra/issue/85
https://pagure.io/centos-infra/issue/26
<subliminal message>
PS: worth noting that while we'll investigate reports on the new OCP
cluster, we'll probably not spend time investigating the old/legacy
one, which projects are supposed to migrate away from soon, as the
legacy OpenShift setup will disappear (see
https://pagure.io/centos-infra/issue/16)
</subliminal message>
Thanks for your understanding and patience.
On behalf of the CI Infra team,
--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
Hi,
As you noticed recently, we started to refresh the infra used for CentOS
CI (not the hardware, still the same, but the software stack and the way
to control/manage it).
One of the identified nodes still being used and that needs to be
converted to the new infra layout is the ssh jumphost (see
https://wiki.centos.org/QaWiki/CI/GettingStarted#How_to_use_it)
Normally, some of you have switched to OpenShift workloads (including on
the new OpenShift 4/OCP setup that went live recently), but some
projects are still on the old setup, sometimes with a need to reach
dedicated/shared VMs acting as Jenkins agent[s], connected to Jenkins
behind https://ci.centos.org.
We have already provisioned a new VM in the new setup (that can reach
the whole CI subnet and VLAN), but here are some points to consider
(the reason why we wanted to pre-announce this long in advance, before
we do the real switch):
* The new ssh jump host is CentOS 8 based (versus CentOS 6), meaning
that if you use an ssh-dss key (instead of ssh-rsa), you'll *not* be
able to connect through the new host. We have already identified such
keys, and Vipul will try to reach out (when the key is tied to a real
email address for the project). But it's better to announce it here
too, so that you have time to ask us to reflect a change (through a
ticket on https://pagure.io/centos-infra/issues)
* The old VM allowed shell access, but it will be disallowed on the new
one (there is no need for a shell on that intermediate node anyway). As
a reminder, you can configure your ssh config to use ProxyCommand
directly, or ProxyJump (on recent openssh-client). See
https://wiki.centos.org/TipsAndTricks/SshTips/JumpHost
* Because the host has a new sshd_host_key, it comes with a new
fingerprint too, so if you have automation and you don't already trust
our CA, the fingerprints for the new host will be:
[fingerprint]
rsa=3072 SHA256:n7y0qZS/FvhjaskOBds3TTKQh5EtgNQ25E7cmTNBATg (RSA)
rsa_md5=3072 MD5:9e:83:46:d0:c5:8a:a0:94:50:10:58:9d:af:ca:50:19 (RSA)
ecdsa=256 SHA256:ZQacwDsWkKBYL9HJJYwHr94Ny1sMhHMDnz9GiLFb8Uc (ECDSA)
ecdsa_md5=256 MD5:dd:24:ea:6a:fd:8b:29:3d:1d:d0:a9:32:8c:b2:ea:62 (ECDSA)
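To prepare for the three points above, here is a rough sketch of helpers you could adapt. It only illustrates the ideas: the function names and the host names in the usage note are hypothetical, `is_dss_key` assumes a plain public key line (no authorized_keys options in front of the key type), and only the two SHA256 fingerprints listed above are taken from this announcement.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of any CentOS tooling.

# SHA256 fingerprints of the new jump host, as published in this announcement.
EXPECTED_FINGERPRINTS=(
    "SHA256:n7y0qZS/FvhjaskOBds3TTKQh5EtgNQ25E7cmTNBATg"
    "SHA256:ZQacwDsWkKBYL9HJJYwHr94Ny1sMhHMDnz9GiLFb8Uc"
)

# Succeeds if the given public key line is of type ssh-dss (DSA),
# i.e. a key that will not be accepted through the new host.
is_dss_key() {
    case "$1" in
        ssh-dss\ *) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints an ssh_config stanza that reaches a target through a jump host.
# ProxyJump needs OpenSSH 7.3 or newer; older clients need ProxyCommand.
jump_stanza() {
    local target="$1" jumphost="$2"
    printf 'Host %s\n    ProxyJump %s\n' "$target" "$jumphost"
}

# Reads `ssh-keygen -l` style lines on stdin:
#   <bits> <fingerprint> <comment> (<type>)
# Succeeds only if at least one line was read and every fingerprint
# is one of EXPECTED_FINGERPRINTS.
fingerprints_trusted() {
    local bits fp rest expected known seen=0
    while read -r bits fp rest; do
        [ -n "$fp" ] || continue
        seen=1
        known=0
        for expected in "${EXPECTED_FINGERPRINTS[@]}"; do
            if [ "$fp" = "$expected" ]; then
                known=1
            fi
        done
        [ "$known" -eq 1 ] || return 1
    done
    [ "$seen" -eq 1 ]
}
```

With a placeholder host name, automation could run something like `ssh-keyscan -t rsa,ecdsa jump.example.org 2>/dev/null | ssh-keygen -lf - | fingerprints_trusted` before accepting the new host key, and `is_dss_key "$(cat ~/.ssh/id_dsa.pub)"` to find out whether a key needs replacing. The real jump host name is not given here, so substitute it yourself.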
As we know that it's August and that some of you are probably on PTO
(coming back or leaving soon), after discussion between Vipul, David
and myself, we decided that we'll probably go live around the beginning
of September.
Should you have any questions about that migration, feel free to reply
to this thread (ideally on the dedicated ci-users mailing list), or on
irc.freenode.net (#centos-ci)
On behalf of the CentOS CI infra team,
--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
Hi all,
As you probably noticed in the last weeks/months, we have a stronger
collaboration and synergy with the Fedora infrastructure team. Combining
forces and resources helps both projects at the same time, as the
majority of the CentOS contributors are already Fedora contributors,
and probably vice versa.
It's not a secret (it was announced through CPE weekly mails on this
list) that the CentOS board approved the idea of merging authentication
systems in the near future (as an example).
This email is to let you know that all RFEs/issues concerning the
following areas should be reported to a new issue tracker:
https://pagure.io/centos-infra/issues/ , to adopt the same workflow
the Fedora infra team is already using (see
https://docs.fedoraproject.org/en-US/cpe/working_with_us/ )
Concerned areas :
- https://cbs.centos.org (Community BuildSystem, aka koji)
- Special Interest Groups requests (for mirror, resources, etc)
- https://ci.centos.org (All CI infra ecosystem)
- Everything around CentOS Infra (mirror issues, etc)
We have already migrated, for example, the open tickets that were filed
under the "Buildsys", "CI" and "Infrastructure" categories to the new
issue tracker. The idea is to *not* request work through IRC but
rather through the new infra issue tracker.
Imported tickets will be discussed there and worked on (reviewed on a
daily basis) after having been prioritized.
Should you feel a need to discuss this new process, feel free to do so
in #centos-devel on irc.freenode.net or on this centos-devel list.
Kind Regards,
--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab