[CentOS-devel] major infra issue : impacting git.centos.org and cbs.centos.org

Sun Mar 3 21:47:02 UTC 2024
Fabian Arrotin <arrfab at centos.org>

On 03/03/2024 20:27, Fabian Arrotin wrote:
> On 03/03/2024 19:48, Fabian Arrotin wrote:
>> This evening (Sunday), I got a zabbix notification that some services 
>> hosted on the same hypervisor were down.
>> A quick investigation showed me that, despite running on a hardware 
>> raid controller, said server's firmware confirmed data loss and corruption.
>>
>> Although I'm normally on PTO, I still wanted to restore services 
>> quickly, so I'm working on redeploying the services from scratch and 
>> restoring data from the last backup, and hope to have news soon ...
>>
> 
> Status update : cbs.centos.org kojihub was fully reinstalled from 
> scratch on a different hypervisor, reconfigured by Ansible and DB 
> restored from backup that happened earlier today.
> 
> I quickly checked and it seems all operations are working fine.
> The only issue you might see is if you submitted a build today, 
> *after* the postgresql backup operation took place; if that's the 
> case, consider rebuilding your rpm (but it's usually quiet during the 
> weekend, especially on Sunday)
> 
> Next item to reinstall/restore : git.centos.org
> 

https://git.centos.org is now also fully redeployed from scratch on a 
different hypervisor, reconfigured fully by ansible and data restored 
from backup (that's the step that needed more time as I had to restore 
~1TiB of data from remote backup server to local pagure instance)

What I (quickly) tried after service was restored :
- git pull from various repositories
- git commit and push to one specific branch (test only)
- verified mqtt notifications were also working
- push a random file to lookaside cache (which exercises the fasjson api 
call verifying that I was allowed to push to a specific sig-infra branch)

Everything seems to work, but here is some useful information: as we 
fully redeployed the machine, sshd_host_key changed and can be viewed 
through the web ui : https://git.centos.org/ssh_info

Also worth knowing: if you trust our CA, you shouldn't need to worry 
about the key change, as the new sshd_host_key is also signed by the same CA.

That just means that you should have this entry (as a single line) in 
your ~/.ssh/known_hosts :

@cert-authority *.centos.org ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDXmhva/yVOS6y/sR1Pjd+Gflzkl7azfl3ZIhex5kSHilUjT3DSjfXK0TgSHT93BCKs1/mT84ZKv6s+Ulfc3kC9aykJQnkWJ6I6CjIgfIM547VT2Egx5fKJZ/7yRedYf6HoVPZSAW5WYKZ0fq/DDoAFUuZJkkp3QEzh6TUiXif9qjCu3liXNgkS2uVIWc7+1QTLRxqU3/MCD1YxuOL8ShyMSHlGJTRMMTYq6aAFmlQ/FsA8deb9HeR3PaAZx7Q7jqmiJD5cx9XtrmgM4CCZNFxP9i0s+L7yDKzFQ1ecm1/vzouOsAVcSh7MiAexuBLgbUdhmBDGVEJYQDNENKOdaoiP
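
For anyone who prefers to script this, here is a minimal Python sketch 
(not something used on the CentOS infra side) that adds a 
@cert-authority entry to known_hosts content only if it is missing. The 
function name ensure_cert_authority is hypothetical. It also collapses 
whitespace, since a mail client may have wrapped the entry over two 
lines while known_hosts needs it on a single line.

```python
def ensure_cert_authority(known_hosts: str, ca_line: str) -> str:
    """Return known_hosts text that contains ca_line exactly as one line.

    ca_line is the "@cert-authority <pattern> <keytype> <key>" entry,
    possibly wrapped over several lines by a mail client.
    """
    # Collapse any wrapping/extra spaces into single spaces, so the
    # entry ends up on a single line.
    wanted = " ".join(ca_line.split())

    # Compare against existing entries using the same normalization.
    existing = [" ".join(line.split()) for line in known_hosts.splitlines()]
    if wanted in existing:
        return known_hosts  # already trusted, nothing to change

    # Make sure the previous last line is terminated before appending.
    sep = "" if known_hosts == "" or known_hosts.endswith("\n") else "\n"
    return known_hosts + sep + wanted + "\n"
```

To use it, one would read ~/.ssh/known_hosts, pass its content together 
with the entry above, and write the result back (after keeping a copy 
of the original file).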


WRT content/git repositories: same remark as for kojihub/cbs : we 
restored from backup, so you may have to push commits again (if any) 
and/or re-upload assets to the lookaside cache if you used 
git.centos.org this Sunday


PS: I'm normally in PTO/Away/Grief mode, so not normally paying 
attention to the list or irc. If you encounter any issue due to this 
unscheduled outage, feel free to open a ticket on 
pagure.io/centos-infra/issues

Kind Regards,
-- 
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | @arrfab[@fosstodon.org]
