[CentOS-devel] major infra issue : impacting git.centos.org and cbs.centos.org

Mon Mar 4 13:06:34 UTC 2024
Amy Marrich <amy at redhat.com>

Thank you so much Fabian for doing all that while on PTO!

Amy

*Amy Marrich*

She/Her/Hers

Principal Technical Marketing Manager - Cloud Platforms

Red Hat, Inc <https://www.redhat.com/>

amy at redhat.com

Mobile: 954-818-0514

Slack:  amarrich

IRC: spotz


On Sun, Mar 3, 2024 at 3:47 PM Fabian Arrotin <arrfab at centos.org> wrote:

> On 03/03/2024 20:27, Fabian Arrotin wrote:
> > On 03/03/2024 19:48, Fabian Arrotin wrote:
> >> This evening (Sunday), I got a Zabbix notification that some services
> >> hosted on the same hypervisor were down.
> >> A quick investigation showed that, despite the server running on a
> >> hardware RAID controller, its firmware confirmed data loss and corruption.
> >>
> >> Although I'm normally on PTO, I still wanted to restore services, so
> >> I'm quickly working on redeploying the services from scratch and
> >> restoring data from the last backup. I hope to have news soon ...
> >>
> >
> > Status update: the cbs.centos.org kojihub was fully reinstalled from
> > scratch on a different hypervisor and reconfigured by Ansible, and its
> > DB was restored from the backup taken earlier today.
> >
> > I checked quickly and all operations seem to be working fine.
> > The only issue you might see is if you submitted a build today *after*
> > the PostgreSQL backup took place; in that case, consider rebuilding
> > your rpm (but it's usually quiet during the weekend, especially on
> > Sunday).
> >
> > Next item to reinstall/restore : git.centos.org
> >
>
> https://git.centos.org is now also fully redeployed from scratch on a
> different hypervisor, fully reconfigured by Ansible, with data restored
> from backup (that step took the most time, as I had to restore ~1TiB of
> data from the remote backup server to the local pagure instance).
>
> What I (quickly) tried after the service was restored:
> - git pull from various repositories
> - git commit and push to one specific branch (test only)
> - verified mqtt notifications were also working
> - push a random file to lookaside cache (testing identified fasjson api
> call to verify if I was allowed to push to a specific sig-infra branch)
>
> Everything seems to work, but here is some useful information: as we
> fully redeployed the machine, the sshd_host_key changed. The new one can
> be viewed through the web UI: https://git.centos.org/ssh_info
>
> It is also worth knowing that if you trust our CA, you shouldn't need to
> worry about the key change, as the new sshd_host_key is signed by the
> same CA.
>
> That just means that you should have this entry in your ~/.ssh/known_hosts:
>
> @cert-authority *.centos.org ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDXmhva/yVOS6y/sR1Pjd+Gflzkl7azfl3ZIhex5kSHilUjT3DSjfXK0TgSHT93BCKs1/mT84ZKv6s+Ulfc3kC9aykJQnkWJ6I6CjIgfIM547VT2Egx5fKJZ/7yRedYf6HoVPZSAW5WYKZ0fq/DDoAFUuZJkkp3QEzh6TUiXif9qjCu3liXNgkS2uVIWc7+1QTLRxqU3/MCD1YxuOL8ShyMSHlGJTRMMTYq6aAFmlQ/FsA8deb9HeR3PaAZx7Q7jqmiJD5cx9XtrmgM4CCZNFxP9i0s+L7yDKzFQ1ecm1/vzouOsAVcSh7MiAexuBLgbUdhmBDGVEJYQDNENKOdaoiP
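
For anyone who wants to script this, here is a minimal sketch (not from Fabian's mail) of adding the @cert-authority entry to known_hosts only when it is missing. The function names ensure_cert_authority and update_file are hypothetical, and the entry must be passed as one single line (marker, host pattern, key type, key), since known_hosts does not accept an entry wrapped across lines.

```python
from pathlib import Path


def ensure_cert_authority(known_hosts_text: str, ca_line: str) -> str:
    """Return known_hosts_text with ca_line appended unless already present.

    Lines are compared field by field, so differences in spacing between
    the marker, host pattern, key type and key are ignored.
    """
    wanted = ca_line.split()
    for line in known_hosts_text.splitlines():
        if line.split() == wanted:
            return known_hosts_text  # entry already there, nothing to do
    # Make sure the existing content ends with a newline before appending.
    if known_hosts_text and not known_hosts_text.endswith("\n"):
        known_hosts_text += "\n"
    return known_hosts_text + " ".join(wanted) + "\n"


def update_file(path: Path, ca_line: str) -> None:
    """Apply ensure_cert_authority to a known_hosts file on disk."""
    current = path.read_text() if path.exists() else ""
    updated = ensure_cert_authority(current, ca_line)
    if updated != current:
        path.write_text(updated)
```

Usage would look like update_file(Path.home() / ".ssh" / "known_hosts", ca_line), where ca_line is the full "@cert-authority *.centos.org ssh-rsa ..." entry copied from the message above. Running it twice leaves the file unchanged the second time.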
>
>
> WRT content/git repositories: same remark as for kojihub/cbs. We
> restored from backup, so if you used git.centos.org this Sunday you may
> have to push your commits (if any) again and/or re-upload assets to the
> lookaside cache.
>
>
> PS: I'm normally on PTO/Away/Grief mode, so I'm not paying much
> attention to the list or IRC. If you encounter any issue due to this
> unscheduled outage, feel free to open a ticket on
> pagure.io/centos-infra/issues
>
> Kind Regards,
> --
> Fabian Arrotin
> The CentOS Project | https://www.centos.org
> gpg key: 17F3B7A1 | @arrfab[@fosstodon.org]
>
> _______________________________________________
> CentOS-devel mailing list
> CentOS-devel at centos.org
> https://lists.centos.org/mailman/listinfo/centos-devel
>