This evening (Sunday), I got a Zabbix notification that some services hosted on the same hypervisor were down. A quick investigation showed that, despite the server running on a hardware RAID controller, its firmware confirms data loss and corruption.
Although I'm normally on PTO, I still wanted to restore services quickly, so I'm working on redeploying the services from scratch and restoring data from the last backup. I hope to have news soon ...
On 03/03/2024 19:48, Fabian Arrotin wrote:
[...]
Status update: the cbs.centos.org kojihub was fully reinstalled from scratch on a different hypervisor, reconfigured by Ansible, and its DB restored from the backup taken earlier today.
I quickly checked and all operations seem to be working fine. The only issue you might see is if you submitted a build today *after* the PostgreSQL backup operation took place; if that's the case, consider rebuilding your RPM (but it's usually quiet during the weekend, especially on Sunday).
Next item to reinstall/restore: git.centos.org
On 03/03/2024 20:27, Fabian Arrotin wrote:
[...]
https://git.centos.org is now also fully redeployed from scratch on a different hypervisor, fully reconfigured by Ansible, with data restored from backup (that step took the most time, as I had to restore ~1TiB of data from the remote backup server to the local pagure instance).
What I (quickly) tried after the service was restored:
- git pull from various repositories
- git commit and push to one specific branch (test only)
- verified that MQTT notifications were also working
- pushed a random file to the lookaside cache (this exercises the identified fasjson API call that verifies whether I was allowed to push to a specific sig-infra branch)
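For anyone who wants to repeat this kind of post-restore check from their own machine, here is a minimal sketch, not the procedure actually used above. It only runs a list of commands and records which ones exit with status 0. The function name `run_checks` and the example command in the comment are assumptions made for illustration; `git ls-remote` is swapped in as a read-only stand-in for the `git pull` test, and the repository path is a placeholder you would need to fill in.

```python
import subprocess
from typing import Sequence


def run_checks(
    checks: Sequence[tuple[str, Sequence[str]]],
    timeout: float = 60.0,
) -> dict[str, bool]:
    """Run each (name, command) pair and record whether it exited with status 0.

    A command that cannot be started or that exceeds the timeout counts as failed.
    """
    results: dict[str, bool] = {}
    for name, cmd in checks:
        try:
            proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
            results[name] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            results[name] = False
    return results


# Hypothetical usage (the repository path is a placeholder, not taken from this thread):
#   run_checks([("ls-remote", ["git", "ls-remote", "https://git.centos.org/<namespace>/<repo>.git"])])
```

Write operations (commit/push, lookaside upload) are deliberately left out of the sketch, since they depend on your own credentials and branch permissions.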
Everything seems to work, but here is some useful information: as we fully redeployed the machine, the sshd host key changed. The new one can be viewed through the web UI: https://git.centos.org/ssh_info
Also worth knowing: if you trust our CA, you shouldn't need to worry about the key change, as the new sshd host key is also signed by the same CA.
That just means that you should have this line in your ~/.ssh/known_hosts:
@cert-authority *.centos.org ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDXmhva/yVOS6y/sR1Pjd+Gflzkl7azfl3ZIhex5kSHilUjT3DSjfXK0TgSHT93BCKs1/mT84ZKv6s+Ulfc3kC9aykJQnkWJ6I6CjIgfIM547VT2Egx5fKJZ/7yRedYf6HoVPZSAW5WYKZ0fq/DDoAFUuZJkkp3QEzh6TUiXif9qjCu3liXNgkS2uVIWc7+1QTLRxqU3/MCD1YxuOL8ShyMSHlGJTRMMTYq6aAFmlQ/FsA8deb9HeR3PaAZx7Q7jqmiJD5cx9XtrmgM4CCZNFxP9i0s+L7yDKzFQ1ecm1/vzouOsAVcSh7MiAexuBLgbUdhmBDGVEJYQDNENKOdaoiP
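If you are not sure whether you already trust the CA, a small sketch like the following can tell you whether your known_hosts text contains a `@cert-authority` entry whose host pattern covers a given hostname. The function name `has_cert_authority` is made up for this example. It approximates OpenSSH pattern matching with Python's `fnmatch` (which also accepts `[...]` character classes that OpenSSH does not), it ignores hashed host entries, and it only checks that a matching CA line is present, not that the key on that line is the one shown above.

```python
from fnmatch import fnmatch


def has_cert_authority(known_hosts_text: str, hostname: str) -> bool:
    """Return True if a @cert-authority line has a host pattern matching hostname."""
    for raw in known_hosts_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # Expected layout: @cert-authority <patterns> <keytype> <base64 key> [comment]
        if len(fields) < 4 or fields[0] != "@cert-authority":
            continue
        patterns = fields[1].split(",")
        negated = [p[1:] for p in patterns if p.startswith("!")]
        positive = [p for p in patterns if not p.startswith("!")]
        if any(fnmatch(hostname, p) for p in negated):
            continue  # a matching negated pattern excludes this line
        if any(fnmatch(hostname, p) for p in positive):
            return True
    return False


# Hypothetical usage:
#   from pathlib import Path
#   text = (Path.home() / ".ssh" / "known_hosts").read_text()
#   print(has_cert_authority(text, "git.centos.org"))
```

If it reports no matching entry, you can either add the line above or compare the new host key against https://git.centos.org/ssh_info before accepting it.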
WRT content/git repositories: same remark as for kojihub/cbs. We restored from backup, so you may have to push commits (if any) and/or lookaside cache assets again if you used git.centos.org this Sunday.
PS: I'm normally on PTO/Away/Grief mode, so I'm not paying attention to the list or IRC. If you encounter any issue due to this unscheduled outage, feel free to open a ticket on pagure.io/centos-infra/issues
Kind Regards,
Great work, thanks Fabian for fixing that over the weekend!
On Sun, Mar 3, 2024 at 10:47 PM Fabian Arrotin arrfab@centos.org wrote:
[...]
Fabian Arrotin The CentOS Project | https://www.centos.org gpg key: 17F3B7A1 | @arrfab[@fosstodon.org]
CentOS-devel mailing list CentOS-devel@centos.org https://lists.centos.org/mailman/listinfo/centos-devel
Thank you so much Fabian for doing all that while on PTO!
Amy
*Amy Marrich*
She/Her/Hers
Principal Technical Marketing Manager - Cloud Platforms
Red Hat, Inc https://www.redhat.com/
amy@redhat.com
Mobile: 954-818-0514
Slack: amarrich
IRC: spotz
On Sun, Mar 3, 2024 at 3:47 PM Fabian Arrotin arrfab@centos.org wrote:
[...]
Kudos Fabian for having taken care of this during your PTO.
On Mon, Mar 4, 2024 at 2:07 PM Amy Marrich amy@redhat.com wrote:
[...]