Hi,
Last week I had a disaster that took me a few unnerving days to repair. My main Internet-facing server is a bare-metal installation running CentOS 7. It hosts four dozen websites (or web applications) based on WordPress, Dolibarr, OwnCloud and GEPI, as well as quite a number of mail accounts for ten different domains. On Sunday afternoon this machine had a hardware failure and proved to be unrecoverable.
The good news is, I always have backups of everything. In this case, I have a dedicated backup server (in a different datacenter, in a different country). I'm using rsnapshot for incremental backups, so I had all the data: websites, mail accounts, database dumps, configurations, etc.
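For context, my rsnapshot setup is nothing exotic; the configuration is roughly along these lines (the paths, host name and retain counts here are only illustrative, and note that rsnapshot wants tabs, not spaces, between fields):

    # /etc/rsnapshot.conf (excerpt) -- fields must be TAB-separated
    snapshot_root   /backup/snapshots/
    retain          daily   7
    retain          weekly  4
    retain          monthly 6
    # pull websites, mail and config from the production host over ssh
    backup  root@server.example.org:/var/www/    server/
    backup  root@server.example.org:/var/vmail/  server/
    backup  root@server.example.org:/etc/        server/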
Now here’s the problem: it took me three and a half days of intense work to restore everything and get everything running again. Three and a half days of downtime is quite a stretch.
As far as I understand, my mistake was to use a bare-metal installation and not a virtualized solution where I could simply restore a snapshot of a VM. Correct me if I’m wrong.
Now I'm doing a lot of thinking and searching. Proxmox and Ceph look quite promising. From what I can tell, the idea is not to use one big server but a cluster of many small servers, and to aggregate them the way you would aggregate hard disks in a RAID 10 array, for example, only for the whole system. And then install one or several CentOS 7 VMs on top of this setup.
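If I read the documentation correctly, the bootstrap boils down to something like this (the cluster name, IP and network are invented, and I obviously haven't tested any of it yet):

    # on the first node
    pvecm create mycluster
    # on each additional node
    pvecm add 192.0.2.1
    # then, per node, set up Ceph and hand it a disk
    pveceph install
    pveceph init --network 192.0.2.0/24    # first node only
    pveceph mon create
    pveceph osd create /dev/sdb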
Any advice from the pros before I dive head first into the documentation?
Cheers from the sunny South of France,
Niki
On 14.03.21 at 07:13, Nicolas Kovacs wrote:
Now here’s the problem: it took me three and a half days of intense work to restore everything and get everything running again. Three and a half days of downtime is quite a stretch.
What was the real problem? Why did you need days to restore from the backups? Perhaps that is where the new solution should start.
-- Leon
On Mar 14, 2021, at 5:42 AM, Leon Fauster via CentOS centos@centos.org wrote:
On 14.03.21 at 07:13, Nicolas Kovacs wrote:
Now here’s the problem: it took me three and a half days of intense work to restore everything and get everything running again. Three and a half days of downtime is quite a stretch.
What was the real problem? Why did you need days to restore from the backups? Perhaps that is where the new solution should start.
I would second what Leon said. Even though my backup system is different (Bareos), my estimate for a full restore to a different machine would still be: installation of a new system (about 30 minutes at most), then restore of everything from the Bareos backup, which will depend on the total size of everything to restore; the bottleneck will be the 1 Gbps network connection. And I do not think my FreeBSD boxes with dozens of jails are much simpler than Nicolas's front-end machine. A restore from backup is just a restore from backup.
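A quick back-of-the-envelope, assuming the network is the only bottleneck:

    1 Gbps ~ 125 MB/s raw, call it ~110 MB/s after protocol overhead
    1 TB / 110 MB/s ~ 9,000 s, i.e. roughly 2.5 hours per terabyte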
But under some circumstances it can be even faster. I once had a quite important machine die (system board). But I had different hardware running less critical stuff, which accepted the drives from the failed machine plus the RAID card from it; after boot, the only thing that needed to be addressed was the network configuration (due to different device names). (Both boxes have an 8-port SATA/SAS backplane, and all filesystems of these machines live on hardware RAID-6…)
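On FreeBSD that device-name fix is a couple of lines in /etc/rc.conf (the interface names and address here are just an example; yours will differ):

    # old box had an Intel em(4) NIC, the new one has igb(4)
    #ifconfig_em0="inet 192.0.2.5 netmask 255.255.255.0"
    ifconfig_igb0="inet 192.0.2.5 netmask 255.255.255.0"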
As far as distributed file systems are concerned, they are nice (but with Ceph you will want all boxes to have the same amount of storage). However, it is more expensive. The cheaper guy - me - goes with hardware RAID and a spare machine (if necessary, that is: in the manner of grabbing a less important box's hardware and sticking the drives from the failed machine into it).
Virtualization: in our shop (we use FreeBSD jails) it provides more security and flexibility. As far as "disaster recovery" is concerned, using jails doesn't affect it in any way. But it often helps to avoid disasters created by sudden conflicts between packages, as only inseparable components run in the same jail; the actual server is a bunch of jails, each running one or two services, which gives extra robustness. If A depends on C, B depends on D, and C and D conflict with each other, that doesn't matter when A lives in one jail and B lives in another.
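To give an idea, a per-service jail is only a few lines in /etc/jail.conf (name, path and address invented for the example):

    # /etc/jail.conf
    web_a {
        path = "/usr/local/jails/web_a";
        host.hostname = "web-a.example.org";
        ip4.addr = "192.0.2.10";
        exec.start = "/bin/sh /etc/rc";
        exec.stop  = "/bin/sh /etc/rc.shutdown";
        mount.devfs;
    }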
One example of the flexibility I saw just the other week: I migrated a box with a couple of dozen jails (most of them independent servers with their own IPs, "virtualized" in the sense that they run as jails on some machine). Moving the whole thing at once would have meant long, noticeable downtime, but moving the jails one at a time made the downtime of each about as short as a mere reboot would cause. (In general, any sort of virtualization gives you that.)
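Per jail, the move is essentially this (assuming plain directory-backed jails; the host name is a placeholder):

    service jail stop web_a
    rsync -aH --delete /usr/local/jails/web_a/ newhost:/usr/local/jails/web_a/
    # copy the jail's stanza from /etc/jail.conf to the new host, then:
    ssh newhost service jail start web_a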
I hope this helps.
Valeri
On 14.03.21 at 07:13, Nicolas Kovacs wrote:
Now here’s the problem: it took me three and a half days of intense work to restore everything and get everything running again. Three and a half days of downtime is quite a stretch.
What was the real problem? Why did you need days to restore from the backups? Perhaps that is where the new solution should start.
I thought the same. What happened to your previous hardware?
First, using RAID (levels 1 through 6) you should not lose your storage so easily. So what can happen:
a) The hardware dies but the disks are still fine -> move the disks to new hardware and only adjust the settings for the new hardware.
b) One disk dies -> no damage, but the disk needs to be replaced.
c) The hardware dies completely, disks included -> new replacement hardware is required.
Cases a and b can usually be handled quite fast; possibly keep replacement parts ready. Case c almost never happens, really.
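For case b with Linux software RAID, for example, the replacement is a two-step affair (device names are examples):

    # mark the dead disk failed and pull it from the array
    mdadm --manage /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
    # add the new disk; the array rebuilds in the background
    mdadm --manage /dev/md0 --add /dev/sdc1
    cat /proc/mdstat    # watch the rebuild progress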
Then, why did it take so long to get up and running again?
One important thing to keep in mind is that transferring data from a backup can take a lot of time if there is a lot of data involved. Restoring multiple terabytes usually takes more time than one might expect. I, at least, tend to forget that in my daily work and assume things should go fast on modern hardware. That's not always true with today's storage sizes.
Regards, Simo
How many extra servers can you add to your setup? If I were in your shoes, I would consider building a file server/NAS with a fast connection to your server(s). Then share the data with your services over the network (NFS?), export the disks (iSCSI), or some combination of both. I hope someone will correct me if I'm wrong, but I think Postfix has issues with user accounts on NFS partitions.
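The NFS side of that is a one-line export on the NAS (path and network are placeholders):

    # /etc/exports on the NAS
    /srv/data  192.0.2.0/24(rw,sync,no_subtree_check)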
The next step is building your web/mail/etc. servers -- be they VMs or all on the same bare metal -- as thin as possible, so you can rebuild them quickly (Ansible?), mount the data shares, and off you go. If you go the VM route, you could either save snapshots or build one of those two-server setups where, if one goes boink, the other takes over. This is also good for upgrading one of the VM servers: do them on different days so you can see if there are problems.
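The Ansible play for such a thin rebuild can be very small, something like this (the package list and NFS source are invented for the example):

    # rebuild.yml -- minimal sketch of a "thin server" play
    - hosts: web
      become: true
      tasks:
        - name: Install the base stack
          yum:
            name: [httpd, php, mariadb-server]
            state: present
        - name: Mount the shared data from the NAS
          mount:
            path: /srv/data
            src: nas.example.org:/srv/data
            fstype: nfs
            state: mounted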
If you cannot have more than one server, do run VMs, and then put them on a second set of disks, so that if something happens to the boot disk you can recover.