On Tue, 24 Nov 2020 at 02:20, Simon Matter simon.matter@invoca.ch wrote:
On 23/11/2020 17:16, Ralf Prengel wrote:
Backup!!!!!!!!
Sent from my iPhone
You do have a recent backup available anyway, don't you? That is: even without planning to replace disks. And testing such strategies/sequences using loopback devices is definitely a good idea to get used to the machinery...
On a side note: I have had a fair number of drives die on me during a RAID rebuild, so I would try to avoid (if at all possible) deliberately reducing redundancy just for a drive swap. I have never (yet) had a problem caused by the RAID-1 kernel code itself. And: if you have to change a disk because it already has issues, it may be dangerous to do a backup, especially a file-based backup, because the random access pattern may make things worse. Been there, done that...
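For anyone who wants to rehearse this on loopback devices first, here is a rough sketch that only prints the commands instead of running them. The loop device names, the /dev/md100 array name, the image size and the /tmp/raidtest directory are all made up for illustration, and the `--replace ... --with` step assumes an mdadm and kernel recent enough to support hot replacement, so check the man page on your system before relying on it.

```python
def rehearsal_commands(md="/dev/md100",
                       loops=("/dev/loop10", "/dev/loop11", "/dev/loop12"),
                       size="100M",
                       workdir="/tmp/raidtest"):
    """Return shell commands for practising a disk swap on loop devices.

    All names are hypothetical placeholders; nothing is executed here.
    """
    old, keep, new = loops
    cmds = [f"mkdir -p {workdir}"]

    # One sparse image file per fake disk, each attached to a loop device.
    for i, loop in enumerate(loops):
        img = f"{workdir}/disk{i}.img"
        cmds.append(f"truncate -s {size} {img}")
        cmds.append(f"losetup {loop} {img}")

    # Two-way mirror from the first two fake disks.
    cmds.append(f"mdadm --create {md} --level=1 --raid-devices=2 {old} {keep}")

    # Add the replacement as a spare first, then migrate onto it, so the
    # mirror keeps both copies while the new member is being built.
    cmds.append(f"mdadm {md} --add {new}")
    cmds.append(f"mdadm {md} --replace {old} --with {new}")

    # Wait for the copy to finish before pulling the old member.
    cmds.append(f"mdadm --wait {md}")
    cmds.append(f"mdadm {md} --remove {old}")
    return cmds


if __name__ == "__main__":
    for line in rehearsal_commands():
        print(line)
```

Review the printed list before pasting anything into a root shell; the point is only to get used to the order of operations on throwaway devices.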
Sure, and for large disks I go even further: don't put the whole disk into one RAID device but build multiple segments, e.g. create six partitions of the same size on each disk and build six RAID1s out of them. So, if there is an issue in one segment on one disk, you don't lose redundancy for the whole big disk. You can even keep spare segments on separate disks to help in cases where you cannot quickly replace a broken disk. The whole handling is still very easy with LVM on top.
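As a rough sketch of the segmented layout described above, the following prints the commands for six RAID1 arrays with LVM on top. The disk names /dev/sda and /dev/sdb, the md numbering and the volume group name vg_data are assumptions for illustration, and the partitions are assumed to exist already with equal sizes on both disks.

```python
def segmented_raid1_plan(disk_a="/dev/sda", disk_b="/dev/sdb",
                         segments=6, vg="vg_data"):
    """Return commands that build one RAID1 per partition pair, then LVM.

    Names are hypothetical; partition N of each disk is assumed to be
    called <disk>N (true for sdX devices, not for e.g. NVMe naming).
    """
    cmds = []
    arrays = []

    # One small mirror per segment instead of one mirror for the whole disk.
    for n in range(1, segments + 1):
        md = f"/dev/md{n}"
        cmds.append(
            f"mdadm --create {md} --level=1 --raid-devices=2 "
            f"{disk_a}{n} {disk_b}{n}"
        )
        arrays.append(md)

    # Each mirror becomes an LVM physical volume ...
    cmds += [f"pvcreate {md}" for md in arrays]

    # ... and all of them are pooled into a single volume group.
    cmds.append(f"vgcreate {vg} " + " ".join(arrays))
    return cmds


if __name__ == "__main__":
    for line in segmented_raid1_plan():
        print(line)
```

With this layout, a failure in one partition only degrades that one small array, while the volume group hides the segmentation from the filesystems above it.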
I used to do something like this (but because there isn't enough detail in the above, I am not sure we are talking about the same thing). On older disks, having RAID split over 4 disks with /, /var, /usr and /home allowed for longer redundancy, because drive 1 could have a 'failed' /usr while the other drives were OK, and the rest all worked in full mode because /, /var and /home were all good. This was because most of the data on /usr would be in a straight run on each disk.
The problem is that a lot of modern disks do not guarantee that the data for any partition will really be next to each other on the disk. Even before SSDs did this for wear leveling, a lot of disks did it because it was easier to let the full OS running on the ARM chip in the drive handle all the 'map this sector the user wants to this sector on the disk' work, using whatever logic makes sense for the type of magnetic media inside. There is also a lot of silent rewriting going on inside the disks: the real capacity of a drive can be 10-20% bigger than advertised, with those extra sectors slowly used up as failures happen in other areas. When you start seeing errors, it means that the drive no longer has any safe sectors and has probably written /usr all over the disk in order to keep it going as long as it could. The rest of the partitions will start failing very quickly afterwards.
Not all disks do this, but a good many of them do, from commercial SAS to commodity SATA, and a lot of the 'Red' and 'Black' NAS drives are doing this as well.
While I still use partition segments to spread things out, I no longer do so for failure handling. If what I was doing isn't what the original poster meant, I look forward to learning more.
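To make the remapping argument concrete, here is a toy model, not a description of any real firmware: the class, its fields and the sector counts are invented for illustration. It shows why the first errors the host sees can mean the spare pool is already used up.

```python
class ToyDrive:
    """Toy model of a drive that silently remaps bad sectors to spares."""

    def __init__(self, logical_sectors, spare_sectors):
        # Initially each logical sector sits on the same-numbered physical one.
        self.mapping = {lba: lba for lba in range(logical_sectors)}
        # Hidden extra capacity beyond what the host can address.
        self.spares = list(range(logical_sectors,
                                 logical_sectors + spare_sectors))
        self.visible_errors = 0

    def physical_sector_fails(self, lba):
        """The media under logical sector `lba` goes bad.

        Returns True if the failure was hidden by remapping, False if the
        host gets to see an error because no spare sectors are left.
        """
        if self.spares:
            self.mapping[lba] = self.spares.pop(0)
            return True
        self.visible_errors += 1
        return False


if __name__ == "__main__":
    drive = ToyDrive(logical_sectors=100, spare_sectors=3)
    for lba in (10, 11, 12, 50):
        hidden = drive.physical_sector_fails(lba)
        print(lba, "remapped silently" if hidden else "ERROR visible to host")
```

In this model the first three failures are invisible and scatter those logical sectors into the spare area, so a partition is no longer one contiguous run; the fourth failure is the first one the host notices.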
Regards, Simon
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos