[CentOS] Replacing SW RAID-1 with SSD RAID-1

Tue Nov 24 16:50:38 UTC 2020
Stephen John Smoogen <smooge at gmail.com>

On Tue, 24 Nov 2020 at 02:20, Simon Matter <simon.matter at invoca.ch> wrote:

> > On 23/11/2020 17:16, Ralf Prengel wrote:
> >> Backup!!!!!!!!
> >>
> >> Sent from my iPhone
> >
> > You do have a recent backup available anyway, don't you? That is: even
> > without planning to replace disks. And testing such strategies/sequences
> > using loopback devices is definitely a good idea to get used to the
> > machinery...
> >
> > On a side note: I have had a fair number of drives die on me during a
> > RAID rebuild, so I would try to avoid (if at all possible) deliberately
> > reducing redundancy just for a drive swap. I have never had a problem
> > (yet) due to a problem with the RAID-1 kernel code itself. And: if you
> > have to change a disk because it already has issues, it may be dangerous
> > to do a backup, especially if you do file-based backups, because the
> > random access pattern may make things worse. Been there, done that...
>
> Sure, and for large disks I go even further: don't put the whole disk into
> one RAID device but build multiple segments, e.g. create six partitions of
> the same size on each disk and build six RAID1s out of them. So, if there
> is an issue on one disk in one segment, you don't lose redundancy for the
> whole big disk. You can even keep spare segments on separate disks to help
> in cases where you cannot quickly replace a broken disk. The whole handling
> is still very easy with LVM on top.
>
>
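
In concrete terms, a segmented layout like the one Simon describes might
look roughly like this (an untested sketch; /dev/sdX, /dev/sdY, the
segment size and the volume group name are only placeholders, not from
the original posts):

  # partition both disks into six equal segments, type fd00 (Linux RAID)
  for d in /dev/sdX /dev/sdY; do
      sgdisk --zap-all "$d"
      for i in $(seq 1 6); do
          sgdisk -n ${i}:0:+2T -t ${i}:fd00 "$d"
      done
  done

  # one RAID1 per partition pair: six small mirrors instead of one big one
  for i in $(seq 1 6); do
      mdadm --create /dev/md$i --level=1 --raid-devices=2 \
          /dev/sdX$i /dev/sdY$i
  done

  # LVM on top glues the segments back into a single pool
  pvcreate /dev/md{1..6}
  vgcreate vg_data /dev/md{1..6}
  lvcreate -n lv_data -l 100%FREE vg_data

A failed segment then only degrades the one small mirror it belongs to,
and a spare partition kept on a third disk can be attached to just that
array with mdadm --add.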
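
And for trying such sequences out first, the loopback approach mentioned
in the first quoted message is easy to script; something along these
lines gives a throwaway sandbox (file names, sizes and the md number are
arbitrary):

  # two sparse files standing in for disks
  truncate -s 1G /tmp/disk0.img /tmp/disk1.img
  LOOP0=$(losetup --find --show /tmp/disk0.img)
  LOOP1=$(losetup --find --show /tmp/disk1.img)

  # build a RAID1, then practise failing, removing and re-adding a member
  mdadm --create /dev/md100 --level=1 --raid-devices=2 "$LOOP0" "$LOOP1"
  mdadm /dev/md100 --fail "$LOOP1"
  mdadm /dev/md100 --remove "$LOOP1"
  mdadm /dev/md100 --add "$LOOP1"
  cat /proc/mdstat

  # tear everything down again
  mdadm --stop /dev/md100
  losetup -d "$LOOP0" "$LOOP1"
  rm /tmp/disk0.img /tmp/disk1.img
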
I used to do something like what Simon describes (though there isn't
enough detail in the above for me to be sure we are talking about the
same thing). On older disks, having the RAID split over four disks with
separate arrays for /, /var, /usr and /home gave longer-lasting
redundancy: drive 1 could have a 'failed' /usr while drives 0, 2, 3 and
4 were OK, and the rest kept running in full redundancy because /, /var
and /home were all still good. That worked because most of the data for
/usr sat in a contiguous run on each disk. The problem is that a lot of
modern disks do not guarantee that the data for any partition is
actually stored next to each other on the disk. Even before SSDs did
this for wear leveling, many drives did it because it was easier to let
the firmware running on the drive's ARM chip handle the whole 'map the
sector the user asked for to a sector on the media' translation in
whatever way makes sense for the magnetic media inside. There is also a
lot of silent rewriting going on: the real capacity of a drive can be
10-20% larger than advertised, and those spare sectors are gradually
used up as failures happen in other areas. By the time you start seeing
errors, the drive has run out of spare sectors and has probably
scattered /usr all over the disk in order to keep itself going as long
as it could; the rest of the partitions will start failing very quickly
afterwards.
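
One way to get an early warning of that point is to watch the SMART
reallocation counters with smartmontools (attribute names vary by
vendor, and /dev/sda below is just a stand-in device name):

  smartctl -H /dev/sda
  smartctl -A /dev/sda | grep -Ei 'realloc|pending|uncorrect'

Once Reallocated_Sector_Ct or Current_Pending_Sector starts climbing,
the spare area is being eaten and it is time to plan the swap.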

Not all disks do this, but a good many of them do, from commercial SAS to
commodity SATA, and a lot of the 'Red' and 'Black' NAS drives do it as
well.

While I still use partition segments to spread things out, I no longer do
so for failure handling. And if what I was doing isn't what the original
poster meant, I look forward to learning about it.





> Regards,
> Simon
>
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> https://lists.centos.org/mailman/listinfo/centos
>


-- 
Stephen J Smoogen.