[CentOS] Replacing SW RAID-1 with SSD RAID-1

Tue Nov 24 18:44:31 UTC 2020
Simon Matter <simon.matter at invoca.ch>

> On 11/24/20 11:05 AM, Simon Matter wrote:
>>> On 11/24/20 1:20 AM, Simon Matter wrote:
>>>>> On 23/11/2020 17:16, Ralf Prengel wrote:
>>>>>> Backup!!!!!!!!
>>>>> You do have a recent backup available anyway, don't you? That is,
>>>>> even without planning to replace disks. And testing such
>>>>> strategies/sequences using loopback devices is definitely a good
>>>>> idea to get used to the machinery...
>>>>> On a side note: I have had a fair number of drives die on me during
>>>>> a RAID rebuild, so I would try to avoid (if at all possible)
>>>>> deliberately reducing redundancy just for a drive swap. I have never
>>>>> had a problem (yet) caused by the RAID-1 kernel code itself. And:
>>>>> if you have to change a disk because it already has issues, it may
>>>>> be dangerous to do a backup, especially a file-based backup,
>>>>> because the random access pattern may make things worse. Been there,
>>>>> done that...
>>>> Sure, and for large disks I even go further: don't put the whole disk
>>>> into one RAID device but build multiple segments, e.g. create 6
>>>> partitions of the same size on each disk and build six RAID-1s out
>>>> of them.
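
A minimal sketch of the segmented layout described above, assuming two
disks with hypothetical names /dev/sda and /dev/sdb, six equally sized
partitions numbered 1-6 on each, and arrays named /dev/md0 to /dev/md5.
None of these names come from the thread. The function only builds the
mdadm command lines; it does not run them.

```python
def raid1_create_commands(disks, segments):
    """Return one `mdadm --create` argument list per RAID-1 segment.

    disks    -- hypothetical whole-disk device names, e.g. ["/dev/sda", "/dev/sdb"]
    segments -- number of equally sized partitions on each disk
    Assumes partitions are named <disk><n> (true for sdX, not for NVMe).
    """
    commands = []
    for i in range(segments):
        # Segment i mirrors partition i+1 of every disk, e.g. sda1 + sdb1.
        members = [f"{disk}{i + 1}" for disk in disks]
        commands.append(
            ["mdadm", "--create", f"/dev/md{i}", "--level=1",
             f"--raid-devices={len(members)}", *members]
        )
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in raid1_create_commands(["/dev/sda", "/dev/sdb"], 6):
        print(" ".join(cmd))
```

Printing rather than executing keeps the sketch safe to try on any
machine; the output can be reviewed and then run by hand as root.
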
>>> Oh, boy, what a mess this will create! I have inherited a machine which
>>> was set up by someone with software RAID like that. When you need to
>>> replace one drive, the other RAIDs in which that drive's other
>>> partitions participate are affected too.
>>> Now imagine that at some moment you have several RAIDs, none of them
>>> redundant any more, and in each it is a partition from a different
>>> drive that has been kicked out. Now you are stuck, unable to remove any
>>> of the failed drives: removing any one of them will trash one or
>>> another RAID (which is already not redundant). I guess the guy who left
>>> me with this setup listened to advice like what you just gave. What a
>>> pain it is to deal with any drive failure on this machine!!
>>> It has been known forever: the most robust setup is the simplest one.
>> I understand that, I also like keeping things simple (KISS).
>> Now, in my own experience, with today's multi-terabyte drives, in 95%
>> of the cases where you get a problem it is a single block which cannot
>> be read properly. A single write to the sector makes the drive remap it
>> and the problem is solved. That's where a simple resync of the affected
>> RAID segment is the fix. If a drive happens to produce such a condition
>> once a year, there is absolutely no reason to replace the drive; just
>> trigger the remapping of the bad sector and the drive will remember it
>> in its internal bad sector map. This happens all the time without any
>> error reaching the OS level, as long as the drive can still read and
>> reconstruct the correct data.
>> In the 5% of cases where a drive really fails completely and needs
>> replacement, you have to resync the 10 RAID segments, yes. I usually do
>> it with a small script and it doesn't take more than a few minutes.
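
The "small script" itself is not shown in the thread. The following is
only a sketch of what such a helper might look like, assuming ten arrays
/dev/md0 to /dev/md9, a replacement disk with the hypothetical name
/dev/sdb that has already been partitioned like its partner, and
partitions named <disk><n>. It also assumes that the per-segment
"resync" for a single unreadable block can be requested through md's
sync_action file. Nothing is executed or written here.

```python
def readd_commands(new_disk, segments):
    """Return one `mdadm --manage ... --add` argument list per segment.

    new_disk -- hypothetical name of the replacement disk, e.g. "/dev/sdb"
    segments -- number of RAID-1 arrays the disk takes part in
    """
    return [
        ["mdadm", "--manage", f"/dev/md{i}", "--add", f"{new_disk}{i + 1}"]
        for i in range(segments)
    ]


def scrub_request(md_name, action="check"):
    """Return (sysfs path, value) that would start a scrub of one array.

    Writing "check" or "repair" to sync_action makes md read every block
    of the array; the caller decides whether to actually write the value.
    """
    if action not in ("check", "repair"):
        raise ValueError(f"unsupported action: {action}")
    return f"/sys/block/{md_name}/md/sync_action", action


if __name__ == "__main__":
    # Full drive replacement: re-add every partition of the new disk.
    for cmd in readd_commands("/dev/sdb", 10):
        print(" ".join(cmd))
    # Single bad block: scrub only the affected segment (md3 is made up).
    path, value = scrub_request("md3")
    print(f"echo {value} > {path}")
```

Scrubbing one segment touches only a fraction of the disk, which is the
benefit claimed for the segmented layout above.
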
> It is one story if you administer one home server. It is quite different
> if you administer a couple of hundred of them, like I do. And just 2-3
> machines set up in such a disastrous manner as I just described suck
> 10-20 times more of my time each compared to any other machine - the
> ones I configured hardware for myself, and set up myself. Then you are
> entitled to say what I said.

Your assumptions about my work environment are quite wrong.

> Hence the attitude.
> Keep things simple, so they do not suck your time - if you do it for a
> living.
> But if it is a hobby of yours - one that takes all your time and gives
> you pleasure just to fiddle with it - then it's your time, and your
> pleasure; do it in the way that gets you more of it ;-)

It was a hobby 35 years ago, when I was coding in assembler and designing
PCBs for computer extensions.