[CentOS] Replacing SW RAID-1 with SSD RAID-1

Tue Nov 24 19:15:12 UTC 2020
Valeri Galtsev <galtsev at kicp.uchicago.edu>


On 11/24/20 12:44 PM, Simon Matter wrote:
>>
>>
>> On 11/24/20 11:05 AM, Simon Matter wrote:
>>>>
>>>>
>>>> On 11/24/20 1:20 AM, Simon Matter wrote:
>>>>>> On 23/11/2020 17:16, Ralf Prengel wrote:
>>>>>>> Backup!!!!!!!!
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>
>>>>>> You do have a recent backup available anyway, don't you? That is:
>>>>>> even without planning to replace disks. And testing such
>>>>>> strategies/sequences using loopback devices is definitely a good
>>>>>> idea to get used to the machinery...
>>>>>>
>>>>>> On a side note: I have had a fair number of drives die on me during
>>>>>> a RAID rebuild, so I would try to avoid (if at all possible)
>>>>>> deliberately reducing redundancy just for a drive swap. I have never
>>>>>> had a problem (yet) due to a problem with the RAID-1 kernel code
>>>>>> itself. And: if you have to change a disk because it already has
>>>>>> issues, it may be dangerous to do a backup - especially a file-based
>>>>>> backup - because the random access pattern may make things worse.
>>>>>> Been there, done that...
>>>>>
>>>>> Sure, and for large disks I even go further: don't put the whole
>>>>> disk into one RAID device but build multiple segments, like creating
>>>>> 6 partitions of the same size on each disk and building six RAID1s
>>>>> out of them.
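
For readers who want to picture the layout described above: it would be
built roughly like this (a sketch only - the device names /dev/sda and
/dev/sdb and the six-partition split are illustrative, not taken from any
actual machine in this thread):

    # two disks, each already split into six equal partitions;
    # each pair of matching partitions becomes its own RAID1
    for i in 1 2 3 4 5 6; do
        mdadm --create /dev/md$((i-1)) --level=1 --raid-devices=2 \
            /dev/sda$i /dev/sdb$i
    done
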
>>>>
>>>> Oh, boy, what a mess this will create! I have inherited a machine
>>>> which was set up by someone with software RAID like that. When you
>>>> need to replace one drive, the other RAIDs in which that drive's
>>>> other partitions participate are affected too.
>>>>
>>>> Now imagine that at some moment you have several RAIDs, each of them
>>>> no longer redundant, but in each it is a partition from a different
>>>> drive that has been kicked out. Now you are stuck, unable to remove
>>>> any of the failed drives: removing each one will trash one or another
>>>> RAID (which is already not redundant). I guess the guy who left me
>>>> with this setup listened to advice like the one you just gave. What a
>>>> pain it is to deal with any drive failure on this machine!
>>>>
>>>> It has been known forever: the most robust setup is the simplest one.
>>>
>>> I understand that; I also like keeping things simple (KISS).
>>>
>>> Now, in my own experience with these multi-terabyte drives today, in
>>> 95% of the cases where you get a problem it is with a single block
>>> which cannot be read cleanly. A single write to the sector makes the
>>> drive remap it and the problem is solved. That's where a simple resync
>>> of the affected RAID segment is the fix. If a drive happens to produce
>>> such a condition once a year, there is absolutely no reason to replace
>>> the drive; just trigger the remapping of the bad sector and the drive
>>> will remember it in its internal bad sector map. This happens all the
>>> time without giving an error to the OS level, as long as the drive can
>>> still read and reconstruct the correct data.
>>>
>>> In the 5% of cases where a drive really fails completely and needs
>>> replacement, you have to resync the 10 RAID segments, yes. I usually
>>> do it with a small script and it doesn't take more than a few minutes.
>>>
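
To make the "simple resync of the affected segment" concrete: with Linux
md, the usual way to force a rewrite of a single unreadable block is to
kick off a repair pass on just the affected array. A sketch, with /dev/md3
standing in for whichever segment is affected (the name is hypothetical):

    echo repair > /sys/block/md3/md/sync_action
    cat /proc/mdstat     # watch the repair run over this segment only

During the pass, md rewrites any sector it cannot read from the mirror
copy, which is what makes the drive remap it internally.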
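
And the "small script" for a complete drive replacement can be as little
as the loop below. This is a sketch under assumptions not stated in the
thread: the failed disk was /dev/sdb, the replacement has been partitioned
identically to the surviving /dev/sda (e.g. sfdisk -d /dev/sda | sfdisk
/dev/sdb), and the segments are /dev/md0 through /dev/md5:

    for i in 0 1 2 3 4 5; do
        mdadm --manage /dev/md$i --add /dev/sdb$((i+1))
    done
    cat /proc/mdstat     # the segments then resync one after another
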
>>
>> It is one story if you administer one home server. It is quite
>> different if you administer a couple of hundred of them, like I do.
>> Just 2-3 machines set up in the disastrous manner I just described each
>> suck 10-20 times more of my time than any other machine - the ones I
>> chose the hardware for and set up myself. Administer a couple of
>> hundred of them, and then you are entitled to say what I said.
> 
> Your assumptions about my work environment are quite wrong.

Great, then you are much mightier than I am at quickly managing something 
set up in a very sophisticated way. That is amazing: managing sophisticated 
things as fast as simple, straightforward ones ;-)

I also noticed one more sophistication of yours: you always strip off the 
name of the poster you reply to.  ;-)

> 
>>
>> Hence the attitude.
>>
>> Keep things simple, so they do not suck up your time - if you do it
>> for a living.
>>
>> But if it is a hobby of yours - one that takes all your time and gives
>> you pleasure just fiddling with it - then it's your time and your
>> pleasure; do it the way that gets you more of it ;-)
> 
> It was a hobby 35 years ago, coding in assembler and designing PCBs for
> computer extensions.

Oh, great, we are of the same kind. I designed electronics and made PCBs 
both as a hobby and for a living, and I still do it as a hobby. I also did 
programming both as a hobby and for a living. The funniest was: for a 
single-board Z-80 based computer I wrote an assembler, a disassembler, and 
an emulator (which emulated what that Z-80 would do running some program). 
I did it on a Wang 2200 (actually a replica of one), and I programmed it, 
believe it or not, in BASIC. That was the only language available to us on 
that machine - an ugly, simple interpreted language with all variables 
global...

But now I'm a sysadmin. And - for me at least - the simplest possible 
setup is the one that will be most robust. It will also be the easiest and 
fastest to maintain (both for me and for anyone who steps in to do it 
instead of me).

Valeri

> 
> Simon
> 
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> https://lists.centos.org/mailman/listinfo/centos
> 

-- 
++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++