> On 11/24/20 11:05 AM, Simon Matter wrote:
>>>
>>> On 11/24/20 1:20 AM, Simon Matter wrote:
>>>>> On 23/11/2020 17:16, Ralf Prengel wrote:
>>>>>> Backup!!!!!!!!
>>>>>>
>>>>>> Sent from my iPhone
>>>>>
>>>>> You do have a recent backup available anyway, don't you? That is:
>>>>> even without planning to replace disks. And testing such
>>>>> strategies/sequences using loopback devices is definitely a good
>>>>> idea to get used to the machinery...
>>>>>
>>>>> On a side note: I have had a fair number of drives die on me during
>>>>> a RAID rebuild, so I would try to avoid (if at all possible)
>>>>> deliberately reducing redundancy just for a drive swap. I have
>>>>> never (yet) had a problem caused by the RAID-1 kernel code itself.
>>>>> And: if you have to change a disk because it already has issues, it
>>>>> may be dangerous to do a backup first - especially a file-based
>>>>> backup - because the random access pattern may make things worse.
>>>>> Been there, done that...
>>>>
>>>> Sure, and for large disks I even go further: don't put the whole
>>>> disk into one RAID device but build multiple segments, e.g. create
>>>> six partitions of the same size on each disk and build six RAID1s
>>>> out of them.
>>>
>>> Oh boy, what a mess this will create! I have inherited a machine
>>> which was set up by someone with software RAID like that. When you
>>> need to replace one drive, the other RAIDs in which that drive's
>>> other partitions participate are affected too.
>>>
>>> Now imagine that at some moment you have several RAIDs, each of them
>>> no longer redundant, but in each it is a partition from a different
>>> drive that has been kicked out. Now you are stuck, unable to remove
>>> any of the failed drives: removing any one of them will trash one or
>>> another RAID (which is already non-redundant). I guess the guy who
>>> left me with this setup listened to advice like what you just gave.
>>> What a pain it is to deal with any drive failure on this machine!!
>>>
>>> It is known since forever: the most robust setup is the simplest one.
>>
>> I understand that, I also like keeping things simple (KISS).
>>
>> Now, in my own experience with these multi-terabyte drives today, in
>> 95% of the cases where you get a problem it is a single block which
>> cannot be read cleanly. A single write to the sector makes the drive
>> remap it and the problem is solved. That's where a simple resync of
>> the affected RAID segment is the fix. If a drive happens to produce
>> such a condition once a year, there is absolutely no reason to replace
>> the drive; just trigger the remapping of the bad sector and the drive
>> will remember it in its internal bad sector map. This happens all the
>> time without ever reporting an error to the OS, as long as the drive
>> can still read and reconstruct the correct data.
>>
>> In the 5% of cases where a drive really fails completely and needs
>> replacement, you have to resync the 10 RAID segments, yes. I usually
>> do it with a small script and it doesn't take more than a few minutes.
>
> It is one story if you administer one home server. It is quite a
> different story if you administer a couple of hundred of them, like I
> do. And just 2-3 machines set up in such a disastrous manner as I just
> described suck up 10-20 times more of my time each than any other
> machine - the ones I configured the hardware for and set up myself.
> When you have been through that, then you are entitled to say what I
> said.
Your assumptions about my work environment are quite wrong.

> Hence the attitude.
>
> Keep things simple, so they do not suck up your time - if you do it
> for a living.
>
> But if it is a hobby of yours - one that takes all your time and gives
> you pleasure just to fiddle with it - then it's your time and your
> pleasure, do it the way that gets you more of it ;-)

It was a hobby 35 years ago, coding in assembler and designing PCBs for
computer extensions.

Simon
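
Simon's "small script" is not included in the thread; the sketch below is
only an illustration of the idea, not his actual script. It assumes six
RAID1 segments, a replacement disk /dev/sdb that has already been
partitioned to match the surviving disk (e.g. sfdisk -d /dev/sda | sfdisk
/dev/sdb), and the common pairing of partition N with /dev/md(N-1); all of
these names and numbers are assumptions, so adjust them to the actual
layout. By default it only prints the mdadm commands (dry run); md itself
then resyncs the re-added segments, delaying arrays that share the same
physical disks. The --repair mode covers the single-bad-block case Simon
describes, where a scrub of just one segment is enough.

#!/usr/bin/env python3
"""Re-add or scrub md RAID1 segments - a hedged sketch, not Simon's script."""

import argparse
import subprocess

SEGMENTS = 6           # RAID1 segments per disk (assumption)
NEW_DISK = "/dev/sdb"  # the freshly replaced disk (assumption)

def add_partitions(dry_run):
    """Re-add each partition of NEW_DISK to its RAID1 segment."""
    for i in range(SEGMENTS):
        cmd = ["mdadm", "--manage", f"/dev/md{i}", "--add", f"{NEW_DISK}{i + 1}"]
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

def trigger_repair(md_index, dry_run):
    """Ask md to scrub one segment after a single unreadable block.

    Writing "repair" to sync_action makes md rewrite bad sectors from the
    good mirror, which in turn lets the drive remap them internally.
    """
    path = f"/sys/block/md{md_index}/md/sync_action"
    print(f"echo repair > {path}")
    if not dry_run:
        with open(path, "w") as f:
            f.write("repair\n")

if __name__ == "__main__":
    p = argparse.ArgumentParser(description="re-add or scrub RAID1 segments")
    p.add_argument("--repair", type=int, metavar="N",
                   help="only trigger a repair scrub of /dev/mdN")
    p.add_argument("--run", action="store_true",
                   help="actually execute (default: dry run, print only)")
    args = p.parse_args()
    if args.repair is not None:
        trigger_repair(args.repair, dry_run=not args.run)
    else:
        add_partitions(dry_run=not args.run)

Running it without --run is safe anywhere; with --run it needs root, mdadm
and a partition layout that really matches the assumptions above. The same
commands can be rehearsed risk-free on loopback devices, as suggested
earlier in the thread.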