Hello list.
In the next few days we are going to install CentOS 7 on a new server with 4 × 3 TB SATA HDDs as RAID 5. We will use the graphical installer to install and set up the RAID.
Do I have to consider anything before installation, because the disks are very large?
Does the graphical installer use parted to set up/format the RAID?
I hope the above makes sense.
Thank you in advance.
Nikos
I have done this a couple of times successfully.
I did set the boot partitions etc as RAID1 on sda and sdb. This I believe is an old instruction and was based on the fact that the kernel needed access to these partitions before RAID access was available.
I'm sure someone more knowledgeable will be able to say whether this is still required.
Gary
At Thu, 27 Jun 2019 14:48:30 +0100 CentOS mailing list centos@centos.org wrote:
> I have done this a couple of times successfully.
> I did set the boot partitions etc as RAID1 on sda and sdb. This I believe is an old instruction and was based on the fact that the kernel needed access to these partitions before RAID access was available.
Actually *grub* needs access to /boot to load the kernel. I don't believe that grub can access (software) RAID filesystems. RAID1 is effectively an exception because it is just a mirror set and grub can [RO] access any one of the mirror set elements as a standalone disk. Note that UEFI partitions can't be RAID at all (and are FAT filesystems) and need to be accessible by the BIOS / boot EEPROM. Once the kernel starts, the raid array(s) can be started, then LVM volumes can be scanned for and set up, then the root file system mounted, and then the system is up and running -- all of that magic is handled in the initramfs.
So the rule of thumb is a "small" /boot/efi FAT file system (if using UEFI boot), a /boot mirror set, and the rest whatever RAID logic, probably with LVM on top of that. Usually one creates a UEFI partition on both (or all three or more) disks -- they can't be a mirror set, but certainly can be rsync'ed regularly. Then a smallish mirror set for /boot, then whatever is left is used for the main system filesystem: RAID whatever, etc.
> I'm sure someone more knowledgeable will be able to say whether this is still required.
Yes. See above.
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos
On 6/27/19 10:27 AM, Robert Heller wrote:
> Actually *grub* needs access to /boot to load the kernel. I don't believe that grub can access (software) RAID filesystems. RAID1 is effectively an exception because it is just a mirror set and grub can [RO] access any one of the mirror set elements as a standalone disk. Note that UEFI partitions can't be RAID at all (and are FAT filesystems) and need to be accessable by the BIOS / boot EEPROM.
/boot/efi has the same exception, as long as you use metadata format 1.0. Early versions of CentOS 7 did not allow the use of RAID 1, because it was possible in theory for the firmware or for an alternate OS that shared the EFI partition to modify the filesystem and invalidate the mirror. That restriction has been removed, and Anaconda will now allow you to create /boot/efi as a RAID1 device. This should be mostly safe as long as your Linux OS is the only OS that modifies the EFI filesystem.
Thank you all for your answers.
Nikos.
I'd isolate all that RAID stuff from your OS, so that root, /boot, /usr, /etc, /tmp, /bin and swap are on "normal" partition(s). I know I'm missing some directories, but the point is you should be able to unmount that RAID stuff to adjust it without crippling your system.
https://www.howtogeek.com/117435/htg-explains-the-linux-directory-structure-...
On Thu, 27 Jun 2019, Peda, Allan (NYC-GIS) wrote:
> I'd isolate all that RAID stuff from your OS, so the root, /boot, /usr, /etc /tmp, /bin swap are on "normal" partition(s). I know I'm missing some directories, but the point is you should be able to unmount that RAID stuff to adjust it without crippling your system.
> https://www.howtogeek.com/117435/htg-explains-the-linux-directory-structure-...
As long as you want none of the advantages of RAID to apply to your system as a whole.
jh
Which may very well be the case.
On 27.06.2019 at 15:36, Nikos Gatsis - Qbit wrote:
> In the next few days we are going to install CentOS 7 on a new server with 4 × 3 TB SATA HDDs as RAID 5. We will use the graphical installer to install and set up the RAID.
You hopefully plan to use just 3 of the disks for the RAID 5 array and the 4th as a hot spare.
> Do I have to consider anything before installation, because the disks are very large?
> Does the graphical installer use parted to set up/format the RAID?
It does. See the RHEL 7 installation documentation.
Alexander
On 6/27/19 6:36 AM, Nikos Gatsis - Qbit wrote:
> Do I have to consider anything before installation, because the disks are very large?
Probably not. You'll need to use GPT because they're large, but for a new server you probably would need to do that anyway in order to boot under UEFI.
The partition layout should be the same on all disks. /boot and /boot/efi must be either RAID1 or regular partitions, rather than RAID5.
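To put a number on "you'll need to use GPT because they're large": the classic MBR partition table stores sector addresses in 32-bit fields, so with 512-byte logical sectors it cannot address more than 2 TiB. A minimal sketch of that arithmetic follows. The function names are made up for illustration, and the 3 TB figure assumes the decimal terabytes that drive vendors use.

```shell
#!/bin/sh
# Largest size (in bytes) an MBR partition table can address:
# 2^32 sectors times the logical sector size (512 bytes assumed by default).
mbr_limit_bytes() {
    sector_size=${1:-512}
    echo $(( 4294967296 * sector_size ))
}

# Succeeds (exit status 0) if a disk of the given size is too big for MBR.
needs_gpt() {
    disk_bytes=$1
    sector_size=${2:-512}
    [ "$disk_bytes" -gt "$(mbr_limit_bytes "$sector_size")" ]
}

mbr_limit_bytes 512                  # the 2 TiB ceiling, in bytes
if needs_gpt 3000000000000; then     # a "3 TB" disk in vendor (decimal) units
    echo "3 TB disk: use GPT"
fi
```

A 2 TB disk would still fit under the limit; a 3 TB one does not, which is why the installer ends up with GPT here.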
On 27/06/2019 at 15:36, Nikos Gatsis - Qbit wrote:
> Do I have to consider anything before installation, because the disks are very large?
I'm doing this kind of installation quite regularly. Here's my two cents.
1. Use RAID6 instead of RAID5. You'll lose a little space, but you'll gain quite some redundancy.
2. The initial sync will be very (!) long, something like a day or two. You can use your server during that time, but it won't be very responsive.
3. Here's a neat little trick you can use to speed up the initial sync.
$ sudo echo 50000 > /proc/sys/dev/raid/speed_limit_min
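For a rough feel of the sync times in point 2: speed_limit_min is expressed in KiB/s per device, and it is the floor md tries to keep while other I/O competes, not a cap. The back-of-envelope sketch below divides the size of one member disk by a sustained rate. The helper name is made up, 1000 KiB/s is, as far as I know, the usual kernel default for the floor, and real disks will often go faster than either figure when the box is idle.

```shell
#!/bin/sh
# sync_hours DISK_BYTES KIB_PER_SEC
# Whole hours needed to pass once over a member disk at a sustained rate.
# Integer arithmetic only, so the result is rounded down.
sync_hours() {
    disk_bytes=$1
    kib_per_sec=$2
    echo $(( disk_bytes / (kib_per_sec * 1024) / 3600 ))
}

sync_hours 3000000000000 1000     # 3 TB disk held at a 1000 KiB/s floor
sync_hours 3000000000000 50000    # same disk with the floor raised to 50000 KiB/s
```

With a busy server pinned at the low floor the sync would take weeks (813 hours by this estimate); at 50000 KiB/s it comes down to about 16 hours, which is in line with "a day or two" once real-world load is added.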
I've written a detailed blog article about the kind of setup you want. It's in French, but the Linux bits are universal.
https://www.microlinux.fr/serveur-lan-centos-7/
Cheers,
Niki
If you can afford it I would prefer to use RAID10. You will lose half of the disk space but you will get a much faster system. It depends on what you need / what you will use the server for.
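To compare the space trade-off between the levels suggested in this thread, here is a small sketch of the usual capacity rules applied to 4 disks of 3 TB. The helper name is hypothetical, and the RAID10 line assumes the common two-copy layout.

```shell
#!/bin/sh
# usable_tb LEVEL DISKS SIZE_TB -> usable capacity in TB
usable_tb() {
    level=$1; disks=$2; size=$3
    case $level in
        5)  echo $(( (disks - 1) * size )) ;;   # one disk's worth of parity
        6)  echo $(( (disks - 2) * size )) ;;   # two disks' worth of parity
        10) echo $(( disks / 2 * size )) ;;     # every block stored twice
        *)  echo "unsupported level: $level" >&2; return 1 ;;
    esac
}

for level in 5 6 10; do
    echo "RAID $level, 4 x 3 TB: $(usable_tb "$level" 4 3) TB usable"
done
```

So with four disks, RAID6 and RAID10 both give 6 TB against 9 TB for RAID5. RAID6 survives any two disk failures, while RAID10 is only guaranteed to survive one (two if they hit different mirror pairs) but avoids parity calculations on writes.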
Mirek
On Fri, Jun 28, 2019 at 07:01:00AM +0200, Nicolas Kovacs wrote:
> - Here's a neat little trick you can use to speed up the initial sync.
> $ sudo echo 50000 > /proc/sys/dev/raid/speed_limit_min
> I've written a detailed blog article about the kind of setup you want. It's in French, but the Linux bits are universal.
You can't have actually tested these instructions if you think 'sudo echo > /path' actually works.
The idiom for this is typically:
echo 50000 | sudo tee /proc/sys/dev/raid/speed_limit_min
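The reason the redirect form fails is that the calling (unprivileged) shell opens the target file before sudo even starts, so only echo runs as root. With tee, the process that opens the file is the one running under sudo. Here is a small wrapper sketch around that idiom. The function name and its optional third argument are my own invention, added so the thing can be tried on a scratch file without root.

```shell
#!/bin/sh
# write_privileged VALUE PATH [RUNNER]
# Pipes VALUE into "RUNNER tee PATH" so that the process opening PATH is
# the privileged one. RUNNER defaults to sudo; passing "env" simply runs
# tee unprivileged, which is enough for a harmless test.
write_privileged() {
    value=$1
    path=$2
    runner=${3:-sudo}
    printf '%s\n' "$value" | "$runner" tee "$path" > /dev/null
}

# Real use (needs root, so it is left commented out here):
# write_privileged 50000 /proc/sys/dev/raid/speed_limit_min

# Harmless demonstration on a temporary file:
scratch=$(mktemp)
write_privileged 50000 "$scratch" env
cat "$scratch"
rm -f "$scratch"
```

An equivalent without tee is to run the whole redirect inside a root shell, e.g. sudo sh -c 'echo 50000 > /proc/sys/dev/raid/speed_limit_min'.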
On 28/06/2019 at 14:28, Jonathan Billings wrote:
> You can't have actually tested these instructions if you think 'sudo echo > /path' actually works.
> The idiom for this is typically:
> echo 50000 | sudo tee /proc/sys/dev/raid/speed_limit_min
My bad.
The initial article used this instruction as root. And I've replaced most of these with sudo. I've overlooked this one.
Thanks for the heads up.
Nikos Gatsis - Qbit wrote on 6/27/2019 8:36 AM:
> The next days we are going to install Centos 7 on a new server, with 4*3Tb sata hdd as raid-5. We will use the graphical interface to install and set up raid.
> Do I have to consider anything before installation, because the disks are very large?
> Does the graphical use the parted to set/format the raid?
Hi Nikos, I've read the other posts in this thread and wanted to provide my perspective. I've used Linux RAID at various times over the past 10-20 years with both desktop and server class hardware. I've also used hardware RAID controllers from 3ware, Adaptec, LSI, AMI, and others with IDE, SATA, SAS, and SCSI drives. The goal of RAID 1 and above is to increase availability. Unfortunately, I've never had Linux software RAID improve availability - it has only decreased availability for me. This has been due to a combination of hardware and software issues that are generally handled well by HW RAID controllers, but are often handled poorly or unpredictably by desktop-oriented hardware and Linux software.
Given that Linux software RAID does not achieve the goal of RAID (improved availability), my recommendation would be to avoid it. If you are looking for a backup mechanism, RAID is not it (use a backup program instead). If you do need high availability, my recommendation is to purchase an LSI based RAID controller. If you plan to use RAID 5, make sure the model you choose has a write cache (this could double the cost of the controller). Used IBM, HP, or Dell RAID controllers are available for a reasonable price or you can purchase a new one from Newegg or wherever. SAS RAID controllers will work with either SAS or SATA drives and you can purchase the appropriate breakout cables for connecting the controller to individual drives. Since you're planning on using 3TB+ drives that are likely 4k native sector, I'd recommend a newer model controller like the Dell PERC H730 (LSI MegaRAID SAS 9361-8i) for RAID5/6 or a PERC H330 (LSI MegaRAID SAS 9341-8i) for RAID 0/1/10.
On 28.06.2019 at 16:46, Blake Hudson blake@ispn.net wrote:
> Given that Linux software RAID does not achieve the goal of RAID (improved availability), my recommendation would be to avoid it.
We have good experiences with MD RAID (Linux software RAID) - for having data redundancy at low cost. For availability we use clustering (different hardware level) ...
-- LF
On 29/06/19 2:46 AM, Blake Hudson wrote:
> Unfortunately, I've never had Linux software RAID improve availability - it has only decreased availability for me.
Sorry for your poor experience. I have achieved much improved availability with Linux software RAID. Most often I use RAID 1, and I have had disks fail with no impact on the client other than slightly reduced response times (in fact they were totally unaware that a drive had failed until I told them). The faulty drive was replaced (by a local person who barely knew how to use a screwdriver) and resynchronized, and all was well: zero data lost. It was a hot-swap bay, so the server did not even have to be powered down. Zero customer-noticed impact, 100% availability.
Just a comment: what RAID 6 (we use that instead of 5, as of years ago) gave us was much larger storage.
When you have, say, over 0.3 petabytes, that starts to matter.
mark
On 6/28/19 4:46 PM, Blake Hudson wrote:
> Unfortunately, I've never had Linux software RAID improve availability - it has only decreased availability for me. This has been due to a combination of hardware and software issues that are generally handled well by HW RAID controllers, but are often handled poorly or unpredictably by desktop oriented hardware and Linux software.
I have to add my data point, and it is an opposite experience.
Software RAID1 and RAID5 (and RAID10) have done their job perfectly for me, with disks failing and being replaced without issues; nor does the resync cause a very noticeable slowdown.
On the other hand, hardware RAID boards have always been a disaster. Slow and ridiculous BIOS utilities, drives being pushed out of the array randomly, SMART data not available anymore and undocumented "formatting" headers on drives so good luck finding an identical controller when the board dies (yeah, with a battery onboard, not the best component for years of reliability...).
It is always software RAID for me. Software RAID + LVM on top is great. For example: RAID1 with sda1 sdb1 sdc1 sdd1 for /boot (yes, a 4-disk RAID1, and have a look at "mdadm -e", i.e. the --metadata option, to have it bootable without the bootloader even knowing it is a RAID), then sd{a,b,c,d}{2,3,4,5,6,...} partitions of reasonable sizes (e.g. 500GB), composed as you prefer, such as RAID1 between sda2-sdb2, RAID1 between sdc2-sdd2, RAID5 between sda3-sdb3-sdc3-sdd3, RAID5 between sda4-sdb4-sdc4-sdd4, ... then pvcreate on the RAID assemblies to place your VGs and LVs. Any movement/enlargement of a filesystem will be easy thanks to LVM. Any drive failure will be easy to handle thanks to the software RAID. You basically never need to turn off the system anymore.
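As a sketch only, the layout described above might translate into commands along these lines. The script is a dry run: it prints the mdadm/LVM commands instead of executing them. The md device numbers and the volume group name vg_data are illustrative assumptions, metadata 1.0 is assumed for the /boot mirror, and the first RAID5 set is taken to be partition 3 on all four disks.

```shell
#!/bin/sh
# Dry run: this only PRINTS the commands; nothing touches the disks.
# md numbers and the volume group name vg_data are illustrative.

# members PARTNUM DISK... -> "/dev/sda1 /dev/sdb1 ..."
members() {
    part=$1; shift
    out=
    for disk in "$@"; do
        out="$out /dev/sd$disk$part"
    done
    echo "${out# }"
}

print_plan() {
    # 4-way mirror for /boot, superblock at the end of each member
    echo "mdadm --create /dev/md0 --level=1 --raid-devices=4 --metadata=1.0 $(members 1 a b c d)"
    # two 2-way mirrors and one 4-disk RAID5 for data
    echo "mdadm --create /dev/md1 --level=1 --raid-devices=2 $(members 2 a b)"
    echo "mdadm --create /dev/md2 --level=1 --raid-devices=2 $(members 2 c d)"
    echo "mdadm --create /dev/md3 --level=5 --raid-devices=4 $(members 3 a b c d)"
    # LVM on top of the data arrays
    echo "pvcreate /dev/md1 /dev/md2 /dev/md3"
    echo "vgcreate vg_data /dev/md1 /dev/md2 /dev/md3"
}

print_plan
```

Review the printed commands against your own partition table before running any of them by hand; mdadm --create is destructive for whatever is on the member partitions.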
Regards.
On Jun 28, 2019, at 8:46 AM, Blake Hudson blake@ispn.net wrote:
> Linux software RAID…has only decreased availability for me. This has been due to a combination of hardware and software issues that are generally handled well by HW RAID controllers, but are often handled poorly or unpredictably by desktop oriented hardware and Linux software.
Would you care to be more specific? I have little experience with software RAID, other than ZFS, so I don’t know what these “issues” might be.
I do have a lot of experience with hardware RAID, and the grass isn’t very green on that side of the fence, either. Some of this will repeat others’ points, but it’s worth repeating, since it means they’re not alone in their pain:
0. Hardware RAID is a product of the time it was produced. My old parallel IDE and SCSI RAID cards are useless because you can’t get disks with that port type any more; my oldest SATA and SAS RAID cards can’t talk to disks bigger than 2 TB; and of those older hardware RAID cards that still do work, they won’t accept a RAID created by a controller of another type, even if it’s from the same company. (Try attaching a 3ware 8000-series RAID to a 3ware 9000-series card, for example.)
Typical software RAID never drops backwards compatibility. You can always attach an old array to new hardware. Or even new arrays to old hardware, within the limitations of the hardware, and those limitations aren’t the software RAID’s fault.
1. Hardware RAID requires hardware-specific utilities. Many hardware RAID systems don’t work under Linux at all, and of those that do, not all provide sufficiently useful Linux-side utilities. If you have to reboot into the RAID BIOS to fix anything, that’s bad for availability.
2. The number of hardware RAID options is going down over time. Adaptec’s almost out of the game, 3ware was bought by LSI and then had their products all but discontinued, and most of the other options you list are rebadged LSI or Adaptec. Eventually it’s going to be LSI or software RAID, and then LSI will probably get out of the game, too. This market segment is dying because software RAID no longer has any practical limitations that hardware can fix.
3. When you do get good-enough Linux-side utilities, they’re often not well-designed. I don’t know anyone who likes the megaraid or megacli64 utilities. I have more experience with 3ware’s tw_cli, and I never developed facility with it beyond pidgin, so that to do anything even slightly uncommon, I have to go back to the manual to piece the command together, else risk roaching the still-working disks.
By contrast, I find the zfs and zpool commands well-designed and easy to use. There’s no mystery why that should be so: hardware RAID companies have their expertise in hardware, not software. Also, “man zpool” doesn’t suck. :)
That coin does have an obverse face, which is that young software RAID systems go through a phase where they have to re-learn just how false, untrustworthy, unreliable, duplicitous, and mendacious the underlying hardware can be. But that expertise builds up over time, so that a mature software RAID system copes quite well with the underlying hardware’s failings.
The inverse expertise in software design doesn’t build up on the hardware RAID side. I assume this is because they fire the software teams once they’ve produced a minimum viable product, then re-hire a new team when their old utilities and monitoring software gets so creaky that it has to be rebuilt from scratch. Then you get a *new* bag of ugliness in the world.
Software RAID systems, by contrast, evolve continuously, and so usually tend towards perfection.
The same problem *can* come up in the software RAID world: witness how much wheel reinvention is going on in the Stratis project! The same amount of effort put into ZFS would have been a better use of everyone’s time.
That option doesn’t even exist on the hardware RAID side, though. Every hardware RAID provider must develop their command line utilities and monitoring software de novo, because even if the Other Company open-sourced its software, that other software can’t work with their proprietary hardware.
4. Because hardware RAID is abstracted below the OS layer, the OS and filesystem have no way to interact intelligently with it.
ZFS is at the pinnacle of this technology here, but CentOS is finally starting to get this through Stratis and the extensions Stratis has required to XFS and LVM. I assume btrfs also provides some of these benefits, though that’s on track to becoming off-topic here.
ZFS can tell you which file is affected by a block that’s bad across enough disks that redundancy can’t fix it. This gives you a new, efficient, recovery option: restore that file from backup or delete it, allowing the underlying filesystem to rewrite the bad block on all disks. With hardware RAID, fixing this requires picking one disk as the “real” copy and telling the RAID card to blindly rewrite all the other copies.
Another example is resilvering: because a hardware RAID has no knowledge of the filesystem, a resilver during disk replacement requires rewriting the entire disk, which takes 8-12 hours these days. If the volume has a lot of free space, a filesystem-aware software RAID resilver can copy only the blocks containing user data, greatly reducing recovery time.
Anecdotally, I can tell you that the ECCs involved in NAS-grade SATA hardware aren’t good enough on their own. We had a ZFS server that would detect about 4-10 kB of bad data on one disk in the pool during every weekend scrub. We never figured out whether the problem was in the disk, its drive cage slot, or its cabling, but it was utterly repeatable. But also utterly unimportant to diagnose, because ZFS kept fixing the problem for us, automatically!
The thing is, we’d have never known about this underlying hardware fault if ZFS’s 256-bit checksums weren’t able to reduce the chances of undetected error to practically-impossible levels. Since ZFS knows, by those same 256-bit hashes, which copy of the data is uncorrupted, it fixed it automatically for us each time for years on end. I doubt any hardware RAID system you favor would have fared as well.
*That’s* uptime. :)
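The self-healing idea can be shown in miniature. This is not how ZFS is implemented, just a toy sketch of the principle: sha256sum stands in for the block checksum recorded at write time, two ordinary files stand in for the two sides of a mirror, and the function names are made up.

```shell
#!/bin/sh
# Stand-in for a block checksum.
checksum_of() { sha256sum "$1" | cut -d' ' -f1; }

# heal_mirror EXPECTED COPY_A COPY_B
# Uses the recorded checksum to decide which copy is good and rewrites
# the bad one from it. Fails if neither copy matches.
heal_mirror() {
    expected=$1; a=$2; b=$3
    sum_a=$(checksum_of "$a")
    sum_b=$(checksum_of "$b")
    if [ "$sum_a" = "$expected" ] && [ "$sum_b" = "$expected" ]; then
        echo "both copies good"
    elif [ "$sum_a" = "$expected" ]; then
        cp "$a" "$b"; echo "repaired $b from $a"
    elif [ "$sum_b" = "$expected" ]; then
        cp "$b" "$a"; echo "repaired $a from $b"
    else
        echo "unrecoverable: no copy matches" >&2
        return 1
    fi
}

# Demonstration with scratch files
work=$(mktemp -d)
printf 'important data\n' > "$work/disk_a"
cp "$work/disk_a" "$work/disk_b"
expected=$(checksum_of "$work/disk_a")        # recorded at write time
printf 'important dat4\n' > "$work/disk_b"    # silent corruption on one copy
heal_mirror "$expected" "$work/disk_a" "$work/disk_b"
cmp -s "$work/disk_a" "$work/disk_b" && echo "mirror consistent again"
rm -rf "$work"
```

A plain mirror without checksums sees two copies that differ and has no way to tell which one is right; the recorded checksum is what breaks the tie.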
5. Hardware RAID made sense back when a PC motherboard rarely had more than 2 hard disk controller ports, and those shared a single IDE lane. In those days, CPUs were slow enough that calculating parity was really costly, and hard drives were small enough that 8+ disk arrays were often required just to get enough space.
Now that you can get 10+ SATA ports on a mobo, parity calculation costs only a tiny slice of a single core in your multicore CPU, and a mirrored pair of multi-terabyte disks is often plenty of space, hardware RAID is increasingly being pushed to the margins of the server world.
Software RAID doesn’t have port count limits at all. With hardware RAID, I don’t buy a 4-port card when a 2-port card will do, because that costs me $100-200 more. With software RAID, I can usually find another place to plug in a drive temporarily, and that port was “free” because it came with the PC.
This matters when I have to replace a disk in my hardware RAID mirror, because now I’m out of ports. I have to choose one of the disks to drop out of the array, losing all redundancy before the recovery even starts, because I need to free up one of the two hardware connectors for the new disk.
That’s fine when the disk I’m replacing is dead, dead, dead, but that isn’t usually the case in my experience. Instead, the disk I’m replacing is merely *dying*, and I’m hoping to get it replaced before it finally dies.
What that means in practice is that with software RAID, I can have an internal mirror, then temporarily connect a replacement drive in a USB or Thunderbolt disk enclosure. Now the resilver operation proceeds with both original disks available, so that if we find that the “good” disk in the original mirror has a bad sector, too, the software RAID system might find that it can pull a good copy from the “bad” disk, saving the whole operation.
Only once the resilver is complete do I have to choose which disk to drop out of the array in a software RAID system. If I choose incorrectly, the software RAID stops work and lets me choose again.
With hardware RAID, if I choose incorrectly, it’s on the front end of the operation instead, so I’ll end up spending 8-12 hours to create a redundant copy of “Wrong!”
Bottom line: I will not shed a tear when my last hardware RAID goes away.
IMHO, Hardware raid primarily exists because of Microsoft Windows and VMware esxi, neither of which have good native storage management.
Because of this, it's fairly hard to order a major brand (HP, Dell, etc) server without raid cards.
Raid cards do have the performance boost of nonvolatile write back cache. Newer/better cards use supercap flash for this, so battery life is no longer an issue.
That said, make my Unix boxes zfs or mdraid+xfs on jbod for all the reasons previously given.
> IMHO, Hardware raid primarily exists because of Microsoft Windows and VMware esxi, neither of which have good native storage management.
> Because of this, it's fairly hard to order a major brand (HP, Dell, etc) server without raid cards.
> Raid cards do have the performance boost of nonvolatile write back cache. Newer/better cards use supercap flash for this, so battery life is no longer an issue
The supercaps may be more stable than batteries but they can also fail. Since I had to replace the supercap of a HP server I know they also do fail. That's why they are also built as a module connected to the controller :-)
As for the write back cache, good SSDs do the same with integrated cache and supercaps, so you really don't need the RAID controller to do it anymore.
> That said, make my Unix boxes zfs or mdraid+xfs on jbod for all the reasons previously given.
Same here, after long years of all kinds of RAID hardware, I'm happy to run everything on mdraid+xfs. Software RAID on directly attached U.2 NVMe disks is all we use for new servers. It's fast, stable and, also important, still KISS.
Regards, Simon
Warren Young wrote on 6/28/2019 6:53 PM:
> On Jun 28, 2019, at 8:46 AM, Blake Hudson blake@ispn.net wrote:
>> Linux software RAID…has only decreased availability for me. This has been due to a combination of hardware and software issues that are generally handled well by HW RAID controllers, but are often handled poorly or unpredictably by desktop oriented hardware and Linux software.
> Would you care to be more specific? I have little experience with software RAID, other than ZFS, so I don’t know what these “issues” might be.
I've never used ZFS, as its Linux support has been historically poor. My comments are limited to mdadm. I've experienced three faults when using Linux software raid (mdadm) on RH/RHEL/CentOS and I believe all of them resulted in more downtime than would have been experienced without the RAID:
1) A single drive failure in a RAID4 or 5 array (desktop IDE) caused the entire system to stop responding. The result was a degraded (from the dead drive) and dirty (from the crash) array that could not be rebuilt (either of the former conditions would have been fine, but not both due to buggy Linux software).
2) A single drive failure in a RAID1 array (Supermicro SCSI) caused the system to be unbootable. We had to update the BIOS to boot from the working drive and possibly grub had to be repaired or reinstalled as I recall (it's been a long time).
3) A single drive failure in a RAID 4 or 5 array (desktop IDE) was not clearly identified and required a bit of troubleshooting to pinpoint which drive had failed.
Unfortunately, I've never had an experience where a drive just failed cleanly and was marked bad by Linux software RAID and could then be replaced without fanfare. This is in contrast to my HW raid experiences where a single drive failure is almost always handled in a reliable and predictable manner with zero downtime. Your points about having to use a clunky BIOS setup or CLI tools may be true for some controllers, as are your points about needing to maintain a spare of your RAID controller, ongoing driver support, etc. I've found the LSI brand cards have good Linux driver support, CLI tools, an easy to navigate BIOS, and are backwards compatible with RAID sets taken from older cards so I have no problem recommending them. LSI cards, by default, also regularly test all drives to predict failures (avoiding rebuild errors or double failures).
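On the third point (pinpointing which drive failed), md marks a faulty member with an "(F)" suffix in /proc/mdstat. The following is a minimal sketch, not a tested production tool: the function name `failed_md_members` is made up for illustration, and it assumes the usual one-line-per-array mdstat layout.

```shell
#!/usr/bin/env bash
# List md member devices that the kernel has flagged as faulty, i.e. the
# entries carrying an "(F)" suffix in /proc/mdstat-style text.
# Reads the file given as $1 (default: /proc/mdstat).
# failed_md_members is a hypothetical helper name, not part of mdadm.
failed_md_members() {
    local src="${1:-/proc/mdstat}"
    awk '
        # Array lines look like:
        #   md0 : active raid5 sdd1[3] sdc1[2](F) sdb1[1] sda1[0]
        $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[[0-9]+\].*$/, "", dev)   # strip "[2](F)"
                    print $1, "/dev/" dev
                }
            }
        }
    ' "$src"
}
```

Once the kernel name is known, `smartctl -i /dev/sdX` prints the drive's serial number, which is what the person pulling the disk actually needs.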
On July 1, 2019 8:56:35 AM CDT, Blake Hudson blake@ispn.net wrote:
[...]
+1 in favor of hardware RAID.
My usual argument is: with hardware RAID, a dedicated piece of hardware runs a single task, the RAID function, which boils down to a simple, short program that is easy to debug well. With software RAID there is no dedicated hardware, and if the kernel (big and buggy code) panics, the current RAID operation will never finish, which leaves a mess. One does not need a computer science degree to follow this simple logic.
Valeri
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos
++++++++++++++++++++++++++++++++++++++++ Valeri Galtsev Sr System Administrator Department of Astronomy and Astrophysics Kavli Institute for Cosmological Physics University of Chicago Phone: 773-702-4247 ++++++++++++++++++++++++++++++++++++++++
On Jul 1, 2019, at 8:26 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
RAID function, which boils down to a simple, short program that is easy to debug well.
RAID firmware will be harder to debug than Linux software RAID, if only because of easier-to-use tools.
Furthermore, MD RAID only had to be debugged once, rather than once per company-and-product line as with hardware RAID.
I hope you’re not assuming that hardware RAID has no bugs. It’s basically a dedicated CPU running dedicated software that’s difficult to upgrade.
if kernel (big and buggy code) is panicked, current RAID operation will never be finished which leaves the mess.
When was the last time you had a kernel panic? And of those times, when was the last time it happened because of something other than a hardware or driver fault? If it wasn’t for all this hardware doing strange things, the kernel would be a lot more stable. :)
You seem to be saying that hardware RAID can’t lose data. You’re ignoring the RAID 5 write hole:
https://en.wikipedia.org/wiki/RAID#WRITE-HOLE
If you then bring up battery backups, now you’re adding cost to the system. And then some ~3-5 years later, downtime to swap the battery, and more downtime. And all of that just to work around the RAID write hole.
Copy-on-write filesystems like ZFS and btrfs avoid the write hole entirely, so that the system can crash at any point, and the filesystem is always consistent.
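To make the write hole concrete, here is a toy stripe with three data blocks and one XOR parity block, each modelled as a small integer. The values are invented for illustration; real arrays do the same XOR over whole chunks.

```shell
#!/usr/bin/env bash
# Toy RAID 5 stripe: three data blocks plus XOR parity.
# All numbers are made up for illustration.
d0=170 d1=204 d2=240
parity=$(( d0 ^ d1 ^ d2 ))

# A write updates d0, but the machine dies before parity is rewritten.
d0=85                      # new data reached the disk
stale_parity=$parity       # parity still describes the old stripe

# Later the disk holding d1 fails and is rebuilt from the survivors.
rebuilt_d1=$(( d0 ^ d2 ^ stale_parity ))

echo "original d1=$d1 rebuilt d1=$rebuilt_d1"
```

The rebuilt block differs from the original even though d1 itself was never written, which is why the damage goes unnoticed until a later disk failure forces a rebuild.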
On Mon, 1 Jul 2019, Warren Young wrote:
If you then bring up battery backups, now you’re adding cost to the system. And then some ~3-5 years later, downtime to swap the battery, and more downtime. And all of that just to work around the RAID write hole.
Although batteries have disappeared in favour of NV storage + capacitors, meaning you don't have to replace anything on those models.
jh
That's what you think before you have to replace the capacitors module :-)
Simon
You seem to be saying that hardware RAID can’t lose data. You’re ignoring the RAID 5 write hole:
https://en.wikipedia.org/wiki/RAID#WRITE-HOLE
If you then bring up battery backups, now you’re adding cost to the system. And then some ~3-5 years later, downtime to swap the battery, and more downtime. And all of that just to work around the RAID write hole.
Yes. Furthermore, with the huge capacity disks in use today, rebuilding a RAID 5 array after a disk fails, with all the necessary parity calculations, can take days. RAID 5 is obsolete, and I'm not the only one saying it.
Needless to say, both hardware and software RAID have the problem above.
Simon
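A back-of-the-envelope check on the "can take days" claim: the replacement disk has to be written end to end, so disk size divided by sustained rebuild rate gives a lower bound. The 3 TB size is the original poster's; the throughput figures are assumptions, not measurements.

```shell
#!/usr/bin/env bash
# Lower bound on rebuild time in whole hours.
# rebuild_hours is a hypothetical helper; arguments are disk size in TB
# and the sustained rebuild rate in MB/s.
rebuild_hours() {
    local size_tb="$1" rate_mb_s="$2"
    # 1 TB = 1,000,000 MB (decimal units, as drive vendors count)
    echo $(( size_tb * 1000000 / rate_mb_s / 3600 ))
}

rebuild_hours 3 150   # idle array, assumed 150 MB/s sustained
rebuild_hours 3 20    # busy array, assumed 20 MB/s left over for the rebuild
```

On an idle box a 3 TB member is a matter of hours; it is the throttled rebuild on a loaded server, or a much larger disk, that stretches into days.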
On 2019-07-01 10:01, Warren Young wrote:
On Jul 1, 2019, at 8:26 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
RAID function, which boils down to a simple, short program that is easy to debug well.
I didn't intend to start a software vs hardware RAID flame war when I joined in with somebody else's opinion.
Now, commenting with all due respect to the famous person that Warren Young definitely is.
RAID firmware will be harder to debug than Linux software RAID, if only because of easier-to-use tools.
I myself debug neither firmware (or "microcode", speaking the language as it was some 30 years ago) nor the Linux kernel. In both cases it is someone else who does the debugging.
You are speaking as the person who routinely debugs Linux components. I still have to stress that in debugging RAID card firmware one deals with a small program, which is what this firmware is.
In the case of debugging EVERYTHING that affects the reliability of software RAID, one has to debug the following:
1. The Linux kernel itself, which is huge;
2. _all_ the drivers that are loaded when the system runs. Some of the drivers on one's system may be binary only, like NVIDIA video card drivers. So even for those who, like Warren, can debug all code, these are still not accessible.
All of the above can potentially panic the kernel (as they all run in kernel context), so they all affect the reliability of software RAID, not only the chunk of software doing the software RAID function.
Furthermore, MD RAID only had to be debugged once, rather than once per company-and-product line as with hardware RAID.
Alas, MD RAID itself is not the only thing that affects the reliability of software RAID. A panicking kernel has grave effects on software RAID, so anything that can panic the kernel also has to be debugged just as thoroughly. And it always has to be redone once changes to the kernel or drivers are introduced.
I hope you’re not assuming that hardware RAID has no bugs. It’s basically a dedicated CPU running dedicated software that’s difficult to upgrade.
That's true, it is a dedicated CPU running a dedicated program, and it keeps doing so even if the operating system has crashed. Yes, hardware itself can be unreliable. But in the case of a RAID card it is only the card itself, whose failure rate in my racks is much smaller than the overall failure rate of everything else. A kernel panic, by contrast, can be caused by any piece of hardware inside the computer in some mode of failure.
One more thing: apart from the hardware RAID "firmware" program being small and logically simple, there is one more factor: it usually runs on a RISC architecture CPU, and introducing bugs when programming for a RISC architecture is, IMHO, more difficult than when programming for i386 and amd64 architectures. Just my humble opinion, which I have carried since the time I was programming.
if kernel (big and buggy code) is panicked, current RAID operation will never be finished which leaves the mess.
When was the last time you had a kernel panic? And of those times, when was the last time it happened because of something other than a hardware or driver fault? If it wasn’t for all this hardware doing strange things, the kernel would be a lot more stable. :)
Yes, I half expected that. When did we last have a kernel crash, and who of us is unable to choose reliable hardware, and unable to insist that our institution pay a mere 5-10% higher price for a reliable box than it would for junk hardware? Indeed, we all run reliable boxes, and I am retiring still reliably working machines aged 10-13 years...
However, I would rather suggest comparing not absolute probabilities, which, exactly as you said, are infinitesimal, but relative ones. On relative probabilities, I will still go with hardware RAID.
You seem to be saying that hardware RAID can’t lose data. You’re ignoring the RAID 5 write hole:
https://en.wikipedia.org/wiki/RAID#WRITE-HOLE
Neither of our RAID cards runs without battery backup.
If you then bring up battery backups, now you’re adding cost to the system. And then some ~3-5 years later, downtime to swap the battery, and more downtime. And all of that just to work around the RAID write hole.
You are absolutely right about a system with hardware RAID being more expensive than one with software RAID. I would say, for "small scale big storage" boxes (i.e. NOT distributed file systems), hardware RAID adds about 5-7% of cost in our case. Now, with hardware RAID all maintenance (what one needs to do in the routine case of replacing a single failed drive) takes about 1/10 of the time necessary to deal with a similar failure in the case of software RAID. I deal with both, as it historically happened, so this is my own observation. Maybe the software RAID boxes I have to deal with are too messy (imagine almost two dozen software RAIDs of 12-16 drives each on one machine; even the BIOS runs out of numbers in its attempt to enumerate all the drives...). No, I am not taking the blame for building a box like that ;-)
All in all, the simpler way of routinely dealing with hardware RAID saves the human time involved, and in the long run is quite likely money saving (think of salaries, benefits etc.), though it looks more expensive at the moment of hardware purchase.
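For readers weighing that time estimate, the routine single-drive swap on mdraid usually comes down to the steps below. This is a sketch only: `md_replace_plan` is a made-up helper that prints the commands without running them, and the array and partition names are assumptions.

```shell
#!/usr/bin/env bash
# Print the mdadm steps for swapping one failed member of an md array.
# Nothing is executed; review the output and run it by hand as root.
# md_replace_plan is a hypothetical helper name.
md_replace_plan() {
    local array="$1" old="$2" new="$3"
    echo "mdadm --manage $array --fail $old"
    echo "mdadm --manage $array --remove $old"
    echo "# physically swap the disk and partition it to match, then:"
    echo "mdadm --manage $array --add $new"
    echo "cat /proc/mdstat"
}

md_replace_plan /dev/md0 /dev/sdb1 /dev/sdb1
```

With a couple of dozen arrays on one machine, as described above, the hard part is less these commands than knowing which array and which physical disk they apply to.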
Copy-on-write filesystems like ZFS and btrfs avoid the write hole entirely, so that the system can crash at any point, and the filesystem is always consistent.
Well, yes, ZFS is a wholly different dimension, and none of my comments were meant to be made in the presence of ZFS ;-)
I guess our discussions (this, I'm sure, is not the first one) leave each of us with our own opinions unchanged, so this will be my last set of comments on this thread; whoever reads it will be able to use their own brain and arrive at their own conclusions.
With a hope this helps somebody,
Valeri
On Jul 1, 2019, at 10:10 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
On 2019-07-01 10:01, Warren Young wrote:
On Jul 1, 2019, at 8:26 AM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
RAID function, which boils down to a simple, short program that is easy to debug well.
I didn't intend to start software vs hardware RAID flame war
Where is this flame war you speak of? I’m over here having a reasonable discussion. I’ll continue being reasonable, if that’s all right with you. :)
Now, commenting with all due respect to famous person who Warren Young definitely is.
Since when? I’m not even Internet Famous.
RAID firmware will be harder to debug than Linux software RAID, if only because of easier-to-use tools.
I myself debug neither firmware (or "microcode", speaking the language as it was some 30 years ago)
There is a big distinction between those two terms; they are not equivalent terms from different points in history. I had a big digression explaining the difference, but I’ve cut it as entirely off-topic.
It suffices to say that with hardware RAID, you’re almost certainly talking about firmware, not microcode, not just today, but also 30 years ago. Microcode is a much lower level thing than what happens at the user-facing product level of RAID controllers.
In both cases it is someone else who does the debugging.
If it takes three times as much developer time to debug a RAID card firmware as it does to debug Linux MD RAID, and the latter has to be debugged only once instead of multiple times as the hardware RAID firmware is reinvented again and again, which one do you suppose ends up with more bugs?
You are speaking as the person who routinely debugs Linux components.
I have enough work fixing my own bugs that I rarely find time to fix others’ bugs. But yes, it does happen once in a while.
- Linux kernel itself, which is huge;
…under which your hardware RAID card’s driver runs, making it even more huge than it was before that driver was added.
You can’t zero out the Linux kernel code base size when talking about hardware RAID. It’s not like the card sits there and runs in a purely isolated environment.
It is a testament to how well-debugged the Linux kernel is that your hardware RAID card runs so well!
All of the above can potentially panic kernel (as they all run in kernel context), so they all affect reliability of software RAID, not only the chunk of software doing software RAID function.
When the kernel panics, what do you suppose happens to the hardware RAID card? Does it keep doing useful work, and if so, for how long?
What’s more likely these days: a kernel panic or an unwanted hardware restart? And when that happens, which is more likely to fail, a hardware RAID without BBU/NV storage or a software RAID designed to be always-consistent?
I’m stripping away your hardware RAID’s advantage in NV storage to keep things equal in cost: my on-board SATA ports for your stripped-down hardware RAID card. You probably still paid more, but I’ll give you that, since you’re using non-commodity hardware.
Now that they’re on even footing, which one is more reliable?
hardware RAID "firmware" program being small and logically simple
You’ve made an unwarranted assumption.
I just did a blind web search and found this page:
https://www.broadcom.com/products/storage/raid-controllers/megaraid-sas-9361...
…on which we find that the RAID firmware for the card is 4.1 MB, compressed.
Now, that’s considered a small file these days, but realize that there are no 1024 px² icon files in there, no massive XML libraries, no language internationalization files, no high-level language runtimes… It’s just millions of low-level highly-optimized CPU instructions.
From experience, I’d expect it to take something like 5-10 person-years to reproduce that much code.
That’s far from being “small and logically simple.”
it usually runs on a RISC architecture CPU, and introducing bugs when programming for a RISC architecture is, IMHO, more difficult than when programming for i386 and amd64 architectures.
I don’t think I’ve seen any such study, and if I did, I’d expect it to only be talking about assembly language programming.
Above that level, you’re talking about high-level language compilers, and I don’t think the underlying CPU architecture has anything to do with the error rates in programs written in high-level languages.
I’d expect RAID firmware to be written in C, not assembly language, which means the CPU has little or nothing to do with programmer error rates.
Thought experiment: does Linux have fewer bugs on ARM than on x86_64?
I even doubt that you can dig up a study showing that assembly language programming on CISC is significantly more error-prone than RISC programming in the first place. My experience says that error rates in programs are largely a function of the number of lines of code, and that puts RISC at a severe disadvantage. For ARM vs x86, the instruction ratio is roughly 3:1 for equivalent user-facing functionality.
There are many good reasons why the error rate in programs should be so strongly governed by lines of code:
1. More LoC is more chances for typos and logic errors.
2. More LoC means a smaller proportion of the solution fits on the screen at once, hiding information from the programmer. Out of sight, out of mind.
3. More LoC takes more time to compose and type, so a programmer writing fewer LoC has more time to debug and test, all else being equal.
This is also why almost no one writes in assembly any more, and those who do rarely write *just* in assembly.
You are absolutely right about a system with hardware RAID being more expensive than one with software RAID. I would say, for "small scale big storage" boxes (i.e. NOT distributed file systems), hardware RAID adds about 5-7% of cost in our case. Now, with hardware RAID all maintenance (what one needs to do in the routine case of replacing a single failed drive) takes about 1/10 of the time necessary to deal with a similar failure in the case of software RAID. I deal with both, as it historically happened, so this is my own observation. Maybe the software RAID boxes I have to deal with are too messy (imagine almost two dozen software RAIDs of 12-16 drives each on one machine; even the BIOS runs out of numbers in its attempt to enumerate all the drives...). No, I am not taking the blame for building a box like that ;-)
All in all, the simpler way of routinely dealing with hardware RAID saves the human time involved, and in the long run is quite likely money saving (think of salaries, benefits etc.), though it looks more expensive at the moment of hardware purchase.
It can also be the other way around: if you are a Linux-only shop and you have a large number of systems with a large number of different controller brands and generations, you may just start to hate how they all work differently and have their different issues, which can really give you lots of gray hairs. Doing it all with MD RAID can make your life much easier!
People should also be aware that the firmware of common desktop disks is not optimal for handling errors in RAID configurations. They need different firmware parameters for optimal use in RAID, be it hardware or software.
Regards, Simon
On Jul 1, 2019, at 7:56 AM, Blake Hudson blake@ispn.net wrote:
I've never used ZFS, as its Linux support has been historically poor.
When was the last time you checked?
The ZFS-on-Linux (ZoL) code has been stable for years. In recent months, the BSDs have rebased their offerings from Illumos to ZoL. The macOS port, called O3X, is also mostly based on ZoL.
That leaves Solaris as the only major OS with a ZFS implementation not based on ZoL.
1) A single drive failure in a RAID4 or 5 array (desktop IDE)
Can I take it that by “IDE” you mean “before SATA”, so you’re giving a data point something like twenty years old?
2) A single drive failure in a RAID1 array (Supermicro SCSI)
Another dated tech reference, if by “SCSI” you mean parallel SCSI, not SAS.
I don’t mind old tech per se, but at some point the clock on bugs must reset.
We had to update the BIOS to boot from the working drive
That doesn’t sound like a problem with the Linux MD raid feature. It sounds like the system BIOS had a strange limitation about which drives it was willing to consider bootable.
and possibly grub had to be repaired or reinstalled as I recall
That sounds like you didn’t put GRUB on all disks in the array, which in turn means you probably set up the RAID manually, rather than through the OS installer, which should take care of details like that for you.
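If the array was set up by hand, putting the boot loader on every member of the RAID 1 /boot is a one-liner per disk. The sketch below assumes a BIOS-booted CentOS 7 box (where the tool is `grub2-install`) and made-up disk names; `grub_all_disks` is a hypothetical helper that only prints the commands.

```shell
#!/usr/bin/env bash
# Print a grub2-install command for every disk backing a RAID 1 /boot,
# so that any surviving disk can boot the box. Dry run only.
# grub_all_disks is a hypothetical helper name.
grub_all_disks() {
    local disk
    for disk in "$@"; do
        echo "grub2-install $disk"
    done
}

grub_all_disks /dev/sda /dev/sdb
```

UEFI systems are a different story, since the EFI system partition cannot be part of an md array in the same way.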
3) A single drive failure in a RAID 4 or 5 array (desktop IDE) was not clearly identified and required a bit of troubleshooting to pinpoint which drive had failed.
I don’t know about Linux MD RAID, but with ZFS, you can make it tell you the drive’s serial number when it’s pointing out a faulted disk.
Software RAID also does something that I haven’t seen in typical PC-style hardware RAID: marry GPT partition drive labels to array status reports, so that instead of seeing something that’s only of indirect value like “port 4 subunit 3” you can make it say “left cage, 3rd drive down”.
I haven't been following this thread closely, but some of the posts have left me puzzled.
1. Hardware RAID: other than Rocket RAID, who don't seem to support a card for more than about 3 years (I used to have to update and rebuild the drivers), anything LSI based, which includes Dell PERC, has been pretty good. The newer models do even better at doing the right thing.
2. ZFS seems to be ok, though we were testing it with an Ubuntu system just a month or so ago. Note: ZFS with a raidz2 zpool (the equivalent of RAID 6), which we set up using the LSI card set to JBOD, took about 3 days and 8 hours to back up a large project, while the same o/s, but with xfs on an LSI hardware RAID 6, took about 10 hours less. Hardware RAID is faster.
3. Being in the middle of going through three days of hourly logs and the loghost reports, and other stuff, from the weekend (> 600 emails), I noted that we have something like 50 mdraids, and we've had very little trouble with them. Almost all are either RAID 1 or RAID 6 (we may have a RAID 5 left). The exceptions are the system that had a h/d fail, and another starting to throw errors (I suspect the server itself...). The biggest issue for me is that when one fails, "identify" rarely works, which means using smartctl or MegaCli64 (or the lsi script) to find the s/n of the drive, then guessing... and if that doesn't work, bringing the system down to find the right bloody bad drive. But... they rebuild, no problems.
Oh, and I have my own workstation at home on a mdraid 1.
mark
Speaking of ZFS, got a weird one: we were testing ZFS (ok, it was on Ubuntu, but that shouldn't make a difference, I would think), and I've got a raidz2 zpool. I pulled one drive, to simulate a drive failure, and it rebuilt with the hot spare. Then I pushed the drive I'd pulled back in... and it does not look like I've got a hot spare. zpool status shows config:
    NAME         STATE     READ WRITE CKSUM
    export1      ONLINE       0     0     0
      raidz2-0   ONLINE       0     0     0
        sda      ONLINE       0     0     0
        spare-1  ONLINE       0     0     0
          sdb    ONLINE       0     0     0
          sdl    ONLINE       0     0     0
        sdc      ONLINE       0     0     0
        sdd      ONLINE       0     0     0
        sde      ONLINE       0     0     0
        sdf      ONLINE       0     0     0
        sdg      ONLINE       0     0     0
        sdh      ONLINE       0     0     0
        sdi      ONLINE       0     0     0
        sdj      ONLINE       0     0     0
        sdk      ONLINE       0     0     0
    spares
      sdl        INUSE     currently in use
Does anyone know what I need to do to make the spare sdl back to being just a hot spare?
mark
On Jul 1, 2019, at 9:44 AM, mark m.roth@5-cent.us wrote:
it was on Ubuntu, but that shouldn't make a difference, I would think
Indeed not. It’s been years since the OS you were using implied a large set of OS-specific ZFS features.
There are still differences among the implementations, but the number of those is getting smaller as the community converges on ZoL as the common base.
Over time, the biggest difference among ZFS implementations will be time-based: a ZFS pool created in 2016 will have fewer feature flags than one created in 2019, so the 2019 pool won’t import on older OSes.
I pulled one drive, to simulate a drive failure, and it rebuilt with the hot spare. Then I pushed the drive I'd pulled back in... and it does not look like I've got a hot spare. zpool status shows config:
I think you’re expecting more than ZFS tries to deliver here. Although it’s filesystem + RAID + volume manager, it doesn’t also include storage device management features.
If you need this kind of thing to just happen automagically, you probably want to configure zed:
https://zfsonlinux.org/manpages/0.8.0/man8/zed.8.html
But, if you can spare human cycles to deal with it, you don’t need zed.
What’s happened here is that you didn’t tell ZFS that the disk is no longer part of the pool, so that when it came back, ZFS says, “Hey, I recognize that disk! It belonged to me once. It must be mine again.” But then it goes and tries to fit it into the pool and finds that there are no gaps to stick it into.
So, one option is to remove that replaced disk from the pool, then reinsert it as the new hot spare:
    $ sudo zpool remove export1 sdb
    $ sudo zpool add export1 spare sdb
The first command removes the ZFS header info from the disk, and the second puts it back on, marking it as a spare.
Alternately, you can relieve your prior hot spare (sdl) from its new duty — “new sdb” — putting sdb back in its prior place:
$ sudo zpool replace export1 sdl sdb
That does a full resilver of the replacement disk, a cost you already paid for with the hot spare failover, but it does have the advantage of keeping the disks in alphabetical order by /dev name, as you’d probably expect.
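One caveat, offered as something to verify against your own ZFS version and not as a correction from the thread: the zpool documentation describes `zpool detach` as the way to end a spare replacement. Detaching the hot spare cancels the replacement and returns it to the spares list, while detaching the original disk makes the spare permanent. `zpool remove` may refuse a disk that is still an active member of a `spare-N` vdev. A sketch of both routes, with `spare_plan` being a made-up helper that only prints commands:

```shell
#!/usr/bin/env bash
# Print the zpool commands for the two ways out of an active spare-N vdev.
# Pool and device names are taken from the status output quoted above.
# spare_plan is a hypothetical helper name; nothing is executed.
spare_plan() {
    local pool="$1" original="$2" spare="$3" keep="$4"
    case "$keep" in
        original)
            # Original disk is healthy again: detaching the spare cancels
            # the replacement and returns the spare to the "spares" list.
            echo "zpool detach $pool $spare"
            ;;
        spare)
            # Make the spare permanent, then offer the old disk as a spare.
            echo "zpool detach $pool $original"
            echo "zpool add $pool spare $original"
            ;;
        *)
            echo "usage: spare_plan POOL ORIGINAL SPARE original|spare" >&2
            return 1
            ;;
    esac
}

spare_plan export1 sdb sdl original
```

In the second route the re-added disk still carries an old ZFS label, so the `zpool add` step may ask for confirmation or a forced add.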
But, rather than get exercised about whether putting sdl between sda and sdc makes sense, I’d strongly encourage you to get away from raw /dev/sd? names. The fastest path in your setup to logical device names is:
    $ sudo zpool export export1
    $ sudo zpool import -d /dev/disk/by-id export1
All of the raw /dev/sd? names will change to /dev/disk/by-id/* names, which embed each drive's model and serial number and which I find to be the most convenient form for determining which disk is which when swapping out failed disks. It doesn’t take a very smart set of remote “hands” at a site to read serial numbers off of disks to determine which is the faulted disk.
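On Linux, udev publishes persistent per-disk links under /dev/disk/by-id, and their names normally embed the model and serial number. A small sketch for seeing which kernel name each link currently points at; `disk_id_map` is a hypothetical helper, and the directory argument exists only so it can be tried against a scratch directory.

```shell
#!/usr/bin/env bash
# Show which kernel name each persistent by-id link points at.
# Partition links ("-partN") are skipped.
# disk_id_map is a hypothetical helper name.
disk_id_map() {
    local dir="${1:-/dev/disk/by-id}" link
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        case "$link" in *-part[0-9]*) continue ;; esac
        printf '%s -> %s\n' "$(basename "$link")" \
            "$(basename "$(readlink "$link")")"
    done
}
```

Running `disk_id_map` with no argument before pulling a disk gives the remote hands a serial number to look for instead of a /dev/sd? letter that can change across reboots.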
The main problem with that scheme is that pulling disks to read their labels works best with the pool exported. If you want to be able to do device replacement with the pool online, you need some way to associate particular disks with their placement in the server’s drive bays.
To get there, you’d have to be using GPT-partitioned disks. ZFS normally does that these days, creating one big partition that’s optimally-aligned, which you can then label with gdisk’s “c” command.
Having done that, you can then do “zpool import -d /dev/disk/by-partlabel” instead, which gets you the logical disk naming scheme I’ve spoken of twice in the other thread.
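A rough sketch of the labeling step, as a dry run that only prints commands. It uses sgdisk, the scriptable sibling of gdisk, whose `-c partnum:name` option sets a partition name. Two assumptions here: that partition 1 is the big ZFS data partition on each disk, and that the bay labels (left-1 and so on) are the naming scheme you want. The function name `print_label_cmds` is invented for this example.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the sgdisk commands that would
# name partition 1 on each disk after its drive bay.
# Input lines on stdin: "<bay-label> <device>", e.g. "left-1 /dev/sda".
# Assumption: partition 1 is the ZFS data partition on every disk.
print_label_cmds() {
    while read -r label dev; do
        # Skip blank or incomplete lines.
        [ -n "$label" ] && [ -n "$dev" ] || continue
        printf 'sgdisk -c 1:%s %s\n' "$label" "$dev"
    done
}

# Example (prints two commands, changes nothing):
#   printf 'left-1 /dev/sda\nleft-2 /dev/sdb\n' | print_label_cmds
```

Review the printed commands before running any of them as root, and do the labeling with the pool exported. Afterwards, importing with `zpool import -d /dev/disk/by-partlabel export1` should show the bay names in the pool status.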
If you must use whole-disk vdevs, then I’d at least write the last few digits of each drive’s serial number on the drive cage or the end of the drive itself, so you can just tell the tech “remove the one marked ab212”.
Note by the way that all of this happened because you reintroduced a ZFS-labeled disk into the pool. That normally doesn’t happen. Normally, a replacement is a brand new disk, without any ZFS labeling on it, so you’d jump straight to the “zpool add” step. The prior hot spare took over, so now you’re just giving the pool a hot spare again.
On 2019-07-01 10:10, mark wrote:
I haven't been following this thread closely, but some of them have left me puzzled.
- Hardware RAID: other than Rocket RAID, who don't seem to support a card more than about 3 years (I used to have to update and rebuild the drivers), anything LSI based, which includes Dell PERC, has been pretty good. The newer models do even better at doing the right thing.
- ZFS seems to be ok, though we were testing it with an Ubuntu system just a month or so ago. Note: ZFS with a zpoolZ2 - the equivalent of RAID 6, which we set up using the LSI card set to JBOD - took about 3 days and 8 hours for backing up a large project, while the same o/s, but with xfs on an LSI-hardware RAID 6, took about 10 hours less. Hardware RAID is faster.
- Being in the middle of going through three days of hourly logs and the loghost reports, and other stuff, from the weekend (> 600 emails), I noted that we have something like 50 mdraids, and we've had very little trouble with them, almost all are either RAID 1 or RAID 6 (we may have a RAID 5 left), except for the system that had a h/d fail, and another starting to throw errors (I suspect the server itself...). The biggest issue for me is that when one fails, "identify" rarely works, which means use smartctl or MegaCli64 (or the lsi script) to find the s/n of the drive, then guess... and if that doesn't work, bring the system down to find the right bloody bad drive.
In my case I spend a bit of time before I roll out the system, so I know which physical drive (or which tray) the controller numbers with which number. They stay the same over the life of the system, those are just physical connections. Then when the controller tells drive number "N" failed, I know which tray to pull.
Valeri
But... they rebuild, no problems.
Oh, and I have my own workstation at home on a mdraid 1.
mark
On Jul 1, 2019, at 9:10 AM, mark m.roth@5-cent.us wrote:
ZFS with a zpoolZ2
You mean raidz2.
which we set up using the LSI card set to JBOD
Some LSI cards require a complete firmware re-flash to get them into “IT mode” which completely does away with the RAID logic and turns them into dumb SATA controllers. Consequently, you usually do this on the lowest-end models, since there’s no point paying for the expensive RAID features on the higher-end cards when you do this.
I point this out because there’s another path, which is to put each disk into a single-target “JBOD”, which is less efficient, since it means each disk is addressed indirectly via the RAID chipset, rather than as just a plain SATA disk.
You took the first path, I hope?
We gave up on IT-mode LSI cards when motherboards with two SFF-8087 connectors became readily available, giving easy 8-drive arrays. No need for the extra board any more.
took about 3 days and 8 hours for backing up a large project, while the same o/s, but with xfs on an LSI-hardware RAID 6, took about 10 hours less. Hardware RAID is faster.
I doubt the speed difference is due to hardware vs software. The real difference you tested there is ZFS vs XFS, and you should absolutely expect to pay some performance cost with ZFS. You’re getting a lot of features in trade.
I wouldn’t expect the difference to be quite that wide, by the way. That brings me back to my guess about IT mode vs RAID JBOD mode on your card.
Anyway, one of those compensating benefits is snapshot-based backups.
Before starting the first backup, set a ZFS snapshot. Do the backup with a “zfs send” of the snapshot, rather than whatever file-level backup tool you were using before. When that completes, create another snapshot and send *that* snapshot. This will complete much faster, because ZFS uses the two snapshots to compute the set of changed blocks between the two snapshots and sends only the changed blocks.
This is a sub-file level backup, so that if a 1 kB header changes in a 2 GB data file, you send only one block’s worth of data to the backup server, since you’ll be using a block size bigger than 1 kB, and that header — being a *header* — won’t straddle two blocks. This is excellent for filesystems with large files that change in small areas, like databases.
You might say, “I can do that with rsync already,” but with rsync, you have to compute this delta on each backup, which means reading all of the blocks on *both* sides of the backup. ZFS snapshots keep that information continuously as the filesystem runs, so there is nothing to compute at the beginning of the backup.
rsync’s delta compression primarily saves time only when the link between the two machines is much slower than the disks on either side, so that the delta computation overhead gets swamped by the bottleneck’s delays.
With ZFS, the inter-snapshot delta computation is so fast that you can use it even when you’ve got two servers sitting side by side with a high-bandwidth link between them.
Once you’ve got a scheme like this rolling, you can do backups very quickly, possibly even sub-minute.
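To make the cycle concrete, here is a dry-run sketch of the snapshot-then-send sequence described above. It prints the commands instead of running them. The dataset names, the backup host, and the snapshot names are placeholders, and `print_backup_cmds` is a made-up function name, not part of ZFS or Sanoid.

```shell
#!/bin/sh
# Dry-run sketch of the snapshot/send cycle: prints the commands rather
# than running them. All names passed in are placeholders.
# $1 = source dataset, $2 = backup host, $3 = target dataset,
# $4 = new snapshot name, $5 = previous snapshot name (optional)
print_backup_cmds() {
    src=$1 host=$2 dst=$3 new=$4 prev=${5:-}
    printf 'zfs snapshot %s@%s\n' "$src" "$new"
    if [ -z "$prev" ]; then
        # First backup: send the whole snapshot.
        printf 'zfs send %s@%s | ssh %s zfs receive %s\n' \
            "$src" "$new" "$host" "$dst"
    else
        # Later backups: send only blocks changed since the previous snapshot.
        printf 'zfs send -i %s@%s %s@%s | ssh %s zfs receive %s\n' \
            "$src" "$prev" "$src" "$new" "$host" "$dst"
    fi
}

# Example:
#   print_backup_cmds tank/proj backuphost backup/proj mon       # full send
#   print_backup_cmds tank/proj backuphost backup/proj tue mon   # incremental
```

The incremental form only works if the previous snapshot still exists on both sides, which is one of the things tools like Sanoid keep track of for you.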
And you don’t have to script all of this yourself. There are numerous pre-built tools to automate this. We’ve been happy users of Sanoid, which does both the automatic snapshot and automatic replication parts:
https://github.com/jimsalterjrs/sanoid
Another nice thing about snapshot-based backups is that they’re always consistent: just as you can reboot a ZFS based system at any time and have it reboot into a consistent state, you can take a snapshot and send it to another machine, and it will be just as consistent.
Contrast something like rsync, which is making its decisions about what to send on a per-file basis, so that it simply cannot be consistent unless you stop all of the apps that can write to the data store you’re backing up.
Snapshot based backups can occur while the system is under a heavy workload. A ZFS snapshot is nearly free to create, and once set, it freezes the data blocks in a consistent state. This benefit falls out nearly for free with a copy-on-write filesystem.
Now that you’re doing snapshot-based backups, you’re immune to crypto malware, as long as you keep your snapshots long enough to cover your maximum detection window. Someone just encrypted all your stuff? Fine, roll it back. You don’t even have to go to the backup server.
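As a sketch of the "roll it back" step: the helper below picks the newest snapshot taken before a given moment. It assumes input in the form printed by `zfs list -H -p -t snapshot -o name,creation -s creation` (tab-separated name and epoch seconds, oldest first). The function name `last_clean_snapshot` and the cutoff value are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: read "<snapshot-name> <creation-epoch>" lines on
# stdin, sorted oldest first, and print the newest snapshot created
# before the cutoff epoch given as $1. Prints nothing if none qualifies.
last_clean_snapshot() {
    cutoff=$1
    awk -v c="$cutoff" '
        $2 < c { name = $1 }
        END    { if (name != "") print name }'
}

# Typical use on a live system (not run here), with the dataset name and
# cutoff time being placeholders:
#   zfs list -H -p -t snapshot -o name,creation -s creation tank/proj |
#       last_clean_snapshot 1561939200
```

You would then inspect that snapshot and, if it looks good, roll back to it with `zfs rollback -r`. Be aware that `-r` destroys every snapshot newer than the one you roll back to, so check the name twice.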
when one fails, "identify" rarely works, which means use smartctl or MegaCli64 (or the lsi script) to find the s/n of the drive, then guess…
It’s really nice when you get a disk status report and the missing disk is clear from the labels:
left-1: OK
left-2: OK
left-4: OK
right-1: OK
right-2: OK
right-3: OK
right-4: OK
Hmmm, which disk died, I wonder? Gotta be left-3! No need to guess, the system just told you in human terms, rather than in abstract hardware terms.
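If you want the machine to do that comparison for you, something like the following sketch would work. It assumes a status report with one "label: STATUS" line per disk, as in the example above, and a list of the labels you expect; `missing_disks` is a name invented here, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: $1 is a whitespace-separated list of expected bay
# labels; stdin carries "label: STATUS" lines, one per disk. Prints each
# expected label that does not appear in the report.
missing_disks() {
    # Keep only the label part of each status line.
    seen=$(sed 's/:.*//')
    for label in $1; do
        printf '%s\n' "$seen" | grep -Fqx -- "$label" ||
            printf '%s\n' "$label"
    done
}

# Example:
#   printf 'left-1: OK\nleft-2: OK\nleft-4: OK\n' |
#       missing_disks "left-1 left-2 left-3 left-4"
```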
Warren Young wrote on 7/1/2019 9:48 AM:
On Jul 1, 2019, at 7:56 AM, Blake Hudson blake@ispn.net wrote:
I've never used ZFS, as its Linux support has been historically poor.
When was the last time you checked?
The ZFS-on-Linux (ZoL) code has been stable for years. In recent months, the BSDs have rebased their offerings from Illumos to ZoL. The macOS port, called O3X, is also mostly based on ZoL.
That leaves Solaris as the only major OS with a ZFS implementation not based on ZoL.
1) A single drive failure in a RAID4 or 5 array (desktop IDE)
Can I take it that by “IDE” you mean “before SATA”, so you’re giving a data point something like twenty years old?
2) A single drive failure in a RAID1 array (Supermicro SCSI)
Another dated tech reference, if by “SCSI” you mean parallel SCSI, not SAS.
I don’t mind old tech per se, but at some point the clock on bugs must reset.
Yes, this experience spans decades and a variety of hardware. I'm all for giving things another try, and would love to try ZFS again now that it's been ported to Linux. As far as mdadm goes, I'm happy with LSI hardware RAID controllers and have no desire to retry mdadm at this time. I have enough enterprise class drives fail on a regular basis (I manage a reasonable volume) that the predictability gained by standardizing on one vendor for HW RAID cards is worth a lot. I have no problem recommending LSI cards to folks that feel the improved availability outweighs the cost (~$500). This would assume those folks have already covered other aspects of availability and redundancy first (power, PSUs, cooling, backups, etc).