On May 9, 2016, at 12:46 PM, Valeri Galtsev <galtsev at kicp.uchicago.edu> wrote:
>
> On Mon, May 9, 2016 1:14 pm, Gordon Messmer wrote:
>> On 05/09/2016 11:01 AM, Valeri Galtsev wrote:
>>> Thanks Gordon! Yes, I know, ZFS, of course. I hear it as you definitely
>>> will use zfs for "bricks" of distributed file system, right?
>>
>> I don't think its use case is limited to that. There aren't many spaces
>> where I think you *shouldn't* plan to use reliable filesystems (ZFS,
>> btrfs, ReFS).
>
> For distributed file system "brick" boxes ZFS (btrfs,...) may be a must,
> but only if distributed filesystem doesn't have its own mechanism
> ensuring file integrity, right?

No. ZFS is superior to RAID in many respects, which makes it valuable for any situation where you care about data integrity, even on a desktop PC.

ObWarStory: I have a ZFS pool on my desktop PC at home. That pool is composed of two 2-disk mirrors, which makes it kind of like RAID 10. Each mirror is in a separate external Thunderbolt disk enclosure.

One of those enclosures started to fail, so I removed both of the raw drives and put them into some cheap USB single-drive enclosures I had on hand.

That's lesson #1: ZFS doesn't care which bus or controller your drives are on. It doesn't even care about the OS type or CPU type. As long as the target system supports the features enabled on the ZFS vdev the drive came from, ZFS will happily attach the drives.

Because the failing enclosure made me nervous about the state of the data on the raw drives, I ran a ZFS scrub operation. This is similar to the "verify" feature of a good hardware RAID controller, except that because ZFS adds a cryptographically strong hash to every stripe it writes, it can detect every problem a hardware RAID controller can, plus several others.

That's lesson #2: ZFS scrub beats the pants off RAID verify. This has nothing to do with distributed storage or bricks or anything else. It is purely about data integrity. Of all the storage you manage, for what percentage of it do you not care about data integrity?

Hours into that scrub operation, I started to see errors! Glad I scrubbed the pool, right? But no worry, ZFS fixed each error. I had no fear that there were undetectable errors, due to the cryptographically-strong hashes used on each block. Not only that, ZFS told me which files the errors affected, so I could test those files at the userspace level, to make sure ZFS's repairs did the right thing.

That's lesson #3: integrating your disk redundancy and checksumming with the filesystem has tangible benefits. Hardware RAID controllers can't tell you which files a given stripe belongs to.

A few hours further along, the scrub operation's error counts started spiking. A lot. Like millions of errors. Was the hard drive dying? No, it turned out to be one of the USB disk enclosures. (Yes, it was a bad week at Warren Young Galactic HQ. A certain amount of heart palpitations occurred.)

Was the scrub operation scribbling all over my disks? No, and that's lesson #4: a hardware RAID controller will refuse to return bad blocks in the middle of a file, but if ZFS encounters an unrecoverable error in the middle of a file, that file simply disappears from the filesystem. (To those who think half a file is better than no file, that's what backups are for.) If ZFS lets you open a file it has reported errors on, it has fixed the problem already. You don't have to verify the file byte-by-byte, because ZFS scrub already did that.
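For anyone who hasn't driven a scrub before, the detect-and-report cycle above boils down to something along these lines; "tank" is just a placeholder pool name, substitute your own:

    zpool scrub tank
    zpool status -v tank

The first kicks off a background scrub of every allocated block in the pool; the second shows its progress and, with -v, lists the names of any files affected by errors, which is what let me go back and sanity-check the repaired files from userspace.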
After all this agida, I bought a new 2-disk enclosure with new disks and added those disks to the failing mirror, temporarily turning the 2-way mirror into a 4-way mirror. This let me replicate the failing disks onto the fresh disks in a secure way. I knew that if ZFS finished resilvering that mirror, I could safely drop the original pair of drives and know -- in a cryptographically-strong way -- that the new disks had an *exact* copy of the original disks' data, even if one drive or the other failed to return correct data.

That's lesson #5: a typical hardware RAID version of that scheme would use a 2-disk controller, which means you'd have to swap out one of the disks for a fresh one, temporarily dropping to zero redundancy. The flexibility to add disks to a pool independent of physical connection means I never lost any redundancy. Even in the worst possible case, with half the stripes on each disk bad, as long as the failures zippered together (each stripe that was bad on one disk was still good on the other), I could always recover every stripe during the resilver operation.

After resilvering the problem mirror, I dropped the two original disks out of the pool, returning the vdev to a 2-way mirror. A subsequent scrub turned up *zero* problems.

And that's lesson #6: even in the face of failing hardware, ZFS will often keep your data safe long enough for you to migrate the data. It doesn't kick whole drives out of the pool at the first hint of a problem. It will keep trying and trying.
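In case anyone wants to pull the same trick, the grow-then-shrink dance is roughly the following; the pool and device names are invented here, so use whatever "zpool status" shows for your own pool:

    zpool attach tank old1 new1
    zpool attach tank old2 new2
    zpool status tank
    zpool detach tank old1
    zpool detach tank old2

Each attach adds another side to the existing mirror vdev and starts a resilver automatically; once zpool status reports the resilver complete, the two detach commands shrink it back down to a 2-way mirror on the new disks, so the vdev never drops below its original redundancy at any point.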