[CentOS] Understanding VDO vs ZFS

On Sat, May 2, 2020 at 10:54 PM david <david at daku.org> wrote:
>
> Folks
>
> I'm looking for a solution for backups because ZFS has failed on me
> too many times.  In my environment, I have a large amount of data
> (around 2tb) that I periodically back up.  I keep the last 5
> "snapshots".  I use rsync so that when I overwrite the oldest backup,
> most of the data is already there and the backup completes quickly,
> because only a small number of files have actually changed.
>
> Because of this low change rate, I have used ZFS with its
> deduplication feature to store the data.  I started using a Centos-6
> installation, and upgraded years ago to Centos7.  Centos 8 is on my
> agenda.  However, I've had several data-loss events with ZFS where
> because of a combination of errors and/or mistakes, the entire store
> was lost.  I've also noticed that ZFS is maintained separately from
> Centos.  At this moment, the Centos 8 update causes ZFS to
> fail.  Looking for an alternate, I'm trying VDO.
>
> In the VDO installation, I created a logical volume containing two
> hard-drives, and defined VDO on top of that logical volume.  It
> appears to be running, yet I find the deduplication numbers don't
> pass the smell test.  I would expect that if the logical volume
> contains three copies of essentially identical data, I should see
> deduplication numbers close to 3.00, but instead I'm seeing numbers
> like 1.15.  I compute the compression number as follows:
>   Use df and extract the value for "1k blocks used" from the third column
>   use vdostats --verbose and extract the number titled "1K-blocks used"

I'd like to know what kind of data you're looking to back up (that
will just help get an idea of whether it's even a good fit for dedupe;
though if it dedupes well on ZFS, it probably is fine).  I'd also like
to know how you configured your VDO volume (provide the 'vdo create'
command you used).  As mentioned in some other responses, can you
provide the vdostats output (the full 'vdostats --verbose' as well as
the base 'vdostats') and the df output for this volume?  That would
help me understand a bit more about what you're experiencing.

The default deduplication window for a VDO volume is ~250G
(--indexMem=0.25).  Assuming you're writing the full 2T of data each
time and want to achieve deduplication across that entire 2T of data,
it would require a "--indexMem=2" configuration (the value is given in
gigabytes).  You may want to account for growth as well, which means
considering a larger value for the '--indexMem' parameter.
Alternatively, if memory isn't as plentiful, you could enable the
sparse index option to cover a significantly larger dedupe window for
a smaller memory commitment, at the cost of an additional on-disk
footprint.  You can look at the documentation [0] to find the specific
requirements.  For this setup, a sparse index with the default memory
footprint (0.25G) would cover ~2.5T, but would require an additional
~20G of storage over the default index configuration.
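
As a rough illustration of the sizing arithmetic above, here is a
minimal Python sketch.  It assumes the window scales linearly with
index memory at the ratios quoted in this message (0.25G covers ~250G
dense and ~2.5T sparse), and it assumes the setting accepts only 0.25,
0.5, 0.75 or a whole number of gigabytes.  The function name and the
growth parameter are made up for the example; check the documentation
[0] before relying on the numbers.

```python
import math

# Ratios taken from the figures in this message (assumed to scale linearly):
#   dense index:  0.25 GB of memory covers ~250 GB of data (~1 TB per GB)
#   sparse index: 0.25 GB of memory covers ~2.5 TB of data (~10 TB per GB)
DENSE_TB_PER_GB = 1.0
SPARSE_TB_PER_GB = 10.0

# Assumption: index memory is set to 0.25, 0.5, 0.75 or a whole number of GB.
FRACTIONAL_STEPS = (0.25, 0.5, 0.75)


def index_mem_gb(data_tb, sparse=False, growth=1.0):
    """Smallest index memory (GB) whose window covers data_tb * growth."""
    window_per_gb = SPARSE_TB_PER_GB if sparse else DENSE_TB_PER_GB
    needed = data_tb * growth / window_per_gb
    for step in FRACTIONAL_STEPS:
        if needed <= step:
            return step
    # Above 0.75 GB, round up to the next whole gigabyte.
    return float(math.ceil(needed))


if __name__ == "__main__":
    print("dense, 2 TB:            ", index_mem_gb(2))
    print("dense, 2 TB, 50% growth:", index_mem_gb(2, growth=1.5))
    print("sparse, 2 TB:           ", index_mem_gb(2, sparse=True))
```

For the 2T case this gives 2 for a dense index and 0.25 for a sparse
one, matching the figures above.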

[0] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/deduplicating_and_compressing_storage/deploying-vdo_deduplicating-and-compressing-storage#vdo-memory-requirements_vdo-requirements

>
> Divide the first by the second.
>
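
For reference, the ratio described above (df "used" divided by
vdostats "used") can be written as a small Python sketch.  The
function name and the sample numbers are hypothetical and are not
taken from your system.  Note that the physical figure reflects
whatever space VDO saves overall, so the result is a combined savings
ratio rather than a pure deduplication count.

```python
def savings_ratio(df_used_kb, vdo_used_kb):
    """Logical 1K-blocks used (df) divided by physical 1K-blocks used (vdostats)."""
    if vdo_used_kb <= 0:
        raise ValueError("physical blocks used must be positive")
    return df_used_kb / vdo_used_kb


if __name__ == "__main__":
    # Hypothetical values: three ~2 TB copies that occupy ~2 TB physically.
    logical_kb = 3 * 2_000_000_000
    physical_kb = 2_000_000_000
    print(round(savings_ratio(logical_kb, physical_kb), 2))
```

With three fully deduplicated copies this comes out to 3.0, which is
the kind of number you said you expected to see.
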
> Can you provide any advice on my use of ZFS or VDO without telling me
> that I should be doing backups differently?

Without more information about what you're attempting to do, I can't
really say that what you're doing is wrong, but I also can't yet say
that VDO is failing to meet a reasonable expectation.  More context
would certainly help get to the bottom of this question.

>
> Thanks
>
> David