[CentOS] Filesystem that doesn't store duplicate data
Ross S. W. Walker
rwalker at medallion.com
Thu Dec 6 16:28:56 UTC 2007
Ruslan Sivak wrote:
>
> This is a bit different than what I was proposing. I know that
> BackupPC already does this on a file level, but I want a filesystem
> that does it at a block level. File level only helps if you're
> backing up multiple systems and they all have the exact same files.
> Block level would help a lot more, I think. You'd be able to do a
> full backup every night and have it only take up around the same
> space as a differential backup. Things like virtual machine disk
> images, which a lot of the time are clones of each other, could
> take up only a small additional amount of space for each clone,
> proportional to the changes that are made to that disk image.
Well then I would look at backup software that does block-level
de-duplication. Even if the file system did this, the backup
software would re-create the duplicates as it read the files back,
unless the backup software were intimately married to the file
system, which would make things a little too proprietary.
You will find that de-duplication can happen on many different
levels here. I was proposing it for near-line data at rest, while
far-line or archival data at rest would be a different scenario.
Near-line needs to be more performance-conscious than far-line.
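
To make the block-level idea concrete, here is a rough sketch in
Python of what it boils down to (the names and the fixed 4K block
size are illustrative; real products use smarter variable-size
chunking and verify block contents on hash collisions):

import hashlib

BLOCK_SIZE = 4096  # illustrative; real systems often chunk variably

def store_stream(stream, blocks):
    """Split a stream into fixed-size blocks, keep each unique block once.

    'blocks' maps digest -> block bytes and is shared across all streams
    stored so far. Returns the list of digests ("recipe") needed to
    rebuild this particular stream.
    """
    recipe = []
    while True:
        block = stream.read(BLOCK_SIZE)
        if not block:
            break
        digest = hashlib.sha1(block).hexdigest()
        # Only previously unseen content consumes new space.
        blocks.setdefault(digest, block)
        recipe.append(digest)
    return recipe

def restore_stream(recipe, blocks):
    """Rebuild the original bytes from a recipe of block digests."""
    return b"".join(blocks[d] for d in recipe)

Two cloned VM disk images stored this way would share almost all
their blocks, so each clone only costs the blocks that have actually
diverged.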
> Nobody really answered this, so I'll ask again. Is there a Windows
> version of FUSE? How does one test a FUSE filesystem while
> developing it? It would be nice to be able to just run something
> from Eclipse once you've made your changes and have a drive mounted
> and ready to test. Being able to debug a filesystem while it's
> running would be great too. Anyone here with experience building
> FUSE filesystems?
While FUSE is a distinctly Linux development, Windows has had
installable file system filters for a long time. These work a
lot like stackable storage drivers in Linux and are the basis
of a lot of storage tools on Windows, including anti-virus
software (as well as rootkits).
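
As for testing a FUSE filesystem while developing it: since it runs
in userland, you can start it in the foreground and debug it like
any other process, from Eclipse or otherwise. A minimal sketch,
assuming the third-party fusepy Python bindings (an assumption; the
classic python-fuse bindings have a different API):

import errno
import stat
import sys

from fuse import FUSE, FuseOSError, Operations  # third-party "fusepy"

class HelloFS(Operations):
    """A read-only filesystem exposing a single file, /hello."""

    DATA = b"hello from userspace\n"

    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path == "/hello":
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(self.DATA))
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello"]

    def read(self, path, size, offset, fh):
        return self.DATA[offset:offset + size]

if __name__ == "__main__":
    # foreground=True keeps the process attached to your terminal or
    # debugger instead of daemonizing.
    FUSE(HelloFS(), sys.argv[1], foreground=True, nothreads=True)

Run it with a mountpoint as its argument, put breakpoints in the
handler methods, and poke at the mount with ls and cat from another
shell; fusermount -u unmounts it.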
Windows does have a de-duplication service that works on the file
level, much like what I proposed, called the Single Instance Storage
Groveler (I like to call it the single instance storage mangler :-).
High-end backup software companies sell block-level de-duplication
options for their software, and proprietary storage appliance
companies also offer block-level de-duplication for their near- and
far-line storage (big $$$).
> Ross S. W. Walker wrote:
> >
> > These are all good and valid issues.
> >
> > Thinking about it some more, I might just implement it as a system
> > service that scans given disk volumes in the background and keeps
> > a hidden directory where it stores its state information and
> > hardlinks named after the md5 hash of the files on the volume. If
> > a collision occurs with an existing md5 hash then the new file is
> > unlinked and re-linked to the md5 hash file; if an md5 hash file
> > exists with no secondary links then it is removed. Maybe monitor
> > the journal or use inotify to just get new files, and once a week
> > do a full volume scan.
> >
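
Here is a rough sketch of that scanner in Python (the .dedup
directory name and function names are made up, and a real service
would also have to handle locking, permissions, and files changing
mid-scan):

import hashlib
import os

STORE = ".dedup"  # hidden per-volume link farm (name is illustrative)

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def scan_volume(volume):
    store = os.path.join(volume, STORE)
    os.makedirs(store, exist_ok=True)
    for dirpath, dirnames, filenames in os.walk(volume):
        if STORE in dirnames:
            dirnames.remove(STORE)  # never descend into our own link farm
        for name in filenames:
            path = os.path.join(dirpath, name)
            link = os.path.join(store, md5_of(path))
            if not os.path.exists(link):
                os.link(path, link)   # first file with this hash: remember it
            elif not os.path.samefile(path, link):
                os.unlink(path)       # duplicate: re-link it to the original
                os.link(link, path)
    # Reap hash files whose only remaining link is our own.
    for name in os.listdir(store):
        link = os.path.join(store, name)
        if os.stat(link).st_nlink == 1:
            os.unlink(link)

As the next paragraph notes, the catch with hardlinks is that
writing through one name changes every name, so a real
implementation would need some copy-on-write step at that point.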
> > This way the file system performs as well as it normally does,
> > and as things go forward duplicate files are eliminated
> > (combined). Of course the problem arises of what to do when one
> > duplicate is modified but the other should remain the same...
> >
> > Of course what you said about revisions that differ just a little
> > won't take advantage of this, but it's file level so it only works
> > with whole files. Still better than nothing.
> >
> > -Ross
> >
> >
> > -----Original Message-----
> > From: centos-bounces at centos.org <centos-bounces at centos.org>
> > To: CentOS mailing list <centos at centos.org>
> > Sent: Thu Dec 06 08:10:38 2007
> > Subject: Re: [CentOS] Filesystem that doesn't store duplicate data
> >
> > On Thursday 06 December 2007, Ross S. W. Walker wrote:
> > > How about a FUSE file system (userland, i.e. NTFS-3G) that
> > > layers on top of any file system that supports hard links?
> >
> > That would be easy but I can see a few issues with that approach:
> >
> > 1) On file level rather than block level you're going to be much
> > more inefficient. I for one have gigabytes of revisions of files
> > that differ only a little from one revision to the next.
> >
> > 2) You have to write all datablocks to disk and then erase them
> > again if you find a match. That will slow you down and create
> > some weird behavior, i.e. you know the FS shouldn't store
> > duplicate data, yet you can't use cp to copy a 10G file if only
> > 9G are free. If you copy an 8G file, you see the usage increase
> > till only 1G is free, then when your app closes the file you are
> > going to go back to 9G free...
> >
> > 3) Rather than continuously looking for matches on block level,
> > you have to search for matches on files that can be any size.
> > That is fine if you have a 100K file - but if you have a 100M or
> > larger file, the checksum calculations will take you forever.
> > This means rather than adding a specific, small penalty to every
> > write call, you add an unknown penalty, proportional to file
> > size, when closing the file. Also, the fact that most C coders
> > don't check the return code of close doesn't make me happy
> > there...
> >
> > Peter.