[CentOS] Filesystem that doesn't store duplicate data
Ross S. W. Walker
rwalker at medallion.com
Thu Dec 6 16:28:56 UTC 2007
Ruslan Sivak wrote:
>
> This is a bit different than what I was proposing. I know that
> BackupPC already does this on a file level, but I want a filesystem
> that does it at a block level. File level only helps if you're
> backing up multiple systems and they all have the exact same files.
> Block level would help a lot more, I think. You'd be able to do a
> full backup every night and have it only take up around the same
> space as a differential backup. Things like virtual machine disk
> images, which a lot of the time are clones of each other, could
> take up only a small additional amount of space for each clone,
> proportional to the changes that are made to that disk image.
Well then I would look at backup software that does block-level
de-duplication. Even if the file system did this, the backup
software would re-create the duplicates as it read the files back,
unless the backup software were intimately married to the file
system, which would make things a little too proprietary.
You will find that de-duplication can happen on many different
levels here. I was proposing it for near-line data at rest, while
far-line or archival data at rest would be a different scenario.
Near-line needs to be more performance-conscious than far-line.
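
To make the block-level idea concrete, here is a rough sketch in
Python of what it boils down to (the names and the fixed 4K block
size are illustrative; real products use smarter variable-size
chunking and verify block contents on hash collisions):

import hashlib

BLOCK_SIZE = 4096  # illustrative; real systems often chunk variably

def store_stream(stream, blocks):
    """Split a stream into fixed-size blocks, keep each unique block once.

    'blocks' maps digest -> block bytes and is shared across all streams
    stored so far. Returns the list of digests ("recipe") needed to
    rebuild this particular stream.
    """
    recipe = []
    while True:
        block = stream.read(BLOCK_SIZE)
        if not block:
            break
        digest = hashlib.sha1(block).hexdigest()
        # Only previously unseen content consumes new space.
        blocks.setdefault(digest, block)
        recipe.append(digest)
    return recipe

def restore_stream(recipe, blocks):
    """Rebuild the original bytes from a recipe of block digests."""
    return b"".join(blocks[d] for d in recipe)

Two cloned VM disk images stored this way would share almost all
their blocks, so each clone only costs the blocks that have actually
diverged.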
> Nobody really answered this, so I'll ask again. Is there a Windows
> version of FUSE? How does one test a FUSE filesystem while
> developing it? It would be nice to be able to just run something
> from Eclipse once you've made your changes and have a drive mounted
> and ready to test. Being able to debug a filesystem while it's
> running would be great too. Anyone here with experience building
> FUSE filesystems?
While FUSE is a distinctly Linux development, Windows has had
installable file system filters for a long time. These work a
lot like stackable storage drivers in Linux and are the basis
of a lot of storage tools on Windows, including anti-virus
software (as well as rootkits).
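
As for testing a FUSE filesystem while developing it: since it runs
in userland, you can start it in the foreground and debug it like
any other process, from Eclipse or otherwise. A minimal sketch,
assuming the third-party fusepy Python bindings (an assumption; the
classic python-fuse bindings have a different API):

import errno
import stat
import sys

from fuse import FUSE, FuseOSError, Operations  # third-party "fusepy"

class HelloFS(Operations):
    """A read-only filesystem exposing a single file, /hello."""

    DATA = b"hello from userspace\n"

    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path == "/hello":
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(self.DATA))
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello"]

    def read(self, path, size, offset, fh):
        return self.DATA[offset:offset + size]

if __name__ == "__main__":
    # foreground=True keeps the process attached to your terminal or
    # debugger instead of daemonizing.
    FUSE(HelloFS(), sys.argv[1], foreground=True, nothreads=True)

Run it with a mountpoint as its argument, put breakpoints in the
handler methods, and poke at the mount with ls and cat from another
shell; fusermount -u unmounts it.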
Windows does have a de-duplication service that works on the file
level, much like what I proposed, called the Single Instance Storage
Groveler (I like to call it the single instance storage mangler :-).
High-end backup software companies sell block-level de-duplication
options for their software, and proprietary storage appliance
companies also offer block-level de-duplication for their near- and
far-line storage (big $$$).
> Ross S. W. Walker wrote:
> >
> > These are all good and valid issues.
> >
> > Thinking about it some more, I might just implement it as a system
> > service that scans given disk volumes in the background and keeps
> > a hidden directory where it stores its state information and
> > hardlinks named after the md5 hash of the files on the volume. If
> > a collision occurs with an existing md5 hash then the new file is
> > unlinked and re-linked to the md5 hash file; if an md5 hash file
> > exists with no secondary links then it is removed. Maybe monitor
> > the journal or use inotify to just get new files, and once a week
> > do a full volume scan.
> >
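
Here is a rough sketch of that scanner in Python (the .dedup
directory name and function names are made up, and a real service
would also have to handle locking, permissions, and files changing
mid-scan):

import hashlib
import os

STORE = ".dedup"  # hidden per-volume link farm (name is illustrative)

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def scan_volume(volume):
    store = os.path.join(volume, STORE)
    os.makedirs(store, exist_ok=True)
    for dirpath, dirnames, filenames in os.walk(volume):
        if STORE in dirnames:
            dirnames.remove(STORE)  # never descend into our own link farm
        for name in filenames:
            path = os.path.join(dirpath, name)
            link = os.path.join(store, md5_of(path))
            if not os.path.exists(link):
                os.link(path, link)   # first file with this hash: remember it
            elif not os.path.samefile(path, link):
                os.unlink(path)       # duplicate: re-link it to the original
                os.link(link, path)
    # Reap hash files whose only remaining link is our own.
    for name in os.listdir(store):
        link = os.path.join(store, name)
        if os.stat(link).st_nlink == 1:
            os.unlink(link)

As the next paragraph notes, the catch with hardlinks is that
writing through one name changes every name, so a real
implementation would need some copy-on-write step at that point.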
> > This way the file system performs as well as it normally does,
> > and as things go forward duplicate files are eliminated
> > (combined). Of course the problem arises of what to do when one
> > duplicate is modified but the other should remain the same...
> >
> > Of course what you said about revisions that differ just a little
> > won't take advantage of this, but it's file level so it only works
> > with whole files. Still better than nothing.
> >
> > -Ross
> >
> >
> > -----Original Message-----
> > From: centos-bounces at centos.org <centos-bounces at centos.org>
> > To: CentOS mailing list <centos at centos.org>
> > Sent: Thu Dec 06 08:10:38 2007
> > Subject: Re: [CentOS] Filesystem that doesn't store duplicate data
> >
> > On Thursday 06 December 2007, Ross S. W. Walker wrote:
> > > How about a FUSE file system (userland, i.e. NTFS-3G) that
> > > layers on top of any file system that supports hard links?
> >
> > That would be easy but I can see a few issues with that approach:
> >
> > 1) On file level rather than block level you're going to be much
> > more inefficient. I for one have gigabytes of revisions of files
> > that differ only a little from one revision to the next.
> >
> > 2) You have to write all datablocks to disk and then erase them
> > again if you find a match. That will slow you down and create
> > some weird behavior, i.e. you know the FS shouldn't store
> > duplicate data, yet you can't use cp to copy a 10G file if only
> > 9G are free. If you copy an 8G file, you see the usage increase
> > till only 1G is free, then when your app closes the file you are
> > going to go back to 9G free...
> >
> > 3) Rather than continuously looking for matches on block level,
> > you have to search for matches on files that can be any size.
> > That is fine if you have a 100K file - but if you have a 100M or
> > larger file, the checksum calculations will take you forever.
> > This means rather than adding a specific, small penalty to every
> > write call, you add an unknown penalty, proportional to file
> > size, when closing the file. Also, the fact that most C coders
> > don't check the return code of close doesn't make me happy
> > there...
> >
> > Peter.