Ruslan Sivak wrote:
This is a bit different than what I was proposing. I know that BackupPC already does this at the file level, but I want a filesystem that does it at the block level. File-level de-duplication only helps if you're backing up multiple systems that all have exactly the same files; block level would help a lot more, I think. You'd be able to do a full backup every night and have it take up only about the same space as a differential backup. Things like virtual machine disk images, which are often clones of each other, could take up only a small additional amount of space per clone, proportional to the changes made to that disk image.
Well, then I would look at backup software that does block-level de-duplication. Even if the file system did this, the backup software would re-create the duplicate data as it read the files, unless it was intimately married to the file system, which makes things a little too proprietary.
You will find that de-duplication can happen at many different levels here. I was proposing it for near-line data at rest; far-line or archival data at rest would be a different scenario. Near-line needs to be more performance-conscious than far-line.
Nobody really answered this, so I'll ask again. Is there a Windows version of FUSE? How does one test a FUSE filesystem while developing it? It would be nice to just be able to run something from Eclipse once you've made your changes and have a drive mounted and ready to test. Being able to debug a filesystem while it's running would be great too. Anyone here with experience building FUSE filesystems?
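For the testing/debugging part of that question, here is a minimal sketch, assuming the fusepy Python bindings (an assumption on my part, not something mentioned in the thread): a toy read-only filesystem mounted in the foreground with debug output, so it can be launched straight from an IDE and every request libfuse dispatches gets printed.

# toy_fs.py -- minimal sketch using the fusepy bindings (an assumption, not from
# the thread). Run as: python toy_fs.py /mnt/toy
import errno
import stat
import sys

from fuse import FUSE, FuseOSError, Operations


class HelloFS(Operations):
    """A read-only filesystem with a single file, /hello."""

    DATA = b"hello from a toy FUSE filesystem\n"

    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path == "/hello":
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(self.DATA))
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello"]

    def read(self, path, size, offset, fh):
        return self.DATA[offset:offset + size]


if __name__ == "__main__":
    # foreground=True keeps the process attached to the terminal/IDE so you can
    # set breakpoints; debug=True makes libfuse print every incoming request.
    FUSE(HelloFS(), sys.argv[1], foreground=True, debug=True)

Running it in the foreground means you can attach a debugger to the live process while poking at the mount point from another shell.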
While FUSE is a distinctly Linux development, Windows has had installable file system filters for a long time. These work a lot like stackable storage drivers in Linux and are the basis of a lot of storage tools on Windows, including anti-virus software (as well as rootkits).
Windows does have a de-duplication service that works at the file level, much like what I proposed, called the Single Instance Storage Groveler (I like to call it the single instance storage mangler :-). High-end backup software companies offer block-level de-duplication options for their software, and proprietary storage appliance companies also have block-level de-duplication for their near- and far-line storage (big $$$).
Ross S. W. Walker wrote:
These are all good and valid issues.
Thinking about it some more, I might just implement it as a system service that scans given disk volumes in the background and keeps a hidden directory where it stores its state information and hardlinks named after the MD5 hash of the files on the volume. If a collision occurs with an existing MD5 hash, the new file is unlinked and re-linked to the MD5 hash file; if an MD5 hash file exists with no secondary links, it is removed. Maybe monitor the journal or use inotify to just pick up new files, and once a week do a full volume scan. This way the file system performs as well as it normally does, and as things go forward duplicate files are eliminated (combined). Of course the problem arises of what to do when one duplicate is modified but the other should remain the same...

Of course, as you said, revisions that differ just a little won't take advantage of this; it's file level, so it only works with whole files. Still, better than nothing.
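A rough sketch of that scanner, in Python, assuming a single volume with a hidden ".dedup" store directory at its root (the store name and the volume path are illustrative, not from the thread):

# dedup_scan.py -- sketch of the background scanner described above. The store
# directory name and volume path are made up for illustration.
import hashlib
import os

STORE = ".dedup"   # hidden directory holding one hardlink per unique MD5


def md5sum(path, chunk=1 << 20):
    """MD5 of a file, read in 1 MiB chunks so large files don't blow up memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def scan(volume):
    volume = os.path.abspath(volume)
    store = os.path.join(volume, STORE)
    os.makedirs(store, exist_ok=True)
    for root, dirs, files in os.walk(volume):
        dirs[:] = [d for d in dirs if os.path.join(root, d) != store]
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            entry = os.path.join(store, md5sum(path))
            if not os.path.exists(entry):
                os.link(path, entry)          # first copy seen: record it
            elif not os.path.samefile(path, entry):
                # duplicate content: swap the file for a hardlink to the store copy
                tmp = path + ".dedup-tmp"
                os.link(entry, tmp)
                os.replace(tmp, path)
    # prune store entries whose only remaining link is the store's own
    for name in os.listdir(store):
        entry = os.path.join(store, name)
        if os.stat(entry).st_nlink == 1:
            os.unlink(entry)


if __name__ == "__main__":
    scan("/srv/data")                         # hypothetical volume to scan

Note that this inherits exactly the problem mentioned above: once two paths are hardlinked, modifying one in place modifies the other, so a real implementation would have to break the link on write (or rely on applications that replace files rather than rewrite them in place).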
-Ross
-----Original Message-----
From: centos-bounces@centos.org
To: CentOS mailing list <centos@centos.org>
Sent: Thu Dec 06 08:10:38 2007
Subject: Re: [CentOS] Filesystem that doesn't store duplicate data
On Thursday 06 December 2007, Ross S. W. Walker wrote:
How about a FUSE file system (userland, i.e. like NTFS-3G) that layers on top of any file system that supports hard links?
That would be easy, but I can see a few issues with that approach:
- At the file level rather than the block level you're going to be much more inefficient. I, for one, have gigabytes of revisions of files that have changed only a little between each revision.
- You have to write all the data blocks to disk and then erase them again if you find a match. That will slow you down and create some weird behavior: you know the FS shouldn't store duplicate data, yet you can't use cp to copy a 10G file if only 9G are free. If you copy an 8G file, you see the usage increase until only 1G is free, then when your app closes the file you go back to 9G free...
- Rather than continuously looking for matches at the block level, you have to search for matches on files that can be any size. That is fine if you have a 100K file, but if you have a 100M or larger file the checksum calculations will take forever. This means that rather than adding a specific, small penalty to every write call, you add an unknown penalty, proportional to file size, when closing the file. Also, the fact that most C coders don't check the return code of close() doesn't make me happy there...
Peter.