[CentOS] Filesystem that doesn't store duplicate data

Ross S. W. Walker rwalker at medallion.com
Thu Dec 6 05:21:51 UTC 2007


Ruslan Sivak wrote:
> 
> Peter Arremann wrote:
> > On Wednesday 05 December 2007, redhat at mckerrs.net wrote:
> >   
> >> You'd think that using this technology on a live filesystem could
> >> incur a significant performance penalty due to all those calculations
> >> (FUSE module anyone?). Imagine a hardware-optimized data de-duplication
> >> disk controller, similar to RAID XOR-optimized CPUs. Now that would be
> >> cool. All it would need to store was metadata when it had already seen
> >> the exact same block. I think fundamentally it is similar in result to
> >> on-the-fly disk compression.
> >
> > Actually, the impact - if the filesystem is designed correctly -
> > shouldn't be that horrible. After all, Sun has managed to integrate
> > checksums into ZFS and still get great performance. In addition, ZFS
> > doesn't directly overwrite data but uses a new data block each time...
> >
> > What you would have to do then is keep a lookup table with the
> > checksums to find possible matches quickly. Then when you find one, do
> > another compare to be 100% sure you didn't have a collision on your
> > checksums. If that works, then you can reference that data block.
> >
> > It is still a lot of work, but as Sun showed, on-the-fly compares and
> > checksums are doable without too much of a hit.
> >
> > Peter.
> I'm not very knowledgeable on how filesystems work.  Is there a primer
> I can brush up on somewhere?  I'm thinking about implementing a proof
> of concept using Java and FUSE.
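As a starting point, here is the lookup-table scheme Peter describes,
reduced to an in-memory toy in plain Java. Everything here is made up
for illustration - the class name and the HashMaps standing in for
what would really be on-disk checksum index and block store structures:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BlockDedupTable {
    // In-memory stand-ins for the on-disk checksum index and block store.
    private final Map<String, Long> checksumIndex = new HashMap<String, Long>();
    private final Map<Long, byte[]> blockStore = new HashMap<Long, byte[]>();
    private long nextBlock = 0;

    // Returns the block number the caller should reference for 'data'.
    public long writeBlock(byte[] data) throws NoSuchAlgorithmException {
        String sum = toHex(MessageDigest.getInstance("MD5").digest(data));
        Long candidate = checksumIndex.get(sum);
        // Checksum hit: do another compare to be 100% sure it is not
        // just a collision before referencing the existing block.
        if (candidate != null && Arrays.equals(blockStore.get(candidate), data)) {
            return candidate.longValue();
        }
        long block = nextBlock++;   // genuinely new data, allocate a block
        blockStore.put(Long.valueOf(block), data.clone());
        checksumIndex.put(sum, Long.valueOf(block));
        return block;
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", Byte.valueOf(b)));
        }
        return sb.toString();
    }
}

Note that the expensive byte-for-byte compare only fires on a checksum
hit, so the common write path pays just for the hash and one map lookup.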

How about a FUSE file system (userland, e.g. NTFS-3G) that layers
on top of any file system that supports hard links? It would
intercept the FS API, store all files in a hidden directory named
after their MD5 hashes, and hard link them to the file names in
the user directory structure. When the number of links to a stored
file drops to 1, its hash entry is removed; when a new file is
copied in and its hash collides with an existing one, the data is
discarded and only a hard link is made.

Of course it will be a little more involved than this, but the
idea is to keep it really simple so it's less likely to break.
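A rough, untested sketch of that hash-and-link bookkeeping in plain
Java, using ordinary file calls where the real thing would sit behind
FUSE. The class name, the ".hashstore" directory, and the use of
java.nio.file instead of actual FUSE bindings are all just assumptions
for illustration, and the "unix:nlink" attribute only exists on
Unix-like platforms:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashLinkStore {
    private final Path hashDir; // hidden dir: one real copy per unique hash

    public HashLinkStore(Path root) throws IOException {
        this.hashDir = root.resolve(".hashstore");
        Files.createDirectories(hashDir);
    }

    // Store 'data' at 'userPath'; duplicate content becomes a hard link.
    public void store(Path userPath, byte[] data)
            throws IOException, NoSuchAlgorithmException {
        String sum = toHex(MessageDigest.getInstance("MD5").digest(data));
        Path canonical = hashDir.resolve(sum);
        if (!Files.exists(canonical)) {
            Files.write(canonical, data);   // first time we see this content
        }
        // On a hash hit the data is discarded and only a hard link is
        // made. (A byte compare here would guard against MD5 collisions,
        // as Peter suggested.)
        Files.createLink(userPath, canonical);
    }

    // Unlink a user file; once the link count of the hidden copy drops
    // to 1, only the hash entry itself remains, so remove it too. The
    // caller supplying 'sum' hand-waves the reverse lookup a real
    // filesystem layer would do itself.
    public void remove(Path userPath, String sum) throws IOException {
        Files.delete(userPath);
        Path canonical = hashDir.resolve(sum);
        Number links = (Number) Files.getAttribute(canonical, "unix:nlink");
        if (links.intValue() == 1) {
            Files.delete(canonical);
        }
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", Byte.valueOf(b)));
        }
        return sb.toString();
    }
}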

-Ross




