[CentOS] Filesystem that doesn't store duplicate data
Ross S. W. Walker
rwalker at medallion.com
Thu Dec 6 05:21:51 UTC 2007
Ruslan Sivak wrote:
>
> Peter Arremann wrote:
> > On Wednesday 05 December 2007, redhat at mckerrs.net wrote:
> >
> >> You'd think that using this technology on a live filesystem could
> >> incur a significant performance penalty due to all those
> >> calculations (fuse module anyone ?). Imagine a hardware optimized
> >> data de-duplication disk controller, similar to raid XOR optimized
> >> cpus. Now that would be cool. All it would need to store was
> >> meta-data when it had already seen the exact same block. I think
> >> fundamentally it is similar in result to on the fly disk
> >> compression.
> >>
> >
> > Actually, the impact - if the filesystem is designed correctly -
> > shouldn't be that horrible. After all, Sun has managed to integrate
> > checksums into ZFS and still get great performance. In addition,
> > ZFS doesn't directly overwrite data but uses a new datablock each
> > time...
> >
> > What you would have to do then is keep a lookup table with the
> > checksums to find possible matches quickly. Then when you find one,
> > do another compare to be 100% sure you didn't have a collision on
> > your checksums. If that works, then you can reference that
> > datablock.
> >
> > It is still a lot of work, but as sun showed, on the fly compares
> > and checksums are doable without too much of a hit.
> >
> > Peter.
> >
> >
> >
> I'm not very knowledgeable on how filesystems work. Is there a primer
> I can brush up on somewhere? I'm thinking about implementing a proof
> of concept using Java and Fuse.
How about a FUSE file system (userland, e.g. NTFS-3G) that layers
on top of any file system that supports hard links? It would
intercept the FS API, store every file in a hidden directory under
a name taken from its MD5 hash, and hard link that file to its
name in the user's directory structure. When the link count of a
stored file drops to 1, the hashed copy is removed. When a new file
is copied in and its hash matches an existing one, the new data is
discarded and only a hard link is made.
Of course it will be a little more involved than this, but the
idea is to keep it really simple so it's less likely to break.
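
To make the hard-link idea concrete, here is a minimal sketch in
Python of just the store/remove bookkeeping. It is not a FUSE module,
and the names file_md5, store_file and remove_file are made up for
illustration. It also assumes the byte-for-byte compare Peter
suggested, so that an MD5 match alone never causes data to be thrown
away.

```python
import filecmp
import hashlib
import os
import shutil


def file_md5(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_file(src, dest, store_dir):
    """Add src to the store and hard link it as dest (hypothetical helper).

    store_dir plays the role of the hidden directory; dest must be on
    the same file system, because hard links cannot cross file systems.
    """
    os.makedirs(store_dir, exist_ok=True)
    blob = os.path.join(store_dir, file_md5(src))
    if os.path.exists(blob):
        # Same hash already stored: verify the bytes before discarding
        # the new data (assumption: refuse on a real collision).
        if not filecmp.cmp(src, blob, shallow=False):
            raise RuntimeError("MD5 collision with different content")
    else:
        shutil.copyfile(src, blob)
    os.link(blob, dest)
    return blob


def remove_file(path, store_dir):
    """Remove a user-visible name; drop the stored copy if it was the last."""
    blob = os.path.join(store_dir, file_md5(path))
    os.unlink(path)
    # A link count of 1 means only the store's own name is left.
    if os.path.exists(blob) and os.stat(blob).st_nlink == 1:
        os.unlink(blob)
```

One thing this sketch leaves out: hard-linked names share the same
data, so a write through one name would change every duplicate. A real
implementation would have to break the link (copy, then re-hash) on
modification.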
-Ross