Ruslan Sivak wrote:
Peter Arremann wrote:
On Wednesday 05 December 2007, redhat@mckerrs.net wrote:
You'd think that using this technology on a live filesystem could incur a significant performance penalty due to all those calculations (FUSE module, anyone?). Imagine a hardware-optimized data de-duplication disk controller, similar to RAID XOR-optimized CPUs. Now that would be cool. All it would need to store is metadata when it had already seen the exact same block. I think fundamentally it is similar in result to on-the-fly disk compression.
Actually, the impact - if the filesystem is designed correctly - shouldn't be that horrible. After all, Sun has managed to integrate checksums into ZFS and still get great performance. In addition, ZFS doesn't directly overwrite data but uses a new datablock each time...
What you would have to do then is keep a lookup table with the checksums to find possible matches quickly. Then when you find one, do a second, byte-for-byte compare to be 100% sure you didn't have a collision on your checksums. If the blocks really are identical, you can reference the existing datablock instead of storing a new one.
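The lookup-table-plus-verify idea above can be sketched in a few lines of Python. This is only a toy in-memory illustration, not how ZFS or any real filesystem does it: the class `DedupBlockStore`, its methods, and the pluggable `checksum` parameter are all hypothetical names made up for this example, and SHA-256 is just an assumed default checksum.

```python
import hashlib


class DedupBlockStore:
    """Toy in-memory block store (hypothetical, for illustration only)."""

    def __init__(self, checksum=None):
        # Assumption: SHA-256 as the default checksum; any function
        # mapping bytes to a hashable value can be passed instead.
        self.checksum = checksum or (lambda data: hashlib.sha256(data).digest())
        self.blocks = []        # physical blocks, indexed by block id
        self.by_checksum = {}   # checksum -> ids of blocks with that checksum

    def write(self, data: bytes) -> int:
        """Store a block and return its id, reusing an identical block if present."""
        key = self.checksum(data)
        # The lookup table narrows the search to a few candidates quickly.
        for block_id in self.by_checksum.get(key, []):
            # Second compare: rule out a checksum collision.
            if self.blocks[block_id] == data:
                return block_id
        # No identical block seen before: store a new one.
        self.blocks.append(data)
        block_id = len(self.blocks) - 1
        self.by_checksum.setdefault(key, []).append(block_id)
        return block_id

    def read(self, block_id: int) -> bytes:
        return self.blocks[block_id]
```

Writing the same 4 KB block twice returns the same id and stores the data once. Passing a deliberately weak checksum such as `len` forces collisions and shows that the byte-for-byte compare keeps different blocks apart.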
It is still a lot of work, but as Sun showed, on-the-fly compares and checksums are doable without too much of a hit.
Peter.
I'm not very knowledgeable on how filesystems work. Is there a primer I can brush up on somewhere? I'm thinking about implementing a proof of concept using Java and FUSE.
How about a FUSE file system (userland, like NTFS-3G) that layers on top of any file system that supports hard links? It would intercept the FS API, store every file in a hidden directory under a name derived from its MD5 hash, and hard link that file to the file name in the user's directory structure. When the number of links drops to 1, the hashed file is removed. When a new file is copied in and its hash matches an existing one, the data is discarded and only a hard link is made.
Of course it will be a little more involved than this, but the idea is to keep it really simple so it's less likely to break.
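As a rough sketch of the bookkeeping described above, here is the hard-link store in plain Python, without the FUSE layer. The function names `add_file` and `remove_file` and the store layout are hypothetical. The sketch assumes the store directory and the user path sit on the same file system (hard links cannot cross file systems) and that files are never modified in place after being added. Like the proposal, it trusts an MD5 match without a byte-for-byte compare.

```python
import hashlib
import os


def add_file(store_dir: str, user_path: str, data: bytes) -> str:
    """Store data under its MD5 name and hard link it to user_path."""
    os.makedirs(store_dir, exist_ok=True)
    stored = os.path.join(store_dir, hashlib.md5(data).hexdigest())
    if not os.path.exists(stored):
        # First time this content is seen: write it into the hidden store.
        with open(stored, "wb") as f:
            f.write(data)
    # Otherwise the incoming data is simply discarded.
    os.link(stored, user_path)  # the user-visible name is just another link
    return stored


def remove_file(store_dir: str, user_path: str) -> None:
    """Remove the user's link; drop the stored copy once nothing else uses it."""
    # Assumption: content is unchanged since add_file, so rehashing finds it.
    with open(user_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    stored = os.path.join(store_dir, digest)
    os.unlink(user_path)
    # A link count of 1 means only the store's own name is left.
    if os.stat(stored).st_nlink == 1:
        os.unlink(stored)
```

Adding the same content under two names leaves one file in the store with a link count of 3. Removing both names brings the count back to 1, at which point the stored copy is deleted.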
-Ross