[CentOS] Filesystem that doesn't store duplicate data
Ruslan Sivak
rsivak at istandfor.com
Thu Dec 6 05:01:29 UTC 2007
Peter Arremann wrote:
> On Wednesday 05 December 2007, redhat at mckerrs.net wrote:
>
>> You'd think that using this technology on a live filesystem could incur a
>> significant performance penalty due to all those calculations (FUSE module,
>> anyone?). Imagine a hardware-optimized data de-duplication disk
>> controller, similar to RAID XOR-optimized CPUs. Now that would be cool. All
>> it would need to store is metadata when it has already seen the exact
>> same block. I think fundamentally it is similar in result to on-the-fly
>> disk compression.
>>
>
> Actually, the impact - if the filesystem is designed correctly - shouldn't be
> that horrible. After all, Sun has managed to integrate checksums into ZFS and
> still get great performance. In addition, ZFS doesn't directly overwrite data
> but uses a new datablock each time...
>
> What you would have to do then is keep a lookup table with the checksums to
> find possible matches quickly. Then when you find one, do another compare to
> be 100% sure you didn't have a collision on your checksums. If that works,
> then you can reference that datablock.
>
> It is still a lot of work, but as Sun showed, on-the-fly compares and
> checksums are doable without too much of a hit.
>
> Peter.
>
>
>
I'm not very knowledgeable about how filesystems work. Is there a primer I
can brush up on somewhere? I'm thinking about implementing a proof of
concept using Java and FUSE.
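
For what it's worth, the lookup-table scheme Peter describes (checksum
lookup first, then a full compare to rule out a collision) could be
sketched roughly as follows. This is only an in-memory toy under my own
assumptions: the class name DedupStore and its methods are made up for
illustration, SHA-256 stands in for whatever checksum a real filesystem
would use, and nothing here touches FUSE or a disk.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory block store; all names are illustrative only.
class DedupStore {
    // Unique blocks, indexed by block id.
    private final List<byte[]> blocks = new ArrayList<>();
    // Checksum -> ids of stored blocks that have that checksum.
    private final Map<ByteBuffer, List<Integer>> index = new HashMap<>();

    // Stores a block and returns the id of the block holding its data.
    // An identical block that was seen before is referenced, not stored again.
    int write(byte[] data) {
        ByteBuffer key = ByteBuffer.wrap(checksum(data));
        List<Integer> candidates = index.computeIfAbsent(key, k -> new ArrayList<>());
        for (int id : candidates) {
            // Full byte compare guards against a checksum collision.
            if (Arrays.equals(blocks.get(id), data)) {
                return id;
            }
        }
        blocks.add(data.clone());
        int id = blocks.size() - 1;
        candidates.add(id);
        return id;
    }

    // Returns a copy of the data stored under the given block id.
    byte[] read(int id) {
        return blocks.get(id).clone();
    }

    // Number of distinct blocks actually stored.
    int uniqueBlocks() {
        return blocks.size();
    }

    private static byte[] checksum(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DedupStore store = new DedupStore();
        int first = store.write("same block".getBytes());
        int second = store.write("same block".getBytes());
        int third = store.write("other block".getBytes());
        System.out.println(first + " " + second + " " + third
                + " unique=" + store.uniqueBlocks());
    }
}
```

Writing the same block twice should hand back the same id both times, so
only the metadata (the id) is duplicated. A real filesystem would also
need reference counts and copy-on-write so that changing one file does
not alter the others sharing the block; the sketch leaves that out.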
Russ