2009/7/11 oooooooooooo ooooooooooooo hhh735@hotmail.com:
You mentioned that the data can be retrieved from somewhere else. Is some part of this filename a unique key?
The real key is up to 1023 characters long and it's unique, but I have to trim it to 256 characters, and the trimmed name is not unique unless I add the hash.
The fact that this 1023-character file name is unique is very nice. And no trimming is needed! I think you have 2 issues to deal with:
1) You have files with unique names that are, unfortunately, up to 1023 characters long. For file names and paths on Linux with ext3, the limits are:
file name length limit = 255 bytes
path length limit = 4096 bytes
If you try to store such a file directly, you will exceed the file name limit. But if you split the name into N chunks of at most 250 characters each, you can store the file as a sequence of N-1 nested folders plus a file, named after the Nth chunk, inside the (N-1)th folder.
This decomposition translates the unique 1023-character 'file name' into a unique 'file path' of 1023 characters plus separators, which is well below the path length limit.
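The decomposition in 1) could be sketched in bash roughly like this. The function name chunk_path and its optional chunk-size argument are made up for illustration, and the sketch assumes the key contains no "/" and only single-byte characters (the file system limit is in bytes, while bash counts characters):

```bash
#!/bin/bash
# Hypothetical helper, not from any existing tool: split a long key into
# path components of at most $2 characters (default 250), joined by "/".
# Assumes the key contains no "/" and only single-byte characters.
chunk_path() {
    local key=$1 size=${2:-250} path=""
    while [ -n "$key" ]; do
        # Append the next chunk, adding a "/" only between components.
        path="${path:+$path/}${key:0:$size}"
        # Drop the chunk that was just consumed.
        key=${key:$size}
    done
    printf '%s\n' "$path"
}
```

For a 1023-character key this gives 4 directory names of 250 characters and a 23-character file name. To store the file you would run mkdir -p on everything but the last component. One thing to watch: if one key is exactly the first 250 characters of another, the same name would be needed both as a file and as a directory, so such keys need special handling.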
2) You suffer performance degradation when the number of files in a folder goes beyond 1000.
Filipe Brandenburger has suggested a slick scheme to overcome this problem, that will work perfectly without a database:
============quote start
$ echo -n example.txt | md5sum
e76faa0543e007be095bb52982802abe  -

Then say you take the first 4 digits of it to build the hash: e/7/6/f

Then you store file example.txt at: e/7/6/f/example.txt
============quote end
Of course, "example.txt" might be a long file name: "exaaaaa ..... 1000 chars here .....txt", so after the "hash tree" e/7/6/f you would store the nested path structure described in 1).
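A minimal sketch of the hash-tree part, assuming md5sum from coreutils is available. The function name hash_prefix is hypothetical, and using exactly 4 hex digits just follows Filipe's example:

```bash
#!/bin/bash
# Hypothetical helper: print the hash-tree directory for a key, built from
# the first 4 hex digits of its MD5 digest, one digit per directory level.
hash_prefix() {
    local digest
    # printf '%s' feeds the key without a trailing newline, like echo -n.
    digest=$(printf '%s' "$1" | md5sum)
    # md5sum prints "<hex digest>  -"; only the leading digits are used.
    printf '%s/%s/%s/%s\n' "${digest:0:1}" "${digest:1:1}" "${digest:2:1}" "${digest:3:1}"
}

# Example use for a short name (uncomment to try):
#   dir=$(hash_prefix "example.txt")
#   mkdir -p "$dir" && touch "$dir/example.txt"
```

With 4 hex digits there are 16^4 = 65536 leaf directories, so even a few million files should average well under 1000 per directory, provided the hash spreads the names evenly. For long names, the chunked path from 1) would go below the directory this function prints.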
As Les Mikesell suggested, squid and other products have already implemented similar strategies, and you might be able to use either the algorithm or the code that implements it directly. I would spend some time investigating squid's code. I think squid has to deal with exactly the same problem: caching the contents of resources whose URLs might be longer than the file name limit.
If you use this approach - no need for a database to store hashes!
I did some tests on a CentOS 3 system with the following script:
=====================script start
#! /bin/bash

for a in a b c d e f g j; do
    f=""
    for i in `seq 1 250`; do
        f=$a$f
    done
    mkdir $f
    cd $f
done
pwd > some_file.txt
=====================script end
which creates a nested structure of 8 directories, each with a 250-character name, and a file in the innermost one. Total file path length is > 8 * 250. I had no problems accessing this file by its full path:
$ find ./ -name 'some*' -exec cat {} \; | wc -c
2026