Hi,
On Wed, Jul 8, 2009 at 17:59, oooooooooooo ooooooooooooohhh735@hotmail.com wrote:
My original idea was to store each file under a hash of its name, and then store a hash -> real filename mapping in MySQL. This way I have direct access to the file and I can build a directory hierarchy from the first characters of the hash (/c/0/2/a), so I would have 16^4 = 65536 leaves in the directory tree, and the files would be evenly distributed, with around 200 files per directory (which should not cause any performance issues). But the requirement is to use the real file name for the directory tree, which is what causes the issue.
You can hash it and still keep the original filename, and you don't even need a MySQL database to do lookups.
For instance, let's take "example.txt" as the file name.
Then let's hash it, say using MD5 (just for the sake of example, a simpler hash could give you good enough results and be quicker to calculate):
$ echo -n example.txt | md5sum
e76faa0543e007be095bb52982802abe  -
Then say you take the first 4 hex digits of the hash to build the directory path: e/7/6/f
Then you store file example.txt at: e/7/6/f/example.txt
The file still has its original name (example.txt), and if you want to find it, you can just calculate the hash of the name again, which gives you e/7/6/f, and prepend that to the original name.
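A minimal sketch of that lookup in Python, assuming MD5 and a four-digit prefix as in the example above. The function name hashed_path and the root argument are made up for illustration.

```python
import hashlib
import os

def hashed_path(filename, root=".", digits=4):
    """Build root/h/h/h/h/filename from the first hex digits of MD5(filename)."""
    # Hash the file name itself (not the file contents), as in the md5sum example.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # One directory level per hex digit, e.g. "e76f" -> e/7/6/f
    levels = list(digest[:digits])
    return os.path.join(root, *levels, filename)
```

Since the path depends only on the file name, storing and looking up a file both call the same function, and no name-to-hash table is needed.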
I would also suggest that you use fewer directory levels with more branches on each; the best performance comes from balancing the two. For example, in this case (4 hex digits) you would have 4 levels with 16 entries each. If you group the hex digits two by two, you would have (up to) 256 entries on each level, but only two levels of subdirectories. For instance: example.txt -> e7/6f/example.txt. That might (or might not) give you better performance. A benchmark should tell you which one is better, but in any case, both of these setups will be many times faster than having 400,000 files in a single directory.
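To compare the two layouts, here is a rough Python sketch in which the number of levels and the number of hex digits per level are parameters. The names grouped_path and leaf_count are hypothetical, and MD5 is assumed only because it is the hash used in the example.

```python
import hashlib
import os

def grouped_path(filename, root=".", levels=2, width=2):
    """Build a path with `levels` directories of `width` hex digits each."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Cut the start of the digest into fixed-width chunks, e.g. "e76f" -> e7/6f
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)

def leaf_count(levels, width):
    """Maximum number of leaf directories: (16 ** width) entries per level."""
    return (16 ** width) ** levels
```

Both leaf_count(4, 1) and leaf_count(2, 2) give 65536, so the two layouts spread the files over the same number of leaf directories; they differ only in depth (4 versus 2 levels) and in fan-out per level (16 versus up to 256 entries).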
Would that help solve your issue?
HTH, Filipe