This goes out to you admins who manage servers with a heavy load of information.
I would like to know what you do about the number of files in a folder, or whether that is even a concern. I think there is a limit, or a slowdown, if a folder gets too big, but what is optimal (if it matters)?
Example: running a website that allows users to upload some photos (small ones). You get, let's say, 300,000 users each uploading 10 photos. That's 3 million files.
Storing all of that in one folder seems like it would cause an issue when using that folder, is that right?
If it does, what do you do about that? How do you handle things?
If you have 300,000 clients, you could give each of them their own folder; then each folder would hold only 10 photos, but the parent folder would contain 300,000 folders.
So what is best for file management and system resources?
-Thanks Bob
Bob Hoffman wrote:
If you have 300,000 clients, you could give each of them their own folder; then each folder would hold only 10 photos, but the parent folder would contain 300,000 folders.
So what is best for file management and system resources?
Using dir_index on ext3, or a hashing file system, helps... but in many such contexts I've found that a multi-level directory hashing scheme (compute some reproducible hash of the file name or user name/ID and use it to index into a directory structure) can help. -Alan
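For what it's worth, here is a minimal sketch of the kind of multi-level hashing Alan describes, in Python. The base directory, the use of MD5, and the two-characters-per-level split are illustrative choices, not anything prescribed above.

import hashlib
import os

def hashed_path(base_dir, username, filename):
    """Map a user's upload into a two-level hashed directory tree.

    The hash is reproducible, so the same user always lands in the same
    pair of subdirectories, and uploads spread evenly across buckets
    instead of piling up in one huge folder.
    """
    digest = hashlib.md5(username.encode("utf-8")).hexdigest()
    # Two hex chars per level: 256 first-level dirs, 256 second-level
    # dirs under each, i.e. 65,536 buckets before the user directories.
    user_dir = os.path.join(base_dir, digest[:2], digest[2:4], username)
    os.makedirs(user_dir, exist_ok=True)
    return os.path.join(user_dir, filename)

# e.g. hashed_path("/srv/photos", "fred", "beach.jpg")
# -> /srv/photos/<xx>/<yy>/fred/beach.jpg, where xx/yy come from md5("fred")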
I set up using ext3, and with CentOS I believe that means 4 KB blocks, which gives an 8 TB size limit overall. However, I believe that is per logical drive. Also, the number of total files per logical drive follows some strange formula, something like volume size divided by 2 to the 23rd power... but I'm not sure; it may be size divided by 2 and then to the 23rd power. That is a lot of files, I think. 32,000 is the maximum subdirectory count for a directory.
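Rather than working that formula out by hand, you can ask a mounted filesystem for its inode ceiling directly. A small sketch (the mount point is just an example):

import os

def inode_report(mount_point="/home"):
    """Print total/used/free inodes on the filesystem holding mount_point.

    The total inode count is the hard ceiling on how many files and
    directories that filesystem can ever hold, whatever its size in bytes.
    """
    st = os.statvfs(mount_point)
    used = st.f_files - st.f_ffree
    print("%s: %d inodes total, %d used, %d free"
          % (mount_point, st.f_files, used, st.f_ffree))

inode_report()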
I am going with a maximum of 1,000 files per folder and a maximum of 10,000 subdirectories for, say, an image folder. I think this will keep the application I am building fine on most machines.
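For instance, assuming each of those 10,000 subdirectories can itself hold the 1,000-file maximum, one image folder tops out around 10,000,000 files, well above the 3,000,000 in the original example, while no single directory ever grows large.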
For my own sites, I think when approaching a huge volume it will be time to just get some bigger drives with a different file system to host those specific directories, and that should solve it all.
The only way I can see to avoid slowing the machine down is to limit the number of files in a directory and the number of folders in a directory (such as no more than 1,000 first-tier subdirectories in the image folder). And trying to make sure a folder holds either subfolders or files, not both, should also help.
Of course, it would be nice to be able to benchmark the process by number of files, number of subdirectories, and files per subdirectory... there might be a way.
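There is at least a crude way. The sketch below (plain Python, throwaway files in a temp directory) times lookups as a single directory grows, so you can see roughly where your filesystem starts to hurt:

import os
import random
import tempfile
import time

def time_lookups(dir_path, n_files, n_probes=1000):
    """Fill dir_path with n_files empty files, then time random stat() calls."""
    names = ["f%07d" % i for i in range(n_files)]
    for name in names:
        open(os.path.join(dir_path, name), "w").close()
    probes = random.sample(names, min(n_probes, n_files))
    start = time.time()
    for name in probes:
        os.stat(os.path.join(dir_path, name))
    return (time.time() - start) / len(probes)

scratch = tempfile.mkdtemp()
for count in (1000, 10000, 100000):
    d = os.path.join(scratch, str(count))
    os.mkdir(d)
    print("%7d files: %.6f seconds per lookup" % (count, time_lookups(d, count)))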
I think that is the only way to handle it, at least in a small system without large drives and using ext3.
Thanks for all the input.
On Thu, Jul 09, 2009 at 01:04:37PM -0400, Bob Hoffman wrote:
If you have 300,000 clients, you could give each of them their own folder; then each folder would hold only 10 photos, but the parent folder would contain 300,000 folders.
No, because that top-level folder would be split by first letter, or by first and second letters, e.g. "fred" would become f/r/fred (or f/r/ed) and "harry" would become h/a/harry (or h/a/rry).
If you find there's too much clumping (e.g. you have a lot of people whose names begin with "fr"), you hash the name instead and then split on the hash. (A simple hash could just be an incrementing number, a "userid".)
Then you simply program the web server to automatically convert from friendly name to split (or split hash'd) name. So it _looks_ like everyone has names like "fred" and "harry" but your directory structure is a lot more efficient.
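A minimal sketch of that friendly-name-to-split-path mapping (assuming names are at least two characters; shorter names would need padding or a fallback rule):

import os

def split_path(base_dir, name):
    """Convert a friendly name like 'fred' into f/r/fred under base_dir."""
    return os.path.join(base_dir, name[0], name[1], name)

# The web layer keeps the friendly URL and maps it to the real location:
#   /photos/fred/beach.jpg -> split_path("/srv/photos", "fred") + "/beach.jpg"
print(split_path("/srv/photos", "fred"))   # /srv/photos/f/r/fred
print(split_path("/srv/photos", "harry"))  # /srv/photos/h/a/harry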
So what is best for file management and system resources?
Best is subjective. I've just described _one_ method.
On Thu, 2009-07-09 at 13:04 -0400, Bob Hoffman wrote:
So what is best for file management and system resources?
Look at case studies from other companies and see how they did it. Then follow their solution or make it better.
If that company can give you a POC (proof of concept) for 30-90 days, just maybe you might have something.
john