Hi,
I have a program that writes lots of files to a directory tree (around 15 million files), and a node can have up to 400000 files (and I don't have any way to split this amount into smaller ones). As the number of files grows, my application gets slower and slower (the app works something like a cache for another app and I can't redesign the way it distributes files onto disk due to the other app's requirements).
The filesystem I use is ext3 with the following options enabled:
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Is there any way to improve performance in ext3? Would you suggest another FS for this situation (this is a production server, so I need a stable one)?
Thanks in advance (and please excuse my bad English).
oooooooooooo ooooooooooooo wrote:
Hi,
I have a program that writes lots of files to a directory tree (around 15 million files), and a node can have up to 400000 files (and I don't have any way to split this amount into smaller ones). As the number of files grows, my application gets slower and slower (the app works something like a cache for another app and I can't redesign the way it distributes files onto disk due to the other app's requirements).
The filesystem I use is ext3 with the following options enabled:
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Is there any way to improve performance in ext3? Would you suggest another FS for this situation (this is a production server, so I need a stable one)?
Thanks in advance (and please excuse my bad English).
I haven't done, or even seen, any recent benchmarks but I'd expect reiserfs to still be the best at that sort of thing. However even if you can improve things slightly, do not let whoever is responsible for that application ignore the fact that it is a horrible design that ignores a very well known problem that has easy solutions. And don't ever do business with someone who would write a program like that again. Any way you approach it, when you want to write a file the system must check to see if the name already exists, and if not, create it in an empty space that it must also find - and this must be done atomically so the directory must be locked against other concurrent operations until the update is complete. If you don't index the contents the lookup is a slow linear scan - if you do, you then have to rewrite the index on every change so you can't win. Sensible programs that expect to access a lot of files will build a tree structure to break up the number that land in any single directory (see squid for an example). Even more sensible programs would re-use some existing caching mechanism like squid or memcached instead of writing a new one badly.
On 7/8/09 8:56 AM, "Les Mikesell" lesmikesell@gmail.com wrote:
oooooooooooo ooooooooooooo wrote:
Hi,
I have a program that writes lots of files to a directory tree (around 15 million files), and a node can have up to 400000 files (and I don't have any way to split this amount into smaller ones). As the number of files grows, my application gets slower and slower (the app works something like a cache for another app and I can't redesign the way it distributes files onto disk due to the other app's requirements).
The filesystem I use is ext3 with the following options enabled:
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Is there any way to improve performance in ext3? Would you suggest another FS for this situation (this is a production server, so I need a stable one)?
Thanks in advance (and please excuse my bad English).
I haven't done, or even seen, any recent benchmarks but I'd expect reiserfs to still be the best at that sort of thing. However even if you can improve things slightly, do not let whoever is responsible for that application ignore the fact that it is a horrible design that ignores a very well known problem that has easy solutions. And don't ever do business with someone who would write a program like that again. Any way you approach it, when you want to write a file the system must check to see if the name already exists, and if not, create it in an empty space that it must also find - and this must be done atomically so the directory must be locked against other concurrent operations until the update is complete. If you don't index the contents the lookup is a slow linear scan - if you do, you then have to rewrite the index on every change so you can't win. Sensible programs that expect to access a lot of files will build a tree structure to break up the number that land in any single directory (see squid for an example). Even more sensible programs would re-use some existing caching mechanism like squid or memcached instead of writing a new one badly.
In many ways this is similar to issues you'll see in a very active mail or news server that uses maildir wherein the d-entries get too large to be traversed quickly. The only way to deal with it (especially if the application adds and removes these files regularly) is to every once in a while copy the files to another directory, nuke the directory and restore from the copy. This is why databases are better for this kind of intensive data caching.
On Wed, Jul 8, 2009 at 2:27 AM, oooooooooooo ooooooooooooo < hhh735@hotmail.com> wrote:
Hi,
I have a program that writes lots of files to a directory tree (around 15 million files), and a node can have up to 400000 files (and I don't have any way to split this amount into smaller ones). As the number of files grows, my application gets slower and slower (the app works something like a cache for another app and I can't redesign the way it distributes files onto disk due to the other app's requirements).
The filesystem I use is ext3 with the following options enabled:
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Is there any way to improve performance in ext3? Would you suggest another FS for this situation (this is a production server, so I need a stable one)?
I saw this article some time back.
http://www.linux.com/archive/feature/127055
I've not implemented it, but from past experience, you may lose some performance initially, but the database fs performance might be more consistent as the number of files grows.
Perhaps think about running tune2fs maybe also consider adding noatime
Yes, I added it and got a performance increase; anyway, as the number of files grows, the speed keeps dropping below an acceptable level.
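For reference, noatime is a mount option rather than a tune2fs feature; here is a minimal sketch of enabling it, assuming the cache lives on /dev/sdb1 mounted at /cache (the device and mount point are assumptions, adjust to your layout):

# /etc/fstab entry for the cache filesystem (assumed device and mount point):
#   /dev/sdb1  /cache  ext3  defaults,noatime,nodiratime  0 2
# Apply it without a reboot:
mount -o remount,noatime,nodiratime /cache
# dir_index is already enabled on this filesystem; where it is not, it can be
# turned on and existing directories re-indexed (filesystem unmounted) with:
#   tune2fs -O dir_index /dev/sdb1
#   e2fsck -fD /dev/sdb1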
I saw this article some time back.
http://www.linux.com/archive/feature/127055
Good idea. I already use mysql for indexing the files, so every time I need to make a lookup I don't need to list the entire dir and then get the file; anyway, my requirement is keeping the files on disk.
The only way to deal with it (especially if the
application adds and removes these files regularly) is to every once in a while copy the files to another directory, nuke the directory and restore from the copy.
Thanks, but there will not be too many file updates once the cache is done, so recreating directories would not be very helpful here. The issue is that as the number of files grows, both reads from existing files and new insertions get slower and slower.
I haven't done, or even seen, any recent benchmarks but I'd expect
reiserfs to still be the best at that sort of thing.
I've been looking at some benchmarks and reiser seems a bit faster in my scenario; however, my problem happens when I have a large number of files, so from what I have seen, I'm not sure if reiser would be a fix...
However even if
you can improve things slightly, do not let whoever is responsible for that application ignore the fact that it is a horrible design that ignores a very well known problem that has easy solutions.
My original idea was storing the file with a hash of its name, and then storing a hash->real filename mapping in mysql. This way I have direct access to the file and I can make a directory hierarchy with the first characters of the hash (/c/0/2/a), so I would have 16^4 = 65536 leaves in the directory tree, and the files would be evenly distributed, with around 200 files per dir (which should not give any performance issues). But the requirement is to use the real file name for the directory tree, which causes the issue.
Did that program also write your address header ?
:)
Thanks for the help.
Hi,
On Wed, Jul 8, 2009 at 17:59, oooooooooooo ooooooooooooo <hhh735@hotmail.com> wrote:
My original idea was storing the file with a hash of its name, and then storing a hash->real filename mapping in mysql. This way I have direct access to the file and I can make a directory hierarchy with the first characters of the hash (/c/0/2/a), so I would have 16^4 = 65536 leaves in the directory tree, and the files would be evenly distributed, with around 200 files per dir (which should not give any performance issues). But the requirement is to use the real file name for the directory tree, which causes the issue.
You can hash it and still keep the original filename, and you don't even need a MySQL database to do lookups.
For instance, let's take "example.txt" as the file name.
Then let's hash it, say using MD5 (just for the sake of example, a simpler hash could give you good enough results and be quicker to calculate):
$ echo -n example.txt | md5sum
e76faa0543e007be095bb52982802abe  -
Then say you take the first 4 digits of it to build the hash: e/7/6/f
Then you store file example.txt at: e/7/6/f/example.txt
The file still has its original name (example.txt), and if you want to find it, you can just calculate the hash for the name again, in which case you will find the e/7/6/f, and prepend that to the original name.
I would also suggest that you keep less directories levels with more branches on them, the optimal performance will be achieved by getting a balance of them. For example, in this case (4 hex digits) you would have 4 levels with 16 entries each. If you group the hex digits two by two, you would have (up to) 256 entries on each level, but only two levels of subdirectories. For instance: example.txt -> e7/6f/example.txt. That might (or might not) give you a better performance. A benchmark should tell you which one is better, but in any case, both of these setups will be many times faster than the one where you have 400,000 files in a single directory.
Would that help solve your issue?
HTH, Filipe
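As an aside, here is a minimal shell sketch of the scheme Filipe describes (the two-by-two grouping; the helper name is mine, not from the thread):

# Map a file name to its storage path: first 4 hex digits of the MD5 of the
# name, grouped two by two, with the original name underneath.
store_path() {
    local h
    h=$(printf '%s' "$1" | md5sum | cut -c1-4)
    printf '%s/%s/%s\n' "${h:0:2}" "${h:2:2}" "$1"
}
# Example: store_path example.txt  ->  e7/6f/example.txt
mkdir -p "$(dirname "$(store_path example.txt)")"
echo "some content" > "$(store_path example.txt)"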
On Wed, 08 Jul 2009 18:09:28 -0400 Filipe Brandenburger wrote:
You can hash it and still keep the original filename, and you don't even need a MySQL database to do lookups.
Now that is slick as all get-out. I'm really impressed by your scheme, though I don't actually have any use for it right at this moment.
It's really clever.
On Wed, 2009-07-08 at 16:14 -0600, Frank Cox wrote:
On Wed, 08 Jul 2009 18:09:28 -0400 Filipe Brandenburger wrote:
You can hash it and still keep the original filename, and you don't even need a MySQL database to do lookups.
Now that is slick as all get-out. I'm really impressed by your scheme, though I don't actually have any use for it right at this moment.
It's really clever.
--- Yes it is, but think about a SAN server with terabytes of data directories dispersed over multiple controllers. I am kinda curious how that would scale. That's my problem.
John
You can hash it and still keep the original filename, and you don't even need a MySQL database to do lookups.
There is an issue I forgot to mention: the original file name can be up to 1023 characters long. As Linux only allows 256 characters in the file path, I could have a (very small) number of collisions; that's why my original idea was using a hash->filename table. So I'm not sure if I could implement that idea in my scenario.
For instance: example.txt -> e7/6f/example.txt. That might (or might not) give you a better performance.
After a quick calculation, that could put around 3200 files per directory (I have around 15 million files). I think that above 1000 files the performance will start to degrade significantly; anyway, it would be a matter of doing some benchmarks.
Thanks for the advice.
oooooooooooo ooooooooooooo wrote:
You can hash it and still keep the original filename, and you don't even need a MySQL database to do lookups.
There is an issue I forgot to mention: the original file name can be up to 1023 characters long. As Linux only allows 256 characters in the file path, I could have a (very small) number of collisions; that's why my original idea was using a hash->filename table. So I'm not sure if I could implement that idea in my scenario.
For instance: example.txt -> e7/6f/example.txt. That might (or might not) give you a better performance.
After a quick calculation, that could put around 3200 files per directory (I have around 15 million files). I think that above 1000 files the performance will start to degrade significantly; anyway, it would be a matter of doing some benchmarks.
There's C code to do this in squid, and backuppc does it in perl (for a pool directory where all identical files are hardlinked). Source for both is available and might be worth a look at their choices for the depth of the trees and collision handling (backuppc actually hashes the file content, not the name, though).
2009/7/9, oooooooooooo ooooooooooooo hhh735@hotmail.com:
After a quick calculation, that could put around 3200 files per directory (I have around 15 million files). I think that above 1000 files the performance will start to degrade significantly; anyway, it would be a matter of doing some benchmarks.
Depending on the total size of these cache files, as was suggested by nate - throw some hardware at it.
perhaps a hardware ram device will provide adequate performance :
http://www.tomshardware.com/reviews/hyperos-dram-hard-drive-block,1186.html
There's C code to do this in squid, and backuppc does it in perl (for a
pool directory where all identical files are hardlinked).
Unfortunately I have to write the file with some predefined format, so these would not provide the flexibility I need.
Rethink how you're writing files or you'll be in a world of hurt.
It's possible that I will be able to name the directory tree based on the hash of the file, so I would get the structure described in one of my previous posts (4 directory levels, each directory name would be a single character from 0-9 and A-F, and 65536 (16^4) leaves, each leaf containing 200 files). Do you think that this would really improve performance? Could this structure be improved?
BTW, you can pretty much say goodbye to any backup solution for this type
of project as well. They'll all die dealing with a file system structure like this.
We don't plan to use backups (if the data gets corrupted, we can retrieve it again), but thanks for the advice.
I think entry level list pricing starts at about $80-100k for
1 NAS gateway (no disks).
That's far above the budget...
Depending on the total size of these cache files, as was suggested
by nate - throw some hardware at it.
Same as above; it seems they don't want to spend more on HW (so I have to deal with all the performance issues...). Anyway, if I can get all the directories to have around 200 files, I think I will be able to make this work with the current hardware.
Thanks for the advice.
On Thu, 9 Jul 2009, oooooooooooo ooooooooooooo wrote:
It's possible that I will be able to name the directory tree based on the hash of the file, so I would get the structure described in one of my previous posts (4 directory levels, each directory name would be a single character from 0-9 and A-F, and 65536 (16^4) leaves, each leaf containing 200 files). Do you think that this would really improve performance? Could this structure be improved?
If you don't plan on modifying the file after creation I could see it working. You could consider the use of a Berkeley DB style database for quick and easy lookups on large amounts of data, but depending on your exact needs maintenance might be a chore and not really feasible.
It's an interesting suggestion but I don't know if it would actually work like you describe based on having to always compute the hash first.
On Thu, 2009-07-09 at 10:09 -0700, James A. Peltier wrote:
On Thu, 9 Jul 2009, oooooooooooo ooooooooooooo wrote:
It's possible that I will be able to name the directory tree based on the hash of the file, so I would get the structure described in one of my previous posts (4 directory levels, each directory name would be a single character from 0-9 and A-F, and 65536 (16^4) leaves, each leaf containing 200 files). Do you think that this would really improve performance? Could this structure be improved?
If you don't plan on modifying the file after creation I could see it working. You could consider the use of a Berkeley DB style database for quick and easy lookups on large amounts of data, but depending on your exact needs maintenance might be a chore and not really feasible.
MUMPS DB will go at it even faster.
It's an interesting suggestion but I don't know if it would actually work like you describe based on having to always compute the hash first.
Indeed interesting. Actually it would be the same as taking the file to base64 on final storage. My thoughts are it would work. Even faster would be to implement this with the table in RAM.
john
On a side note, perhaps this is something that Hadoop would be good with.
Hi. After talking with the customer, I finally managed to convince him to use the first characters of the hash as directory names.
Now I'm in doubt about the following options:
a) Using 4 directory levels /c/2/a/4/ (200 files per directory) and mysql with a hash->filename table, so I can get the file name from the hash and then directly access it (I first query mysql for the hash of the file, and then I read the file).
b) Using 5 levels without mysql, and making a dir listing (due to technical issues, I only know an approximate file name, so I can't make a direct access here), matching the file name and then reading it. The issue here is that I would have 16^5 leaf directories (more than a million).
I could also make more combinations of mysql/not mysql and number of levels.
Which do you think would give the best performance on ext3?
Thanks.
oooooooooooo ooooooooooooo wrote:
Hi. After talking with the customer, I finally managed to convince him to use the first characters of the hash as directory names.
Now I'm in doubt about the following options:
a) Using 4 directory levels /c/2/a/4/ (200 files per directory) and mysql with a hash->filename table, so I can get the file name from the hash and then directly access it (I first query mysql for the hash of the file, and then I read the file).
b) Using 5 levels without mysql, and making a dir listing (due to technical issues, I only know an approximate file name, so I can't make a direct access here), matching the file name and then reading it. The issue here is that I would have 16^5 leaf directories (more than a million).
I could also make more combinations of mysql/not mysql and number of levels.
Which do you think would give the best performance on ext3?
I don't think you've explained the constraint that would make you use mysql or not. I'd avoid it if everything involved can compute the hash or is passed the whole path, since it is bound to be slower than doing the math. Just on general principles I'd use a tree like 00/AA/FF/filename (three levels of 2 hex characters) as the first cut, although squid uses just two levels with a default of 16 first level and 256 2nd level directories and probably has some good reason for it.
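For comparison, a minimal sketch of pre-creating a squid-style tree (16 first-level and 256 second-level buckets) under an assumed /cache root:

hexdigits="0 1 2 3 4 5 6 7 8 9 a b c d e f"
for a in $hexdigits; do
    for b in $hexdigits; do
        for c in $hexdigits; do
            mkdir -p "/cache/$a/$b$c"
        done
    done
done
# 16 * 256 = 4096 leaf directories; ~15M files works out to roughly 3600 per
# leaf, so a deeper tree (e.g. the 16^4 layout discussed earlier) may suit
# this workload better.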
I don't think you've explained the constraint that would make you use mysql or not.
My original idea was using just the hash as the filename; this way I could have direct access. But the customer rejected this and requested to have part of the long file name (from 11 to 1023 characters). As Linux only allows 256 characters in the path and I could get duplicates with the first 256 chars, I trim the real filename to around 200 characters and add the hash at the end (plus a couple of small metadata fields).
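A minimal sketch of that naming rule (the helper name and the exact trim length are assumptions, not the poster's actual code):

# Build the on-disk name: first ~200 characters of the real name plus the MD5
# of the full name, so that trimmed names which collide stay distinguishable.
disk_name() {
    local h
    h=$(printf '%s' "$1" | md5sum | awk '{print $1}')
    printf '%.200s-%s\n' "$1" "$h"
}
# Example: disk_name "some very long original name" prints the (possibly
# truncated) name followed by "-" and the 32-character hex digest.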
Yes, these requirements do not make too much sense, but I've tried to convince the customer to use just the hash with no luck (it seems he does not understand well what a hash is, although I've tried to explain it several times).
That's why I need to either a) use mysql or b) do a directory listing.
00/AA/FF/filename
That would make up to 256^3 directory leaves, which is more than 16 million; since I have around 15M files, I think that this is an excessive number of directories.
My original idea was using just the hash as the filename; this way I could have direct access. But the customer rejected this and requested to have part of the long file name (from 11 to 1023 characters). As Linux only allows 256 characters in the path and I could get duplicates with the first 256 chars, I trim the real filename to around 200 characters and add the hash at the end (plus a couple of small metadata fields).
Yes, these requirements do not make too much sense, but I've tried to convince the customer to use just the hash with no luck (it seems he does not understand well what a hash is, although I've tried to explain it several times).
That's why I need to either a) use mysql or b) do a directory listing.
I would use either only a database, or only the file system. To me - using them both is a violation of KISS.
If you were able to convince them to change the directory layout, and if you are more comfortable with a database - try to convince them to use a database.
Ok, I could use mysql, but think: we have around 15M entries and I would have to add to each a file from 1KB to 150KB; in total the file size can be around 200GB. How will the performance of this be in mysql?
2009/7/10, oooooooooooo ooooooooooooo hhh735@hotmail.com:
Ok, I could use mysql, but think: we have around 15M entries and I would have to add to each a file from 1KB to 150KB; in total the file size can be around 200GB. How will the performance of this be in mysql?
in the worst case - 150 KB for each of the 15,000,000 files - I get:
15000000 * 150 KB / (1024 * 1024) = 2145.77 GB
or roughly 2 TB
According to my tests the average size per file is around 15KB (although there are files from 1KB to 150KB).
On Fri, Jul 10, 2009 at 16:21, Alexander Georgiev <alexander.georgiev@gmail.com> wrote:
I would use either only a database, or only the file system. To me - using them both is a violation of KISS.
I disagree with your general statement.
Storing content that is appropriate for files (e.g., pictures) as BLOBs in an SQL database only makes it more complex.
Creating "clever" file formats to store relationships between objects in a filesystem instead of using a SQL database only makes it more complex (and harder to extend!).
Think of a website that stores users' pictures and has social networking features (maybe like Flickr?). The natural place to store the JPEG images is the filesystem. The natural place to store user info, favorites, relations between users, etc. is the SQL database. If you try to do it differently, it starts looking like you are trying to fit a square piece into a round hole. It may be possible to do it, but it is certainly not elegant.
Just because you are using fewer technologies doesn't necessarily make it simpler.
Filipe
2009/7/10, Filipe Brandenburger filbranden@gmail.com:
On Fri, Jul 10, 2009 at 16:21, Alexander Georgiev <alexander.georgiev@gmail.com> wrote:
I would use either only a database, or only the file system. To me - using them both is a violation of KISS.
I disagree with your general statement.
Storing content that is appropriate for files (e.g., pictures) as BLOBs in an SQL database only makes it more complex.
Please, explain why. I was under the impression that storing large binary streams is a BLOB's reason to exist.
Creating "clever" file formats to store relationships between objects in a filesystem instead of using a SQL database only makes it more complex (and harder to extend!).
Indeed.
Just because you are using fewer technologies doesn't necessarily make it simpler.
Of course, but if one of those technologies can provide both functionalities without hacks, twists and abuse, I would stay with that single technology.
oooooooooooo ooooooooooooo wrote:
I don't think you've explained the constraint that would make you use mysql or not.
My original idea was using just the hash as the filename; this way I could have direct access. But the customer rejected this and requested to have part of the long file name (from 11 to 1023 characters). As Linux only allows 256 characters in the path and I could get duplicates with the first 256 chars, I trim the real filename to around 200 characters and add the hash at the end (plus a couple of small metadata fields).
Yes, these requirements do not make too much sense, but I've tried to convince the customer to use just the hash with no luck (it seems he does not understand well what a hash is, although I've tried to explain it several times).
You mentioned that the data can be retrieved from somewhere else. Is some part of this filename a unique key? Do you have to track this relationship anyway - or age/expire content? I'd try to arrange things so the most likely scenario would take the fewest operations. Perhaps a mix of hash+filename would give direct access 99+% of the time and you could move all copies of collisions to a different area. Then you could keep the database mapping the full name to the hashed path but you'd only have to consult it when the open() attempt fails.
That's why I need to either a) use mysql or b) do a directory listing.
00/AA/FF/filename
That would make up to 256^3 directory leaves, which is more than 16 million; since I have around 15M files, I think that this is an excessive number of directories.
I guess that's why squid only uses 16 x 256...
You mentioned that the data can be retrieved from somewhere else. Is some part of this filename a unique key?
The real key is up to 1023 characters long and it's unique, but I have to trim it to 256 characters, so it is not unique unless I add the hash.
Do you have to track this relationship anyway - or age/expire content?
I have to track the long filename -> short filename relationship. Age is not relevant here.
I'd try to arrange things
so the most likely scenario would take the fewest operations. Perhaps a mix of hash+filename would give direct access 99+% of the time and you could move all copies of collisions to a different area.
Yes, it's a good idea, but at this point I don't want to add more complexity to my app, and having a separate area for collisions would make it more complex.
Then you could keep the database mapping the full name to the hashed path but you'd only have to consult it when the open() attempt fails.
As the long filename is up to 1023 chars long, I can't index it with mysql (it has a lower max limit); that's why I use the hash, which is indexed. What I do is keep a list of just the MD5s of the cached files in memory in my app; before going to mysql, I first check if it's in the list (really an RB-tree).
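A hedged sketch of what that lookup table might look like (the database, table and column names are assumptions, not the poster's actual schema):

mysql -u cacheuser -p cachedb <<'SQL'
CREATE TABLE IF NOT EXISTS file_map (
    name_md5  CHAR(32)     NOT NULL PRIMARY KEY,  -- MD5 of the full (up to 1023-char) name
    disk_path VARCHAR(500) NOT NULL               -- e.g. c/0/2/a/<trimmed-name>-<md5>
) ENGINE=MyISAM;
SQL
# Lookup by hash (computed client-side, or with MySQL's MD5()):
#   SELECT disk_path FROM file_map WHERE name_md5 = MD5('...full original name...');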
2009/7/11 oooooooooooo ooooooooooooo hhh735@hotmail.com:
You mentioned that the data can be retrieved from somewhere else. Is some part of this filename a unique key?
The real key is up to 1023 characters long and it's unique, but I have to trim it to 256 characters, so it is not unique unless I add the hash.
The fact that this 1023-character file name is unique is very nice. And no trimming is needed! I think you have 2 issues to deal with:
1) you have files with unique file names, unfortunately with length <= 1023 characters. Regarding filenames and paths in linux and ext3 you have:
file name length limit = 254 bytes
path length limit = 4096
If you try to store such a file directly, you will break the file name limit. But if you decompose the name into N chunks each of 250 characters, you will be able to preserve the file as a sequence of N - 1 nested folders plus a file with a name equal to the Nth chunk residing in the (N-1)th folder.
Via this decomposition you will translate the unique 1023-character 'file name' into a unique 1023-character 'file path' with length lower than the path length limit (a helper sketch of this decomposition appears after the test script below).
2) You suffer performance degradation when the number of files in a folder goes beyond 1000.
Filipe Brandenburger has suggested a slick scheme to overcome this problem that will work perfectly without a database:
============quote start
$ echo -n example.txt | md5sum
e76faa0543e007be095bb52982802abe  -
Then say you take the first 4 digits of it to build the hash: e/7/6/f
Then you store file example.txt at: e/7/6/f/example.txt
============quote end
Of course, "example.txt" might be a long filename: "exaaaaa ..... 1000 chars here .....txt", so after the "hash tree" e/7/6/f you will store the file path structure described in 1).
As was suggested by Les Mikesell, squid and other products have already implemented similar strategies, and you might be able to use either the algorithm or directly the code that implements it. I would spend some time investigating squid's code. I think squid has to deal with exactly the same problem - caching the contents of resources whose URLs might be > 254 characters.
If you use this approach - no need for a database to store hashes!
I did some tests on a Centos 3 system with the following script:
=====================script start
#! /bin/bash

for a in a b c d e f g j; do
  f=""
  for i in `seq 1 250`; do
    f=$a$f
  done
  mkdir $f
  cd $f
done
pwd > some_file.txt
=====================script end
which creates a nested directory structure with a file in it. Total file path length is > 8 * 250. I had no problems accessing this file by its full path:
$ find ./ -name 'some*' -exec cat {} \; | wc -c
2026
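A minimal sketch of the decomposition from point 1) as a reusable helper (the function name is mine; it assumes the 250-character chunk size used above):

# Split an over-long unique name into 250-character chunks: all but the last
# chunk become nested directories, the last chunk is the actual file name.
long_name_to_path() {
    local name=$1 path=""
    while [ ${#name} -gt 250 ]; do
        path="$path${name:0:250}/"
        name=${name:250}
    done
    printf '%s%s\n' "$path" "$name"
}
# Usage:
#   p=$(long_name_to_path "$verylongname")
#   mkdir -p "$(dirname "$p")" && touch "$p"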
Thanks, using directories as file names is a great idea, anyway I'm not sure if that would solve my performance issue, as the bottleneck is the disk and not mysql. I just implemented the directory names based on the hash of the file and the performance is a bit slower than before. This is the output of atop (15 secs. avg.):
PRC | sys 0.53s | user 5.43s | #proc 112 | #zombie 0 | #exit 0 |
CPU | sys 4% | user 54% | irq 2% | idle 208% | wait 131% |
cpu | sys 1% | user 24% | irq 1% | idle 54% | cpu001 w 20% |
cpu | sys 2% | user 15% | irq 1% | idle 31% | cpu002 w 52% |
cpu | sys 1% | user 8% | irq 0% | idle 52% | cpu003 w 38% |
cpu | sys 1% | user 7% | irq 0% | idle 71% | cpu000 w 21% |
CPL | avg1 10.58 | avg5 6.92 | avg15 4.66 | csw 19112 | intr 19135 |
MEM | tot 2.0G | free 49.8M | cache 157.4M | buff 116.8M | slab 122.7M |
SWP | tot 1.9G | free 1.2G | | vmcom 2.2G | vmlim 2.9G |
PAG | scan 1536 | stall 0 | | swin 9 | swout 0 |
DSK | sdb | busy 91% | read 884 | write 524 | avio 6 ms |
DSK | sda | busy 12% | read 201 | write 340 | avio 2 ms |
NET | transport | tcpi 8551 | tcpo 8204 | udpi 702 | udpo 718 |
NET | network | ipi 9264 | ipo 8946 | ipfrw 0 | deliv 9264 |
NET | eth0 5% | pcki 6859 | pcko 6541 | si 5526 Kbps | so 466 Kbps |
NET | lo ---- | pcki 2405 | pcko 2405 | si 397 Kbps | so 397 Kbps |
In sdb is the cache and in sda is all the other stuff, including the mysql db files. Note that I have a lot of disk reads in sdb, but I'm really getting one file from disk for each 10 written, so my guess is that all the other reads are directory listings. As I'm using the hash as directory names, (I think) this makes the Linux cache slower, as the files are distributed in a more homogeneous and random way among the directories.
The app is running a bit slower than when using the file name for the directory name, although I expect (not really sure) that it will be better as the number of files on disk grows (currently there are only 600k files out of 15M). My current performance is around 50 file I/Os per second.
Thanks, using directories as file names is a great idea, anyway I'm not sure if that would solve my performance issue, as the bottleneck is the disk and not mysql.
The situation you described initially suffers from only one issue - too many files in one single directory. You are not the first fighting this - see qmail maildir, see squid, etc. The remedy is always one and the same - split the files into a tree folder structure. For a sample implementation - check out squid, BackupPC, etc...
I just implemented the directory names based on the hash of the file and the performance is a bit slower than before. This is the output of atop (15 secs. avg.):
PRC | sys 0.53s | user 5.43s | #proc 112 | #zombie 0 | #exit 0 |
CPU | sys 4% | user 54% | irq 2% | idle 208% | wait 131% |
cpu | sys 1% | user 24% | irq 1% | idle 54% | cpu001 w 20% |
cpu | sys 2% | user 15% | irq 1% | idle 31% | cpu002 w 52% |
cpu | sys 1% | user 8% | irq 0% | idle 52% | cpu003 w 38% |
cpu | sys 1% | user 7% | irq 0% | idle 71% | cpu000 w 21% |
CPL | avg1 10.58 | avg5 6.92 | avg15 4.66 | csw 19112 | intr 19135 |
MEM | tot 2.0G | free 49.8M | cache 157.4M | buff 116.8M | slab 122.7M |
SWP | tot 1.9G | free 1.2G | | vmcom 2.2G | vmlim 2.9G |
I am under the impression that you are swapping. Out of 2GB of RAM, you have just 157MB cache and 116MB buffers. What is eating the RAM? Why do you have 0.8GB of swap used? You need more memory for the file system cache.
PAG | scan 1536 | stall 0 | | swin 9 | swout 0 |
DSK | sdb | busy 91% | read 884 | write 524 | avio 6 ms |
DSK | sda | busy 12% | read 201 | write 340 | avio 2 ms |
NET | transport | tcpi 8551 | tcpo 8204 | udpi 702 | udpo 718 |
NET | network | ipi 9264 | ipo 8946 | ipfrw 0 | deliv 9264 |
NET | eth0 5% | pcki 6859 | pcko 6541 | si 5526 Kbps | so 466 Kbps |
NET | lo ---- | pcki 2405 | pcko 2405 | si 397 Kbps | so 397 Kbps |
In sdb is the cache and in sda is all the other stuff, including the mysql db files. Note that I have a lot of disk reads in sdb, but I'm really getting one file from disk for each 10 written, so my guess is that all the other reads are directory listings. As I'm using the hash as directory names, (I think) this makes the Linux cache slower, as the files are distributed in a more homogeneous and random way among the directories.
I think that the Linux file system cache is smart enough for this type of load. How many files per directory do you have?
The app is running a bit slower than when using the file name for the directory name, although I expect (not really sure) that it will be better as the number of files on disk grows (currently there are only 600k files out of 15M). My current performance is around 50 file I/Os per second.
Something is wrong. Got to figure this out. Where did this RAM go?
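A few stock commands that help answer that question on a CentOS box (nothing beyond the base tools and atop itself is assumed):

free -m                      # how much RAM is truly free vs. used as page cache
vmstat 5 5                   # the si/so columns show ongoing swap-in/swap-out
grep -i -e dirty -e slab /proc/meminfo
slabtop -o | head -n 15      # dentry/inode caches live in the slab
ps aux --sort=-rss | head    # which processes hold the most resident memory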
On Sat, 2009-07-11 at 00:01 +0000, oooooooooooo ooooooooooooo wrote:
You mentioned that the data can be retrieved from somewhere else. Is some part of this filename a unique key?
The real key is up to 1023 characters long and it's unique, but I have to trim it to 256 characters, so it is not unique unless I add the hash.
Do you have to track this relationship anyway - or age/expire content?
I have to track the long filename -> short filename relationship. Age is not relevant here.
I'd try to arrange things
so the most likely scenario would take the fewest operations. Perhaps a mix of hash+filename would give direct access 99+% of the time and you could move all copies of collisions to a different area.
Yes, it's a good idea, but at this point I don't want to add more complexity to my app, and having a separate area for collisions would make it more complex.
Then you could keep the database mapping the full name to the hashed path but you'd only have to consult it when the open() attempt fails.
As the long filename is up to 1023 chars long, I can't index it with mysql (it has a lower max limit); that's why I use the hash, which is indexed. What I do is keep a list of just the MD5s of the cached files in memory in my app; before going to mysql, I first check if it's in the list (really an RB-tree).
--- It is 1024 chars long, which still won't help. MSSQL 2005 and up is longer, if you're interested: http://msdn.microsoft.com/en-us/library/ms143432.aspx But that greatly depends on your data size; 900 bytes is the limit, but it can be exceeded.
You can use either one if you do a unique key id name for the index: file name to unique short name. I would not store images in either one as your SELECT LIKE and Random will kill it. As much as I like DBs, I have to say the flat file system is for those.
John
--- Just a random thought on hashes via the DB that hardly anyone gives any thought to.
Using extended stored procedures (like MSSQL's), you can make your own hashes on the file insert.
USE master; EXEC sp_addextendedproc 'your_md5', 'your_md5.dll'
Of course you will have to create your own .DLL to do the hashing.
Then create your own functions: SELECT dbo.your_md5('YourHash');
Direct: EXEC master.dbo.your_md5 'YourHash'
However I have not a clue that this is even doable in MySQL.
John
On Wed, 8 Jul 2009, oooooooooooo ooooooooooooo wrote:
Hi,
I have a program that writes lots of files to a directory tree (around 15 million files), and a node can have up to 400000 files (and I don't have any way to split this amount into smaller ones). As the number of files grows, my application gets slower and slower (the app works something like a cache for another app and I can't redesign the way it distributes files onto disk due to the other app's requirements).
The filesystem I use is ext3 with the following options enabled:
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Is there any way to improve performance in ext3? Would you suggest another FS for this situation (this is a production server, so I need a stable one)?
Thanks in advance (and please excuse my bad English).
There isn't a good file system for this type of thing. filesystems with many very small files are always slow. Ext3, XFS, JFS are all terrible for this type of thing.
Rethink how you're writing files or you'll be in a world of hurt.
James A. Peltier wrote:
There isn't a good file system for this type of thing. filesystems with many very small files are always slow. Ext3, XFS, JFS are all terrible for this type of thing.
I can think of one... though you'll pay out the ass for it: the Silicon file system from BlueArc (NFS); the file system runs on FPGAs. Our BlueArcs never had more than 50-100,000 files in any particular directory (millions in any particular tree), though they are supposed to be able to handle this sort of thing quite well.
I think entry level list pricing starts at about $80-100k for 1 NAS gateway (no disks).
Our BlueArcs went end of life earlier this year and we migrated to an Exanet cluster (runs on top of CentOS 4.4, though it uses its own file system, clustering and NFS services), which is still very fast though not as fast as BlueArc.
And with block-based replication it doesn't matter how many files there are; performance is excellent for backup, sending data to another rack in your data center, or to another continent over the WAN. In BlueArc's case you can transparently send data to a dedupe device or tape drive based on dynamic access patterns (and move it back automatically when needed).
http://www.bluearc.com/html/products/file_system.shtml http://www.exanet.com/default.asp?contentID=231
Both systems scale to gigabytes/second of throughput linearly, and petabytes of storage without downtime. The only downside to BlueArc is their back end storage: they only offer tier 2 storage and only have HDS for tier 1. You can make an HDS perform but it'll cost you even more... The tier 2 stuff is too unreliable (LSI Logic). Exanet at least supports almost any storage out there (we went with 3PAR).
Don't even try to get a netapp to do such a thing.
nate
On Wed, 8 Jul 2009, oooooooooooo ooooooooooooo wrote:
Hi,
I have a program that writes lots of files to a directory tree (around 15 million files), and a node can have up to 400000 files (and I don't have any way to split this amount into smaller ones). As the number of files grows, my application gets slower and slower (the app works something like a cache for another app and I can't redesign the way it distributes files onto disk due to the other app's requirements).
The filesystem I use is ext3 with the following options enabled:
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
Is there any way to improve performance in ext3? Would you suggest another FS for this situation (this is a production server, so I need a stable one)?
Thanks in advance (and please excuse my bad English).
BTW, you can pretty much say goodbye to any backup solution for this type of project as well. They'll all die dealing with a file system structure like this.
-- James A. Peltier Systems Analyst (FASNet), VIVARIUM Technical Director HPC Coordinator Simon Fraser University - Burnaby Campus Phone : 778-782-6573 Fax : 778-782-3045 E-Mail : jpeltier@sfu.ca Website : http://www.fas.sfu.ca | http://vivarium.cs.sfu.ca http://blogs.sfu.ca/people/jpeltier MSN : subatomic_spam@hotmail.com
The point of the HPC scheduler is to keep everyone equally unhappy.
How many files per directory do you have?
I have 4 directory levels, 65536 leaf directories and around 200 files per dir (15M in total).
Something is wrong. Got to figure this out. Where did this RAM go?
Thanks. I reduced the memory usage of mysql and my app, and I got around a 15% performance increase. Now my atop looks like this (currently reading only cached files from disk):
PRC | sys 0.51s | user 9.29s | #proc 114 | #zombie 0 | #exit 0 |
CPU | sys 4% | user 93% | irq 1% | idle 208% | wait 94% |
cpu | sys 2% | user 48% | irq 1% | idle 21% | cpu001 w 28% |
cpu | sys 1% | user 17% | irq 0% | idle 41% | cpu000 w 40% |
cpu | sys 1% | user 14% | irq 0% | idle 74% | cpu003 w 12% |
cpu | sys 1% | user 13% | irq 0% | idle 72% | cpu002 w 14% |
CPL | avg1 3.45 | avg5 7.42 | avg15 10.76 | csw 15891 | intr 11695 |
MEM | tot 2.0G | free 51.2M | cache 587.8M | buff 1.0M | slab 281.2M |
SWP | tot 1.9G | free 1.9G | | vmcom 1.6G | vmlim 2.9G |
PAG | scan 3072 | stall 0 | | swin 0 | swout 0 |
DSK | sdb | busy 89% | read 1451 | write 0 | avio 6 ms |
DSK | sda | busy 6% | read 178 | write 54 | avio 2 ms |
NET | transport | tcpi 3631 | tcpo 3629 | udpi 0 | udpo 0 |
NET | network | ipi 3632 | ipo 3630 | ipfrw 0 | deliv 3632 |
NET | eth0 0% | pcki 5 | pcko 3 | si 0 Kbps | so 1 Kbps |
NET | lo ---- | pcki 3627 | pcko 3627 | si 775 Kbps | so 775 Kbps |
It is 1024 chars long, which still won't help.
I'm using MyISAM, and according to http://dev.mysql.com/doc/refman/5.1/en/myisam-storage-engine.html: "The maximum key length is 1000 bytes. This can also be changed by changing the source and recompiling. For the case of a key longer than 250 bytes, a larger key block size than the default of 1024 bytes is used."
I would not store images in either one
as your SELECT LIKE and Random will kill it.
Well, I think that this can be avoided; using just searches on the key fields should not give these issues. Does somebody have experience storing a large amount of medium (1KB-150KB) blob objects in mysql?
However I have not a clue that this is even doable in MySQL.
In mysql there is already an MD5 function: http://dev.mysql.com/doc/refman/5.1/en/encryption-functions.html#function_md...
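For what it's worth, MySQL's MD5() produces the same 32-character hex digest as md5sum, so either side can compute the key; a quick check from the shell (client options assumed):

mysql -N -e "SELECT MD5('example.txt');"
# e76faa0543e007be095bb52982802abe   (same as: echo -n example.txt | md5sum)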
Thanks for the help.
On Mon, 2009-07-13 at 05:49 +0000, oooooooooooo ooooooooooooo wrote:
It is 1024 chars long, which still won't help.
I'm using MyISAM, and according to http://dev.mysql.com/doc/refman/5.1/en/myisam-storage-engine.html: "The maximum key length is 1000 bytes. This can also be changed by changing the source and recompiling. For the case of a key longer than 250 bytes, a larger key block size than the default of 1024 bytes is used."
I would not store images in either one
as your SELECT LIKE and Random will kill it.
Well, I think that this can be avoided; using just searches on the key fields should not give these issues. Does somebody have experience storing a large amount of medium (1KB-150KB) blob objects in mysql?
True
An option would be to encode them to Base64 on INSERT, but if you index all of your BLOBs on INSERT there should really be no problem. Besides, 150KB is not big for a BLOB. Consider 20MB to 100MB with multiple joins on MSSQL (64-bit, although apparently size is based on the maximum amount of memory the client has; VARBLOB apparently has no limit per the docs). I can't relate to doing this on MySQL; I can on DB2 and MSSQL, and I can say you can rival the 32-bit MSSQL performance by at least 15 percent. I can only say that I have experience with raw DB predictions in graphing, and edge and adjacency modeling on MySQL.
What I see slowing you down is the TSQL and SPROCs. The DLL for the MD5 I posted earlier will scale to 1000s of inserts at a time. If speed is really of the essence, then use raw partitions for the DB and RAM. Use the MySQL Connector or ODBC, or you will hit size limits on INSERT and SELECT.
However I have not a clue that this is even doable in MySQL.
In mysql there is already an MD5 function: http://dev.mysql.com/doc/refman/5.1/en/encryption-functions.html#function_md...
Yes, I was informed that a call from a SPROC to "md5()" would do the trick and take the load off the client. At least that was the intent of the idea, to balance the load. That is, if this is client/server.
I do wonder about your memory allocation and disk. It is all about the DB design. Think about a genealogy DB. Where do you end design? You don't. Where do predictions end? They don't.
John
JohnS wrote:
On Mon, 2009-07-13 at 05:49 +0000, oooooooooooo ooooooooooooo wrote:
It is 1024 chars long, which still won't help.
I'm using MyISAM, and according to http://dev.mysql.com/doc/refman/5.1/en/myisam-storage-engine.html: "The maximum key length is 1000 bytes. This can also be changed by changing the source and recompiling. For the case of a key longer than 250 bytes, a larger key block size than the default of 1024 bytes is used."
I would not store images in either one
as your SELECT LIKE and Random will kill it.
Well, I think that this can be avoided; using just searches on the key fields should not give these issues. Does somebody have experience storing a large amount of medium (1KB-150KB) blob objects in mysql?
True
An option would be to encode them to Base64 on INSERT, but if you index all of your BLOBs on INSERT there should really be no problem. Besides, 150KB is not big for a BLOB. Consider 20MB to 100MB with multiple joins on MSSQL (64-bit, although apparently size is based on the maximum amount of memory the client has; VARBLOB apparently has no limit per the docs). I can't relate to doing this on MySQL; I can on DB2 and MSSQL, and I can say you can rival the 32-bit MSSQL performance by at least 15 percent. I can only say that I have experience with raw DB predictions in graphing, and edge and adjacency modeling on MySQL.
What I see slowing you down is the TSQL and SPROCs. The DLL for the MD5 I posted earlier will scale to 1000s of inserts at a time. If speed is really of the essence, then use raw partitions for the DB and RAM. Use the MySQL Connector or ODBC, or you will hit size limits on INSERT and SELECT.
However I have not a clue that this is even doable in MySQL.
In mysql there is already an MD5 function: http://dev.mysql.com/doc/refman/5.1/en/encryption-functions.html#function_md...
Yes, I was informed that a call from a SPROC to "md5()" would do the trick and take the load off the client. At least that was the intent of the idea, to balance the load. That is, if this is client/server.
I do wonder about your memory allocation and disk. It is all about the DB design. Think about a genealogy DB. Where do you end design? You don't. Where do predictions end? They don't.
I think you are making this way too complicated. You are going to end up filling a large disk with small bits of data and your speed is going to be limited by how fast the disk head can get to the right place for anything that isn't already in a buffer. Other than the special case of too many entries in a single directory, the software overhead isn't going to make much difference unless you can effectively predict what you are likely to want next or keep the most popular things in your buffers. Hardware-wise, adding RAM is likely to help even if it is just for the filesystem inode/directory cache - and if you are lucky, the LRU data buffering. Also, spreading your data over several disks would help by reducing the head contention.