[CentOS] inquiry about limitation of file system

Sun Nov 4 00:40:36 UTC 2018
Gordon Messmer <gordon.messmer at gmail.com>

On 11/3/18 12:44 AM, yf chu wrote:
> I wonder whether the performance will be affected if there are too many files and directories on the server.


With XFS on modern CentOS systems, you probably don't need to worry:
https://www.youtube.com/watch?v=FegjLbCnoBw

For older systems, as best I understand it: As the directory tree grows, 
the answer to your question depends on how many entries are in the 
directories, how deep the directory structure is, and how random the 
access pattern is.  Ultimately, you want to minimize the number of 
individual disk reads required.

A directory with lots of entries is one situation where you may see 
performance degrade.  Typically, around the time the directory grows 
larger than the maximum size of the direct block list [1] (48 KB), 
reading the directory starts to take a little longer.  Past the maximum 
size of the single indirect block list (4 MB), it will tend to get 
slower again.  File names are stored in the directory itself, so the 
average filename length factors into directory size as much as the 
number of files does.
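
Just to show where those thresholds come from, here's a rough sketch of 
the arithmetic, assuming 4 KB blocks and 4-byte block pointers (typical 
ext2/ext3-style defaults; the actual values vary with filesystem and 
mkfs options):

# Rough arithmetic for the classic inode pointer layout [1], assuming
# 4 KB blocks and 4-byte block pointers; adjust for your filesystem.
BLOCK_SIZE = 4096     # bytes per block
POINTER_SIZE = 4      # bytes per block pointer
DIRECT_BLOCKS = 12    # direct pointers held in the inode itself

# Data addressable by the direct block list alone: 12 * 4096 bytes.
direct_limit = DIRECT_BLOCKS * BLOCK_SIZE
print(f"direct block list limit: {direct_limit // 1024} KB")  # 48 KB

# One indirect block holds 4096 / 4 = 1024 pointers, each to a 4 KB block.
pointers_per_block = BLOCK_SIZE // POINTER_SIZE
single_indirect_limit = pointers_per_block * BLOCK_SIZE
print(f"single indirect limit: "
      f"{single_indirect_limit // (1024 * 1024)} MB")         # 4 MB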

A given file lookup needs to read each of the parent directories to 
locate the next item in the path.  If your path is very deep, then your 
directories are likely to be smaller on average, but you're increasing 
the number of parent-directory lookups in order to reduce the length of 
each block list.  That might make your worst case better, but your 
best case is probably worse.
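
A small sketch of that trade-off, counting one lookup per path 
component (the paths here are made-up examples):

# Each path component is one directory-entry lookup; a deep path with
# small directories trades block-list length for lookup count.
def lookup_count(path):
    """Number of entries the kernel must resolve, one per component."""
    return len([p for p in path.strip("/").split("/") if p])

print(lookup_count("/data/files/report.txt"))          # 3 lookups, bigger dirs
print(lookup_count("/data/a/b/c/d/files/report.txt"))  # 7 lookups, smaller dirs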

The system's cache means that repeatedly accessing a few files in a 
large structure is not as expensive as accessing random files across 
that structure.  If you have a large structure, but users tend to 
access mostly the same files at any given time, then the system won't 
be reading the disk for every lookup.  If accesses aren't random, then 
structure size becomes less important.
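
If you want to see the cache at work, something like the following will 
do; the path is a placeholder, so point it at a real file on your 
system, and expect the exact timings to vary:

import os
import time

# Placeholder path: change to a real file.  On a cold cache the first
# stat() may hit the disk; repeats are served from the kernel's
# dentry/inode caches.
path = "/var/lib/example/deep/tree/file.dat"

def timed_stat(p):
    start = time.perf_counter()
    os.stat(p)
    return time.perf_counter() - start

first = timed_stat(path)
repeat = min(timed_stat(path) for _ in range(100))
print(f"first: {first * 1e6:.1f} us, cached repeat: {repeat * 1e6:.1f} us")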

Hashed name directory structures have been mentioned, and those can be 
useful if you have a very large number of objects to store and they all 
share the same permission set.  A hashed name structure typically 
requires that you store, in a database, a map between the original 
names (that users see) and their hashes.  You could hash each name at 
lookup time, but that doesn't give you a good mechanism for dealing 
with collisions.  Hashed name directory structures typically have worse 
best-case performance due to the extra lookup, but they offer 
predictable, even growth in lookup times.  Where a free-form directory 
structure might have a large difference between its best-case and 
worst-case lookup, a hashed name directory structure should give 
roughly the same access time for all files.
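
A minimal sketch of such a layout, using SHA-256 and a two-level 
fan-out; the storage root is a placeholder, and a real deployment would 
keep the name map in a database (as described above) rather than an 
in-memory dict:

import hashlib
import os

ROOT = "/srv/objects"  # placeholder storage root

def hashed_path(name):
    """Map a user-visible name to an evenly distributed storage path."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    # Two-level fan-out: 256 * 256 leaf directories, each staying small.
    return os.path.join(ROOT, digest[:2], digest[2:4], digest)

# The name -> path map would live in a database; a dict stands in here,
# which is also where collisions or renames would be handled.
name_map = {}

def store_name(name):
    path = hashed_path(name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    name_map[name] = path
    return path

print(hashed_path("report-2018-11.txt"))

The fan-out keeps each leaf directory far below the 48 KB direct-block 
threshold even with millions of objects, which is where the flat access 
time comes from.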


1: https://en.wikipedia.org/wiki/Inode_pointer_structure