On 11/3/18 12:44 AM, yf chu wrote:
> I wonder whether the performance will be affected if there are too many files and directories on the server.
With XFS on modern CentOS systems, you probably don't need to worry: https://www.youtube.com/watch?v=FegjLbCnoBw
For older systems, as best I understand it: As the directory tree grows, the answer to your question depends on how many entries are in the directories, how deep the directory structure is, and how random the access pattern is. Ultimately, you want to minimize the number of individual disk reads required.
Directories with lots of entries are one situation where you may see performance degrade. Typically, around the time the directory grows larger than the maximum size of the direct block list [1] (48 KB), reading the directory starts to take a little longer. Once it passes the maximum size of the single indirect block list (4 MB), it will tend to get slower again. Directory size depends on the length of the file names as well as on the number of files, so average filename length factors in.
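As a rough sketch of where those two thresholds come from, and of how filename length changes the number of entries that fit below them: the numbers here are my assumptions, not measurements. I'm assuming an ext2/ext3-style layout with 4 KiB blocks, 12 direct block pointers per inode, 4-byte block pointers, and directory entries made of an 8-byte header plus the name padded to a 4-byte boundary. Hashed-tree directory indexes and other filesystems behave differently, so treat the output as an estimate only.

```python
BLOCK_SIZE = 4096      # assumed filesystem block size, in bytes
DIRECT_POINTERS = 12   # assumed number of direct block pointers in an inode
POINTER_SIZE = 4       # assumed size of one block pointer, in bytes
ENTRY_HEADER = 8       # assumed fixed part of a directory entry, in bytes


def direct_limit():
    """Bytes addressable through the direct block pointers alone."""
    return DIRECT_POINTERS * BLOCK_SIZE


def single_indirect_limit():
    """Bytes addressable through one single indirect block."""
    return (BLOCK_SIZE // POINTER_SIZE) * BLOCK_SIZE


def entry_size(name_len):
    """Approximate on-disk size of one entry: header plus name padded to 4 bytes."""
    return ENTRY_HEADER + (name_len + 3) // 4 * 4


def entries_that_fit(limit_bytes, avg_name_len):
    """Estimate how many entries fit in limit_bytes, with no entry spanning a block."""
    per_block = BLOCK_SIZE // entry_size(avg_name_len)
    return (limit_bytes // BLOCK_SIZE) * per_block


if __name__ == "__main__":
    print("direct limit:", direct_limit(), "bytes")
    print("single indirect limit:", single_indirect_limit(), "bytes")
    for avg in (8, 16, 32, 64):
        print(avg, "byte names:",
              entries_that_fit(direct_limit(), avg), "entries (direct),",
              entries_that_fit(single_indirect_limit(), avg), "entries (single indirect)")
```

Under those assumptions, doubling the average name length roughly halves the number of files a directory can hold before it crosses each threshold.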
A given file lookup will need to reach each of the parent directories to locate the next item in the path. If your path is very deep, then your directories are likely to be smaller on average, but you're increasing the number of lookups required for parent directories to reduce the length of the block list. It might make your worst-case better, but your best-case is probably worse.
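To put rough numbers on that trade-off, here is a small sketch, assuming an idealized, evenly balanced tree (real trees rarely are). The function name and the figures are hypothetical. It computes how many directory levels you need if you cap the number of entries per directory; each extra level is one more parent directory to look up.

```python
def levels_needed(total_files, max_entries_per_dir):
    """Smallest number of directory levels such that
    max_entries_per_dir ** levels >= total_files (balanced tree assumed)."""
    if total_files < 1 or max_entries_per_dir < 2:
        raise ValueError("need at least 1 file and at least 2 entries per directory")
    levels = 1
    capacity = max_entries_per_dir
    while capacity < total_files:
        capacity *= max_entries_per_dir
        levels += 1
    return levels


if __name__ == "__main__":
    for cap in (100, 1000, 2000, 1_000_000):
        print("cap", cap, "entries per directory:",
              levels_needed(1_000_000, cap), "level(s) for 1,000,000 files")
```

Smaller directories cost more levels, and every file pays for those levels on an uncached lookup, which is why the best case gets worse while the worst case gets better.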
The system's cache means that accessing a few files in a large structure is not as expensive as accessing random files in a large structure. If you have a large structure, but users tend to access mostly the same files at any given time, then the system won't be reading the disk for every lookup. If accesses aren't random, then structure size becomes less important.
Hashed name directory structures have been mentioned, and they can be useful if you have a very large number of objects to store and they all have the same permission set. A hashed name structure typically requires that you store in a database a map between the original names (that users see) and the names' hashes. You could hash each name at lookup instead, but that doesn't give you a good mechanism for dealing with collisions. Hashed name directory structures typically have worse best-case performance due to the extra lookup, but they offer predictable and even growth in lookup time for each file. Where a free-form directory structure might have a large difference between the best-case and worst-case lookup, a hashed name directory structure should have roughly the same access time for all files.
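For illustration, a minimal sketch of one way to lay out a hashed name structure. Everything here is an assumption on my part: SHA-256 as the hash, two levels of two hex characters each, the /srv/store root, and a plain dict standing in for the database that maps hashes back to the names users see.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(original_name, root="/srv/store", levels=2, width=2):
    """Map a user-visible name to a path such as root/ab/cd/abcd...,
    where the subdirectory names are prefixes of the name's hash."""
    digest = hashlib.sha256(original_name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, digest)


def register(name_map, original_name):
    """Record hash -> original name in name_map (a stand-in for the database)
    and return the storage path. Raises if two different names share a hash."""
    path = hashed_path(original_name)
    existing = name_map.setdefault(path.name, original_name)
    if existing != original_name:
        raise ValueError("hash collision between %r and %r" % (existing, original_name))
    return path


if __name__ == "__main__":
    name_map = {}
    for name in ("report.pdf", "photos/2018/cat.jpg"):
        print(name, "->", register(name_map, name))
```

With two hex characters per level, each level has at most 256 subdirectories, so files spread evenly and every lookup walks the same number of directories.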