Hi All,
I have been trying out XFS given it is going to be the file system of choice from upstream in el7. Starting with an Adaptec ASR71605 populated with sixteen 4TB WD enterprise hard drives. The OS is CentOS 6.4 x86_64 and the machine has 64G of RAM.
This next part was not well researched, as I had a colleague bothering me late on Xmas Eve that he needed 14 TB immediately as a destination for data coming off an HPC cluster. I built an XFS file system straight onto the (raid 6) logical device made up of all sixteen drives with:
mkfs.xfs -d su=512k,sw=14 /dev/sda
where "512k" is the Stripe-unit size of the single logical device built on the raid controller. "14" is from the total number of drives minus two (raid 6 redundancy).
Any comments on the above from XFS users would be helpful!
I mounted the filesystem with the default options, assuming they would be sensible, but I now believe I should have specified the "inode64" mount option to avoid all the inodes being stuck in the first TB.
The filesystem however is at 87% and does not seem to have had any issues/problems.
df -h | grep raid
/dev/sda 51T 45T 6.7T 87% /raidstor
Another question: could I now safely remount with the "inode64" option, or will this cause problems in the future? I read the following in the XFS FAQ but wondered if it has been fixed (backported?) in el6.4?
""Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can't access files & dirs that have been created with an inode >32bit anymore.""
I also noted that "xfs_check" ran out of memory, and after some reading found that it is recommended to use "xfs_repair -n -vv" instead as it uses far less memory. Which raises the question: why is "xfs_check" there at all?
I do have the option of moving the data elsewhere and rebuilding but this would cause some problems. Any advice much appreciated.
Steve
On 2014-01-21, Steve Brooks steveb@mcs.st-and.ac.uk wrote:
mkfs.xfs -d su=512k,sw=14 /dev/sda
where "512k" is the Stripe-unit size of the single logical device built on the raid controller. "14" is from the total number of drives minus two (raid 6 redundancy).
The usual advice on the XFS list is to use the defaults where possible. But you might want to ask there to see if they have any specific advice.
I mounted the filesystem with the default options, assuming they would be sensible, but I now believe I should have specified the "inode64" mount option to avoid all the inodes being stuck in the first TB.
The filesystem however is at 87% and does not seem to have had any issues/problems.
df -h | grep raid
/dev/sda 51T 45T 6.7T 87% /raidstor
Wow, impressive! I know of a much smaller fs which got bit by this issue. What probably happened is, as a new fs, the entire first 1TB was able to be reserved for inodes.
Another question: could I now safely remount with the "inode64" option, or will this cause problems in the future? I read the following in the XFS FAQ but wondered if it has been fixed (backported?) in el6.4?
I have remounted a large XFS fs that previously didn't use inode64 with that option, and it went fine. (I did not attempt to roll back.) You *must* umount and remount for this option to take effect. I do not know when the inode64 option made it to CentOS, but it is there now.
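If it helps, the switch itself is just a remount (a sketch using the device and mount point from your df output; adjust if you mount by UUID or label):

  umount /raidstor
  mount -o inode64 /dev/sda /raidstor

and to make it persistent, add inode64 to the options field of the /etc/fstab entry, e.g.:

  /dev/sda   /raidstor   xfs   defaults,inode64   0 0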
I also noted that "xfs_check" ran out of memory, and after some reading found that it is recommended to use "xfs_repair -n -vv" instead as it uses far less memory. Which raises the question: why is "xfs_check" there at all?
The XFS team is working on deprecating it. But on a 51TB filesystem xfs_repair will still use a lot of memory. Using -P can help, but it'll still use quite a bit (depending on the extent of any damage and how many inodes, probably a bunch of other factors I don't know).
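Roughly what I'd run (a sketch; the -n pass is read-only and, with -vv, prints an estimate of the memory required, while a real repair needs the filesystem unmounted first):

  xfs_repair -n -vv /dev/sda    # dry run, no changes, reports estimated memory needed
  xfs_repair -P /dev/sda        # actual repair with prefetching disabled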
--keith
On Tue, 21 Jan 2014, Keith Keller wrote:
On 2014-01-21, Steve Brooks steveb@mcs.st-and.ac.uk wrote:
mkfs.xfs -d su=512k,sw=14 /dev/sda
where "512k" is the Stripe-unit size of the single logical device built on the raid controller. "14" is from the total number of drives minus two (raid 6 redundancy).
The usual advice on the XFS list is to use the defaults where possible. But you might want to ask there to see if they have any specific advice.
Thanks for the reply, Keith. Yes, I will ask on the list. I did read that mkfs.xfs is geared up to tune itself when built on mdadm raid devices, but with hardware raid it may take manual tuning.
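Just to illustrate what I mean (the md device name here is hypothetical):

  mkfs.xfs /dev/md0                       # mdadm raid: stripe geometry detected automatically
  mkfs.xfs -d su=512k,sw=14 /dev/sda      # hardware raid: stripe geometry given by hand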
I mounted the filesystem with the default options, assuming they would be sensible, but I now believe I should have specified the "inode64" mount option to avoid all the inodes being stuck in the first TB.
The filesystem however is at 87% and does not seem to have had any issues/problems.
df -h | grep raid
/dev/sda 51T 45T 6.7T 87% /raidstor
Wow, impressive! I know of a much smaller fs which got bit by this issue. What probably happened is, as a new fs, the entire first 1TB was able to be reserved for inodes.
Yes and the output of "df -i" shows only
Filesystem        Inodes  IUsed      IFree IUse%
/dev/sda      2187329088 189621 2187139467    1%
So few inodes are used because the data comes from HPC runs of MHD (magneto-hydrodynamic) simulations of the Sun; many of the files are snapshots of the simulation at various instants, "93G" or so in size.
Another question: could I now safely remount with the "inode64" option, or will this cause problems in the future? I read the following in the XFS FAQ but wondered if it has been fixed (backported?) in el6.4?
I have remounted a large XFS fs that previously didn't use inode64 with that option, and it went fine. (I did not attempt to roll back.) You *must* umount and remount for this option to take effect. I do not know when the inode64 option made it to CentOS, but it is there now.
OK, so for this filesystem I am sort of wondering if it is actually worth it, given that running out of inodes does not look like it will be an issue.
I also noted that "xfs_check" ran out of memory, and after some reading found that it is recommended to use "xfs_repair -n -vv" instead as it uses far less memory. Which raises the question: why is "xfs_check" there at all?
The XFS team is working on deprecating it. But on a 51TB filesystem xfs_repair will still use a lot of memory. Using -P can help, but it'll still use quite a bit (depending on the extent of any damage and how many inodes, probably a bunch of other factors I don't know).
Yes, this bothers me a bit. I issued an "xfs_repair -n -vv" and that told me I only needed "6G"; I guess with only a few inodes and a clean filesystem that makes sense. I did read a good solution on the XFS mailing list which seems really neat:
"Add an SSD of sufficient size/speed for swap duty to handle xfs_repair requirements for filesystems with arbitrarily high inode counts. Create a 100GB swap partition and leave the remainder unallocated. The unallocated space will automatically be used for GC and wear leveling, increasing the life of all cells in the drive."
Steve
Hi,
----- Original Message -----
| Hi All,
|
| I have been trying out XFS given it is going to be the file system of
| choice from upstream in el7. Starting with an Adaptec ASR71605 populated
| with sixteen 4TB WD enterprise hard drives. The OS is CentOS 6.4 x86_64
| and the machine has 64G of RAM.
Good! You're going to need it with a volume that large!
| This next part was not well researched, as I had a colleague bothering me
| late on Xmas Eve that he needed 14 TB immediately as a destination for
| data coming off an HPC cluster. I built an XFS file system straight onto
| the (raid 6) logical device made up of all sixteen drives with:
|
| > mkfs.xfs -d su=512k,sw=14 /dev/sda
|
| where "512k" is the Stripe-unit size of the single logical device built
| on the raid controller. "14" is from the total number of drives minus two
| (raid 6 redundancy).
Whoa! What kind of data are you writing to disk? I hope they're files that are typically large to account for such a large stripe unit or you're going to lose a lot of the performance benefits. It will write quite a bit of data to an individual drive in the RAID this way.
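To put a rough number on it: with su=512k and sw=14, a full data stripe is 512k x 14 = 7MB (plus two parity chunks), so writes much smaller than that end up landing on only one or two member drives.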
| Any comments on the above from XFS users would be helpful!
|
| I mounted the filesystem with the default options, assuming they would be
| sensible, but I now believe I should have specified the "inode64" mount
| option to avoid all the inodes being stuck in the first TB.
|
| The filesystem however is at 87% and does not seem to have had any
| issues/problems.
|
| > df -h | grep raid
| /dev/sda 51T 45T 6.7T 87% /raidstor
|
| Another question: could I now safely remount with the "inode64" option,
| or will this cause problems in the future? I read the following in the
| XFS FAQ but wondered if it has been fixed (backported?) in el6.4?
|
| "Starting from kernel 2.6.35, you can try and then switch back. Older
| kernels have a bug leading to strange problems if you mount without
| inode64 again. For example, you can't access files & dirs that have been
| created with an inode >32bit anymore."
Changing to inode64 and back is no problem. Keep in mind that inode64 may not work with clients running older operating systems. This bit us when we had a mixture of Solaris 8/9 clients.
| I also noted that "xfs_check" ran out of memory, and after some reading
| found that it is recommended to use "xfs_repair -n -vv" instead as it
| uses far less memory. Which raises the question: why is "xfs_check" there
| at all?
That's because it didn't do anything. Trust me, when you actually go and run xfs_{check,repair} without the -n flag, you're gonna need A LOT of memory. For example, an 11TB file system that held medical imaging data used 24GB of memory for an xfs_repair. Good luck!
As for why xfs_check is there, there are various reasons. For example, it's your go-to program for spotting quota issues; we've had a couple of quota problems that xfs_check pointed out so that we could then run xfs_repair. Keep in mind that xfs_checks are not run on unclean shutdowns: the XFS log is merely replayed, and you're advised to run xfs_check to validate the file system's consistency.
| I do have the option of moving the data elsewhere and rebuilding but this
| would cause some problems. Any advice much appreciated.
Do you REALLY need it to be a single volume that is so large?
On 2014-01-21, James A. Peltier jpeltier@sfu.ca wrote:
Changing to inode64 and back is no problem. Keep in mind that inode64 may not work with clients running older operating systems. This bit us when we had a mixture of Solaris 8/9 clients.
I assume you are referring to NFS specifically; here's the relevant FAQ entry for those who haven't seen it:
http://www.xfs.org/index.php/XFS_FAQ#Q:_Why_doesn.27t_NFS-exporting_subdirec...
I decided to export my filesystem's root, but different setups may find it easier to use fsid=uuid instead.
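For example, exporting the root might look roughly like this in /etc/exports (the client name, the data subdirectory, and the fsid value are all placeholders); the explicit fsid (or fsid=uuid) is what matters for subdirectory exports:

  /raidstor         client.example.com(rw,no_subtree_check)
  # or, for a subdirectory export:
  /raidstor/data    client.example.com(rw,no_subtree_check,fsid=101)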
--keith