Hello all,
I have just configured a 64-bit CentOS 5.5 machine to support an XFS filesystem as specified in the subject line. The filesystem will be used to store an extremely large number of files (in the tens of millions). Due to its extremely large size, would there be any non-standard XFS build/configuration options I should consider?
Thanks.
Boris.
Hi Boris,
On 09/29/2010 02:00 PM, Boris Epstein wrote:
I have just configured a 64-bit CentOS 5.5 machine to support an XFS
I don't have any specific hints for you - but when you are done, a page on the CentOS wiki would be nice to have, with the challenges and options you had to work through, along with any recommendations!
Thanks in advance. :)
- KB
On Wednesday 29 September 2010, Boris Epstein wrote:
I have created and tested filesystems larger than 25T using xfs on CentOS-5 (64-bit). I did not use any non-standard options. Do not attempt this on a 32-bit box.
However, given the size of the device I assume that this is a raid of some sort. You'll want to make sure to run mkfs.xfs with the proper stripe parameters to get the alignment right. Also, you may want to make sure your LVM or partition table is properly aligned.
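As a concrete illustration (a sketch only, not taken from the posts above): for a hypothetical 16-drive RAID6 with a 128 KiB chunk size, i.e. 14 data drives, the geometry could be passed to mkfs.xfs like this, with the device name being a placeholder:

  # su = RAID chunk size, sw = number of data spindles (16 drives - 2 parity)
  mkfs.xfs -d su=128k,sw=14 /dev/sdX1

  # equivalent form in 512-byte sectors:
  #   sunit = 128 KiB / 512 B = 256,  swidth = 256 * 14 = 3584
  mkfs.xfs -d sunit=256,swidth=3584 /dev/sdX1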
Even with the above done right you may still get worse performance than expected, since "lots of small files" typically translates to "terrible performance".
Finally I'd suggest you fill the filesystem and read it back (verifying what you wrote). This is, imho, a reasonable level of paranoia.
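One possible shape for that fill-and-verify pass (just a sketch; the mount point, file count and sizes are made up): write files with known content, checksum them, flush the page cache, and compare on read-back.

  # write a pile of test files with random content (mount point is hypothetical)
  for i in $(seq 1 10000); do
      dd if=/dev/urandom of=/mnt/bigxfs/testfile.$i bs=1M count=10 2>/dev/null
  done

  # record checksums, push everything out of the page cache, then re-read and verify
  find /mnt/bigxfs -name 'testfile.*' -exec md5sum {} + > /root/testfile.md5
  sync && echo 3 > /proc/sys/vm/drop_caches
  md5sum -c /root/testfile.md5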
/Peter
----- Original Message -----
| However, given the size of the device I assume that this is a raid of some
| sort. You'll want to make sure to run mkfs.xfs with the proper stripe
| parameters to get the alignment right. Also, you may want to make sure your
| LVM or partition table is properly aligned.
| [snip]
| /Peter
On my 30+TB file systems all I've done is mkfs.xfs with stripe and width parameters and they are very speedy. I've not done anything on the LVM side and see no performance issues, but perhaps I need to investigate that some more. :\
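For anyone wanting to double-check what geometry an existing filesystem was actually created with, xfs_info reports it (the device path below is just a placeholder; sunit/swidth are shown in filesystem blocks, typically 4 KiB):

  xfs_info /dev/mapper/bigvg-bigdata
  # in the data section look for something like "sunit=32 swidth=448 blks":
  # 32 blks * 4 KiB = 128 KiB chunk, 448 blks * 4 KiB = 1792 KiB stripe width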
--
James A. Peltier
Systems Analyst (FASNet), VIVARIUM Technical Director
Simon Fraser University - Burnaby Campus
Phone   : 778-782-6573
Fax     : 778-782-3045
E-Mail  : jpeltier@sfu.ca
Website : http://www.fas.sfu.ca | http://vivarium.cs.sfu.ca
MSN     : subatomic_spam@hotmail.com
Does your OS have a man 8 lart? http://www.xinu.nl/unix/humour/asr-manpages/lart.html
On Wed, Sep 29, 2010 at 11:53 AM, James A. Peltier jpeltier@sfu.ca wrote:
Thanks James!
I am wondering if I need to worry about stripe and width, though, as mine sits on a logical volume on top of a hardware-controlled RAID 6 device (i.e., one slice as far as the OS is concerned).
Boris.
I am wondering if I need to worry about stripe and width, though, as mine sits on a logical volume on top of a hardware-controlled RAID 6 device (i.e., one slice as far as the OS is concerned).
25 TB on a single volume, not distributed? Huh, let me know how long that takes to check the first time something sh!ts the bed and, instead of just a portion of a distributed data set dropping out of users' reach, it *all* does :)
On Wed, Sep 29, 2010 at 12:12 PM, Joseph L. Casale jcasale@activenetwerx.com wrote:
It is a single logical volume, spread over 16 physical disks controlled by a hardware RAID controller. Dunno - we have a similar setup in two other machines (smaller disks, same idea) that has been going strong for over 3 years now.
Boris.
On Wednesday 29 September 2010, Boris Epstein wrote:
I am wondering if I need to worry about stripe and width, though, as mine sits on a logical volume on top of a hardware-controlled RAID 6 device (i.e., one slice as far as the OS is concerned).
That is why you need to consider it. If the device is aligned on the stripe size (chunk size * (number of drives - 2 for raid6 parity)) and the filesystem is made aware of it, it can place things (files, metadata, etc.) so that a minimum number of stripes is touched (less I/O done).
/Peter
On Wed, Sep 29, 2010 at 12:58 PM, Peter Kjellstrom cap@nsc.liu.se wrote:
Well, you are interfering with the hardware RAID controller, which copies around and stripes data as it sees fit. I am not sure that, with this many levels of abstraction, I can gain any measurable performance improvement by adjusting XFS to the controller's hypothetical behaviour.
Boris.
On Wednesday 29 September 2010, Boris Epstein wrote:
You are a bit mistaken. The raid controller does not "copy data around as it sees fit". It stores data on each disk in chunk-sized pieces. It then stripes these across all drives, giving you a stripe whose size is the chunk size times the number of data drives.
Typical chunk sizes are 16, 32, 64, 128 and 256 KiB. If you created your raid-set with, say, a 128 KiB chunk size and 16 physical drives, this gives you a stripe size of:
128 * (16 - 2) => 1792 KiB
Having the filesystem align its structures to this can (depending, of course, on the workload) make a huge difference. But you won't be able to do this if your device isn't already aligned (unaligned use of partitions and/or LVM).
Then again, for other workloads the effect could be insignificant. YMMV.
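As a rough sketch of what "already aligned" could look like in practice, sticking with the 16-drive / 128 KiB example above (device names are placeholders, and pvcreate's --dataalignment option needs a reasonably recent LVM2):

  # start the partition on a stripe-width boundary: 1792 KiB / 512 B = 3584 sectors
  parted /dev/sdX mklabel gpt
  parted /dev/sdX mkpart primary 3584s 100%

  # have LVM place its data area (and thus extent 0) on the same boundary
  pvcreate --dataalignment 1792k /dev/sdX1
  vgcreate bigvg /dev/sdX1
  lvcreate -l 100%FREE -n bigdata bigvg

  # then tell the filesystem about the geometry
  mkfs.xfs -d su=128k,sw=14 /dev/bigvg/bigdata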
/Peter
On Wednesday, September 29, 2010 01:25:11 pm Peter Kjellstrom wrote:
For a simple RAID controller I can see some benefit.
However, in my case the 'RAID controller' is on SAN, consisting of three EMC Clariion arrays: a CX3-10c, a CX3-80, and a CX700. The EMC Navisphere/Unisphere tools allow LUN migration across RAID groups; I could very well take a LUN from a RAID1/0 with 16 drives to a RAID5 with 9 drives to a RAID6 with 10 drives to a RAID6 with 16 drives and have different stripe sizes. Further, since this is all being accessed through VMware ESX, I'm limited to 2TB LUNs anyway, even using raw device mappings, which I do, but for a different reason; LVM to the rescue to get this:

[root@backup-rdc ~]# df -h
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00      37G   18G   18G  50% /
/dev/sda1                            99M   26M   69M  28% /boot
/dev/mapper/dasch--backup-volume1    21T   19T  2.6T  88% /opt/backups
tmpfs                              1006M     0 1006M   0% /dev/shm
/dev/mapper/dasch--rdc-cx3--80       23T   19T  4.2T  82% /opt/dasch-rdc
[root@backup-rdc ~]#
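Roughly speaking, the LVM side of stitching several 2TB LUNs into one large filesystem looks like the sketch below (device names, the volume group name, and the mount point are placeholders, not the actual configuration):

  # one PV per 2TB LUN (each LUN shows up as its own block device)
  pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde

  # one VG spanning all of them, one LV using the whole space, XFS on top
  vgcreate backupvg /dev/sdb /dev/sdc /dev/sdd /dev/sde
  lvcreate -l 100%FREE -n volume1 backupvg
  mkfs.xfs /dev/backupvg/volume1

  # growing later: add another LUN, extend the LV, then grow XFS while mounted
  pvcreate /dev/sdf
  vgextend backupvg /dev/sdf
  lvextend -l +100%FREE /dev/backupvg/volume1
  xfs_growfs /backups        # xfs_growfs takes the mount point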
Yeah, the output of pvscan is pretty long (it has been longer, and seeing things like /dev/sdak1 is strange....).
Using XFS at the moment. The two volume groups are on two different arrays; one is on the CX700 and the other on the CX3-80, and they're physically separated at two locations on-campus, with single-mode 4Gb/s FC ISLs between switches. They're soon to be connected to different VMware ESX hosts; the dual fibre-channel connection was so the initial sync time would be reasonable.
I looked through all the performance optimization howtos for XFS that I could find, but then realized how futile that would be with these 'RAID controllers' and their massive caches. (Our CX3-80 SPs have 8GB of RAM each, holding the shared write cache and the variable-sized read cache, which I have set to a rather large size on our CX3-80: 3GB on each SP for read and 2GB for write; the CX700 has 4GB (actually 3968MB), split 1GB read / 2GB write.) The benchmarks that I did (which I can't release due to both EMC's and VMware's EULA prohibitions) showed that the performance differences with alignment versus without were insignificant on these 'RAID controllers'.
But for something inside the server, like a 3ware 9500 or similar, it might be worthwhile to align to stripe size, since that is a fixed constant for the logical drives that controller exports.
And Peter is very right: YMMV depending upon workload. Our load for this system is, as can be inferred from the name of the machine, backups of a raw data set that are processed once and then archived. I/Os per second aren't even on the radar for this workload; throughput, on the other hand, is. And man, these Clariions are fast.
On Sep 29, 2010, at 2:53 PM, Lamar Owen lowen@pari.edu wrote:
For sequential IO you won't notice any impact from misalignment, but for random IO it could be a 25-33% loss.
I'm sure EMC has white papers posted on aligning volumes for Exchange/SQL as well as VMware.
The 8GB cache only goes so far... Get enough server connections, or a couple of sequential IO hogs like yours, and the cache effect disappears quickly.
Often the misalignment starts at the initiator and travels to the target: the initiator needs to read two blocks because it is off by one sector, but then the target needs to read two chunks because one of those blocks crosses a chunk boundary, and so on.
-Ross