Hi all,
I was recently charged with configuring a new fairly large (24x3TB disks) fileserver for my group. I think I know mostly what I want to do with it, but I did have two questions, at least one of which is directly related to CentOS.
1) The controller node has two 90GB SSDs that I plan to use as a bootable RAID1 system disk. What is the preferred method for laying out the RAID array? I found this document on the wiki:
http://wiki.centos.org/HowTos/Install_On_Partitionable_RAID1
But that seems somewhat nonstandard. From what I've read in the RHEL6 docs, the anaconda-supported RAID1 install is the alternative the wiki mentions: partition each disk and add the matching partitions to the appropriate RAID1 arrays.
So, is there a happy medium, where anaconda more directly supports the partitionable RAID1 install method? And if so, what are the drawbacks to such a configuration? The wiki talks about the advantages but doesn't really address any disadvantages.
2) With large arrays you often hear about "aligning the filesystem to the disk". Is there a fairly standard way (I hope using only CentOS tools) of going about this? Are the various mkfs tools smart enough to figure out how an array is aligned on its own, or is sysadmin intervention required on such large arrays? (If it helps any, the disk array is backed by a 3ware 9750 controller. I have not yet decided how many disks I will use in the array, if that influences the alignment.)
--keith
On 10/01/12 8:39 PM, Keith Keller wrote:
The controller node has two 90GB SSDs that I plan to use as a bootable RAID1 system disk. What is the preferred method for laying out the RAID array?
a server makes very little use of its system disks after it's booted; everything it needs ends up in cache pretty quickly, and you typically don't reboot a server very often. why waste SSDs on that?
I'd rather use the SSDs for something like LSI Logic's CacheCade v2 (but this requires an LSI SAS raid card too)
- With large arrays you often hear about "aligning the filesystem to
the disk". Is there a fairly standard way (I hope using only CentOS tools) of going about this? Are the various mkfs tools smart enough to figure out how an array is aligned on its own, or is sysadmin intervention required on such large arrays? (If it helps any, the disk array is backed by a 3ware 9750 controller. I have not yet decided how many disks I will use in the array, if that influences the alignment.)
I would suggest not using more than 10-11 disks in a single raid group or the rebuild times get hellaciously long (an 11 x 3TB SAS2 RAID6 took 12 hours to rebuild when I ran tests). if this is for nearline bulk storage, I'd use 2 disks as hot spares and build 2 separate RAID5 or RAID6 groups of 11 disks, then stripe those together so it's raid 5+0 or 6+0. if this is for higher-performance storage, I would build mirrors and stripe them (raid 1+0)
re: alignment, use the whole disks, without partitioning. then there's no alignment issues. use a raid block size of like 32k. if you need multiple file systems, put the whole mess into a single LVM vg, and create your logical volumes in lvm.
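for example, here's a rough sketch of what I mean (the device names, VG/LV names, and the 256k stripe size are just placeholders -- substitute whatever units your 3ware card actually exports):

  pvcreate /dev/sdb /dev/sdc            # two RAID6 units exported by the 3ware card, no partition tables
  vgcreate datavg /dev/sdb /dev/sdc     # one VG holding the whole array
  lvcreate -n export -i 2 -I 256k -l 100%FREE datavg   # stripe the LV across both PVs, i.e. RAID 6+0
  mkfs.xfs /dev/datavg/export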
Hi :)
On Tue, Oct 2, 2012 at 6:57 AM, John R Pierce pierce@hogranch.com wrote:
On 10/01/12 8:39 PM, Keith Keller wrote:
The controller node has two 90GB SSDs that I plan to use as a bootable RAID1 system disk. What is the preferred method for laying out the RAID array?
a server makes very little use of its system disks after it's booted; everything it needs ends up in cache pretty quickly, and you typically don't reboot a server very often. why waste SSDs on that?
I'd rather use the SSDs for something like LSI Logic's CacheCade v2 (but this requires an LSI SAS raid card too)
Just to add to this comment: you can also use the SSD drives to store the logs/journals/metadata/whatever_you_call_it.
As an example, with XFS you would use the -l option.
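For instance, something along these lines (just a sketch; the device names are hypothetical, and the external log device has to stay within XFS's log size limits):

  mkfs.xfs -l logdev=/dev/md1,size=128m /dev/datavg/export
  mount -o logdev=/dev/md1 /dev/datavg/export /export

The log then lives on the SSD-backed /dev/md1 while the data stays on the big array.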
Rafa
On 02.10.2012 08:25, Rafa Griman wrote:
Just to add to this comment: you can also use the SSD drives to store the logs/journals/metadata/whatever_you_call_it.
As an example, with XFS you would use the -l option.
Rafa
I'd use the SSDs for bcache/flashcache.
On Tue, Oct 2, 2012 at 12:59 AM, Nux! nux@li.nux.ro wrote:
I'd use the SSDs for bcache/flashcache.
Try kmod-flashcache [1] and flashcache-utils [2] from ELRepo. They are still in the testing repository but seem to work well. Some testimonials and an additional package by John Newbigin can be found here [3].
Akemi
[1] http://elrepo.org/tiki/kmod-flashcache [2] http://elrepo.org/tiki/flashcache-utils [3] https://groups.google.com/forum/?fromgroups=#!topic/flashcache-dev/sHnurG502...
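A rough way to try them out (assuming the ELRepo repo file is already installed; the elrepo-testing repo id and the device names below are my assumptions, not anything from the packages themselves):

  yum --enablerepo=elrepo-testing install kmod-flashcache flashcache-utils
  # writeback cache: SSD partition (/dev/sdc1) in front of the big array (/dev/sdb)
  flashcache_create -p back cachedev /dev/sdc1 /dev/sdb
  mkfs.xfs /dev/mapper/cachedev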
On 10/02/12 2:10 AM, Akemi Yagi wrote:
On Tue, Oct 2, 2012 at 12:59 AM, Nux! nux@li.nux.ro wrote:
I'd use the SSDs for bcache/flashcache.
Try kmod-flashcache [1] and flashcache-utils [2] from ELRepo. They are still in the testing repository but seem to work well. Some testimonials and an additional package by John Newbigin can be found here [3].
I'm looking for those, but not seeing them...
# yum list --enablerepo=epel-testing kmod-flashcache
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirrors.ecvps.com
 * epel: mirrors.kernel.org
 * epel-testing: mirrors.kernel.org
 * extras: linux.mirrors.es.net
 * updates: mirrors.easynews.com
Error: No matching Packages to list

# cat /etc/redhat-release
CentOS release 6.3 (Final)

# uname -a
Linux xxxxx.xxx.domain.com 2.6.32-279.9.1.el6.x86_64 #1 SMP Tue Sep 25 21:43:11 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
On 2012-10-02, John R Pierce pierce@hogranch.com wrote:
a server makes very little use of its system disks after it's booted; everything it needs ends up in cache pretty quickly, and you typically don't reboot a server very often. why waste SSDs on that?
I think the impetus (which I wasn't totally on top of) was to maximize the number of drive bays in the controller node. So the bays are 2.5" instead of 3.5", and finding 2.5" "enterprise" SATA drives is fairly nontrivial from what I can tell. I don't actually need 8 2.5" drive bays, so that was an oversight on my part.
After reading the SSD/RAID docs that John Doe posted, I am a little concerned, but I think I will use these disks as I originally planned, and if they fail too quickly, find some 2.5" magnetic drives and RAID1 them instead. I may also end up putting /tmp, /var, and swap on the disk array instead of on the SSD array, and use the SSD array for just the seldom-written parts of the OS (e.g., /boot, /usr, /usr/local). If I do that, I should be able to alleviate any issues with excessive writes to the SSDs.
I am not sure what drives I have, but I have seen claims of "enterprise" SSDs which are designed to be up 24/7 and be able to tolerate more writes before fatiguing. Has anyone had experience with these drives?
re: alignment, use the whole disks, without partitioning. then there's no alignment issues. use a raid block size of like 32k. if you need multiple file systems, put the whole mess into a single LVM vg, and create your logical volumes in lvm.
So, something like mkfs.xfs will be able to determine the proper stride and stripe settings from whatever the 3ware controller presents? (The controller of course uses whole disks, not partitions.) From reading other sites and lists I had the (perhaps mistaken) impression that this was a delicate operation, and not getting it exactly correct would cause performance issues, possibly set fire to the entire data center, and even cause the next big bang.
--keith
Hi :)
On Tue, Oct 2, 2012 at 8:47 PM, Keith Keller kkeller@wombat.san-francisco.ca.us wrote:
On 2012-10-02, John R Pierce pierce@hogranch.com wrote:
a server makes very little use of its system disks after it's booted; everything it needs ends up in cache pretty quickly, and you typically don't reboot a server very often. why waste SSDs on that?
I think the impetus (which I wasn't totally on top of) was to maximize the number of drive bays in the controller node. So the bays are 2.5" instead of 3.5", and finding 2.5" "enterprise" SATA drives is fairly nontrivial from what I can tell. I don't actually need 8 2.5" drive bays, so that was an oversight on my part.
After reading the SSD/RAID docs that John Doe posted, I am a little concerned, but I think I will use these disks as I originally planned, and if they fail too quickly, find some 2.5" magnetic drives and RAID1 them instead. I may also end up putting /tmp, /var, and swap on the disk array instead of on the SSD array, and use the SSD array for just the seldom-written parts of the OS (e.g., /boot, /usr, /usr/local). If I do that, I should be able to alleviate any issues with excessive writes to the SSDs.
If it works for you ... I mean, there's no perfect partition scheme (IMHO); it depends greatly on what you do, your budget, workflow, file sizes, ... So if you're happy with this, go ahead. Just some advice: test a couple of different options first, just in case ;)
I am not sure what drives I have, but I have seen claims of "enterprise" SSDs which are designed to be up 24/7 and be able to tolerate more writes before fatiguing. Has anyone had experience with these drives?
re: alignment, use the whole disks, without partitioning. then there's no alignment issues. use a raid block size of like 32k. if you need multiple file systems, put the whole mess into a single LVM vg, and create your logical volumes in lvm.
So, something like mkfs.xfs will be able to determine the proper stride and stripe settings from whatever the 3ware controller presents?
Yup, even though you've got the sw and su options in case you want to play around ... With XFS, you shouldn't have to use su and sw ... in fact you shouldn't have to use many options since it tries to autodetect and use the best options. Check the XFS FAQ.
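If you do ever want to set them by hand it's just something like this (a hedged example: the 64k stripe unit and sw=9 assume an 11-disk RAID6, i.e. 9 data disks -- plug in your own 3ware numbers):

  mkfs.xfs -d su=64k,sw=9 /dev/sdb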
(The controller of course uses whole disks, not partitions.) From reading other sites and lists I had the (perhaps mistaken) impression that this was a delicate operation, and not getting it exactly correct would cause performance issues, possibly set fire to the entire data center, and even cause the next big bang.
Nope, just mass extinction of the Human Race. Nothing to worry about.
HTH
Rafa
On 2012-10-03, Rafa Griman rafagriman@gmail.com wrote:
If it works for you ... I mean, there's no perfect partition scheme (IMHO); it depends greatly on what you do, your budget, workflow, file sizes, ... So if you're happy with this, go ahead. Just some advice: test a couple of different options first, just in case ;)
Well, given the warnings about SSD endurance, I didn't want to do excessive testing and contribute to faster wear. But I've been reading around, and perhaps I'm just overreacting. For example:
http://www.storagesearch.com/ssdmyths-endurance.html
This article talks about RAID1 potentially being better for increasing SSD lifetime, despite the full write that mdadm will want to do.
So. For now, let's just pretend that these disks are not SSDs, but regular magnetic disks. Do people have preferences for either of the methods for creating a bootable RAID1 I mentioned in my OP? I like the idea of using a partitionable RAID, but the instructions seem cumbersome. The anaconda method is straightforward, but simply creates RAID1 partitions, AFAICT, which is fine till a disk needs to be replaced, then gets slightly annoying.
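Just so we're talking about the same thing, here's how I understand the two approaches (device names hypothetical, and I'm going from the wiki for the --auto=mdp form, so treat it as a sketch):

  # anaconda-style: partition each SSD, then RAID1 the matching partitions
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  # wiki-style: RAID1 the whole disks, then partition the resulting md device
  mdadm --create /dev/md_d0 --auto=mdp --level=1 --raid-devices=2 /dev/sda /dev/sdb

With the second form you partition /dev/md_d0 afterwards, so a replacement disk only needs to be added back to the one array.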
Yup, even though you've got the sw and su options in case you want to play around ... With XFS, you shouldn't have to use su and sw ... in fact you shouldn't have to use many options since it tries to autodetect and use the best options. Check the XFS FAQ.
Well, I'm also on the XFS list, and there are varying opinions on this. From what I can tell, most XFS experts suggest just as you do--don't second-guess mkfs.xfs, and let it do what it thinks is best. That's certainly what I've done in the past. But there's a vocal group of posters who think this is incredibly foolish and strongly suggest determining these numbers on your own. If there were a straightforward way to do this with standard CentOS tools (well, plus tw_cli if needed) then I could try both methods and see which worked better. John Doe suggested a guideline which I may try out. But my gut instinct is that I shouldn't try to second-guess mkfs.xfs.
Nope, just mass extinction of the Human Race. Nothing to worry about.
So, it's a win-win? ;-)
--keith
Hi :)
On Wed, Oct 3, 2012 at 8:01 PM, Keith Keller kkeller@wombat.san-francisco.ca.us wrote:
On 2012-10-03, Rafa Griman rafagriman@gmail.com wrote:
If it works for you ... I mean, there's no perfect partition scheme (IMHO); it depends greatly on what you do, your budget, workflow, file sizes, ... So if you're happy with this, go ahead. Just some advice: test a couple of different options first, just in case ;)
Well, given the warnings about SSD endurance, I didn't want to do excessive testing and contribute to faster wear. But I've been reading around, and perhaps I'm just overreacting. For example:
As with all new technologies ... starting off is complicated. SSD vendors have developed new "strategies" (leveraging existing technology, like over-provisioning the real amount of flash on the SSD) and new algorithms, so they're working on it ;)
This article talks about RAID1 potentially being better for increasing SSD lifetime, despite the full write that mdadm will want to do.
So. For now, let's just pretend that these disks are not SSDs, but regular magnetic disks. Do people have preferences for either of the methods for creating a bootable RAID1 I mentioned in my OP? I like the idea of using a partitionable RAID, but the instructions seem cumbersome. The anaconda method is straightforward, but simply creates RAID1 partitions, AFAICT, which is fine till a disk needs to be replaced, then gets slightly annoying.
Yup, even though you've got the sw and su options in case you want to play around ... With XFS, you shouldn't have to use su and sw ... in fact you shouldn't have to use many options since it tries to autodetect and use the best options. Check the XFS FAQ.
Well, I'm also on the XFS list, and there are varying opinions on this. From what I can tell, most XFS experts suggest just as you do--don't second-guess mkfs.xfs, and let it do what it thinks is best. That's certainly what I've done in the past. But there's a vocal group of posters who think this is incredibly foolish and strongly suggest determining these numbers on your own. If there were a straightforward way to do this with standard CentOS tools (well, plus tw_cli if needed) then I could try both methods and see which worked better. John Doe suggested a guideline which I may try out. But my gut instinct is that I shouldn't try to second-guess mkfs.xfs.
As always, if you know what you're doing ... feel free to define the parameters/options ;) Oh, and if you've got the time to test different options/values ;) If you know how your app writes/reads to disk, how the RAID cache works, ... you can probably define better options/values ... but that takes a lot of time, testing and reading. XFS' default options might be a bit more conservative, but at least you know they "work".
You have probably seen some XFS list member get "scolded" for messing around with AG (allocation group) settings (or other options) and then saying performance has dropped. I don't usually mess around with the options and just let mkfs decide ... after all, the XFS devs spend more time benchmarking, reading and testing than I do ;)
I've been using XFS for a long time and I'm very happy with how it works out of the box (YMMV).
Nope, just mass extinction of the Human Race. Nothing to worry about.
So, it's a win-win? ;-)
Definitely :D
Rafa
From: Keith Keller kkeller@wombat.san-francisco.ca.us
- The controller node has two 90GB SSDs that I plan to use as a
bootable RAID1 system disk. What is the preferred method for laying out the RAID array?
See the "Deployment Considerations" about SSDs and RAID: https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/ht...
- With large arrays you often hear about "aligning the filesystem to
the disk". Is there a fairly standard way (I hope using only CentOS tools) of going about this? Are the various mkfs tools smart enough to figure out how an array is aligned on its own, or is sysadmin intervention required on such large arrays? (If it helps any, the disk array is backed by a 3ware 9750 controller. I have not yet decided how many disks I will use in the array, if that influences the alignment.)
From memory: for alignment, the first partition starts at sector 2048. For the filesystem, call mkfs (ext3/ext4) with the appropriate -E stride=xxx,stripe-width=yyy options:
  stride = RAID stripe (chunk) size in KB / FS block size in KB
  stripe-width = stride * number of data-holding disks in the RAID (for example, n-2 for RAID6)
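A worked example with assumed numbers (64KB RAID chunk size, 4KB filesystem blocks, 11-disk RAID6 so 9 data-holding disks):

  stride       = 64 / 4 = 16
  stripe-width = 16 * 9 = 144
  mkfs.ext4 -E stride=16,stripe-width=144 /dev/sdb1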
JD