Hey folks,
CentOS / PostgreSQL shop over here.
I'm hitting 3 of my favorite lists with this, so here's hoping that the BCC trick is the right way to do it :-)
We've just discovered, thanks to a new Munin plugin (http://blogs.amd.co.at/robe/2008/12/graphing-linux-disk-io-statistics-with-munin.html), that our production DB is completely maxing out in I/O for about a 3-hour stretch from 6am til 9am. This is "device utilization" as per the last graph at the above link.
Load went down for a while but is now between 70% and 95% sustained. We've only had this plugin going for less than a day so I don't really have any more data going back further. But we've suspected a disk issue for some time - just have not been able to prove it.
Our system:
IBM 3650 - quad 2Ghz e5405 Xeon
8K SAS RAID Controller
6 x 300G 15K/RPM SAS Drives
/dev/sda - 2 drives configured as a RAID 1 for 300G for the OS
/dev/sdb - 3 drives configured as RAID5 for 600G for the DB
1 drive as a global hot spare
/dev/sdb is the one that is maxing out.
We need to have a very serious look at fixing this situation. But we don't have the money to be experimenting with solutions that won't solve our problem. And our budget is fairly limited.
Is there a public library somewhere of disk subsystems and their performance figures? Done with some semblance of a standard benchmark?
One benchmark I am partial to is this one: http://wiki.postgresql.org/wiki/PgCon_2009/Greg_Smith_Hardware_Benchmarking_notes#dd_test
One thing I am thinking of in the immediate term is taking the RAID5 + hot spare and converting it to RAID10 with the same amount of storage. Will that perform much better?
In general we are planning to move away from RAID5 toward RAID10.
We also have on order an external IBM array (don't have the exact name on hand but model number was 3000) with 12 drive bays. We ordered it with just 4 x SATAII drives, and were going to put it on a different system as a RAID10. These are just 7200 RPM drives - the goal was cheaper storage because the SAS drives are about twice as much per drive, and it is only a 300G drive versus the 1T SATA2 drives. IIRC the SATA2 drives are about $200 each and the SAS 300G drives about $500 each.
So I have 2 thoughts with this 12-disk array. One is to fill it up with 12 x cheap SATA2 drives and hope that even though the spin rate is a lot slower, the fact that it has more drives will make it perform better. But somehow I am doubtful about that. The other thought is to bite the bullet and fill it up with 300G SAS drives.
Any thoughts here? Recommendations on what to do with a tight budget? It could be the answer is that I just have to go back to the bean counters and tell them we have no choice but to start spending some real money. But on what? And how do I prove that this is the only choice?
On Fri, Oct 09, 2009 at 12:45:14PM -0400, Alan McKay wrote:
Hey folks,
CentOS / PostgreSQL shop over here.
I'm hitting 3 of my favorite lists with this, so here's hoping that the BCC trick is the right way to do it :-)
We've just discovered, thanks to a new Munin plugin (http://blogs.amd.co.at/robe/2008/12/graphing-linux-disk-io-statistics-with-munin.html), that our production DB is completely maxing out in I/O for about a 3-hour stretch from 6am til 9am. This is "device utilization" as per the last graph at the above link.
Load went down for a while but is now between 70% and 95% sustained. We've only had this plugin going for less than a day so I don't really have any more data going back further. But we've suspected a disk issue for some time - just have not been able to prove it.
Really hard to say what's going on. Does your DB need optimization? Do the applications hitting it? Maybe some indexing? Maybe some more RAM on the machine would help? What exactly is the workload like -- especially during the time when you're peaked out?
Is the system swapping? If so, you either need more memory or need to track down a memory leak.... 'free' and 'sar' can both help you see what swap usage is like.
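If you want something you can leave running from cron alongside free/sar, a quick Python sketch like this (just reading /proc directly, the same numbers free reports) will tell you whether swap is actually being touched:

#!/usr/bin/env python
# Rough swap check straight from /proc -- the same numbers 'free' and
# 'vmstat' report.  pswpin/pswpout are cumulative since boot, so compare
# two runs some minutes apart to see whether swapping is happening *now*.

def read_kv(path):
    vals = {}
    for line in open(path):
        parts = line.split()
        if len(parts) >= 2:
            vals[parts[0].rstrip(':')] = parts[1]
    return vals

mem = read_kv('/proc/meminfo')    # values in kB
vm = read_kv('/proc/vmstat')      # cumulative page counts

swap_total = int(mem['SwapTotal'])
swap_used = swap_total - int(mem['SwapFree'])

print("swap used: %d kB of %d kB" % (swap_used, swap_total))
print("pages swapped in  since boot: %s" % vm.get('pswpin', 'n/a'))
print("pages swapped out since boot: %s" % vm.get('pswpout', 'n/a'))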
It would be interesting to know which processes are running and consuming IO during this peak period as well. top could probably give you an "OK" picture, but something like iotop or SystemTap could tell you a lot more (unfortunately you'll have to wait for 5.4 to get that functionality I believe).
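If you can't wait for iotop, something like the sketch below gets you a poor man's version from /proc/<pid>/io -- with the caveat that that file only exists if the kernel has per-task I/O accounting compiled in (which I don't believe the stock CentOS kernel does yet), and the counters are cumulative since each process started, not rates:

#!/usr/bin/env python
# Poor man's iotop: rank processes by bytes read/written according to
# /proc/<pid>/io.  Requires kernel task I/O accounting (may be missing on
# older CentOS kernels) and root to see other users' processes.
import os

rows = []
for pid in os.listdir('/proc'):
    if not pid.isdigit():
        continue
    try:
        io = dict(l.split(':') for l in open('/proc/%s/io' % pid))
        cmd = open('/proc/%s/cmdline' % pid).read().replace('\0', ' ').strip()
        rows.append((int(io['write_bytes']), int(io['read_bytes']), pid, cmd))
    except (IOError, KeyError, ValueError):
        continue

rows.sort(reverse=True)
print("%14s %14s %6s %s" % ("write_bytes", "read_bytes", "pid", "command"))
for wr, rd, pid, cmd in rows[:15]:
    print("%14d %14d %6s %s" % (wr, rd, pid, cmd[:60]))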
Writes are always slower on any parity based RAID setup, so I imagine you'd get superior performance on RAID10, especially if you're write heavy.
But to begin with, it'd be interesting to know exactly what this server is doing. Does it make sense that the disks are being brought to their knees with the given workload?
Is the disk array you bought an N-series? (N3300, N3600)? If so, those are NetApps and should be quite fast thanks to heavy write caching. Even then, you'll be limited by spindles it sounds like...
Ray
On Fri, Oct 9, 2009 at 1:28 PM, Ray Van Dolson rayvd@bludgeon.org wrote:
Really hard to say what's going on. Does your DB need optimization? Do the applications hitting it? Maybe some indexing? Maybe some more RAM on the machine would help? What exactly is the workload like -- especially during the time when you're peaked out?
Yes, these are all things we are just starting to look at - so no clear answers yet
Is the system swapping? If so, you either need more memory or need to track down a memory leak.... 'free' and 'sar' can both help you see what swap usage is like.
That's one thing it does not seem to be doing, fortunately.
Writes are always slower on any parity based RAID setup, so I imagine you'd get superior performance on RAID10, especially if you're write heavy.
We are generally read heavy but at certain times of day including 6 to 9am we have huge batches of writes.
But to begin with, it'd be interesting to know exactly what this server is doing. Does it make sense that the disks are being brought to their knees with the given workload?
Oh yes! We are a financial services provider and the DB contains all of the world's stock prices for about the last 10 years :-) And every day new batches of prices come in (at different times depending on markets). This system is just a PostgreSQL DB. Nothing else.
Is the disk array you bought an N-series? (N3300, N3600)? If so, those are NetApps and should be quite fast thanks to heavy write caching. Even then, you'll be limited by spindles it sounds like...
No, the array is IBM brand. EXP3000 (I've since looked it up)
On Fri, Oct 09, 2009 at 01:34:17PM -0400, Alan McKay wrote:
On Fri, Oct 9, 2009 at 1:28 PM, Ray Van Dolson rayvd@bludgeon.org wrote:
Really hard to say what's going on. Does your DB need optimization? Do the applications hitting it? Maybe some indexing? Maybe some more RAM on the machine would help? What exactly is the workload like -- especially during the time when you're peaked out?
Yes, these are all things we are just starting to look at - so no clear answers yet
Is the system swapping? If so, you either need more memory or need to track down a memory leak.... 'free' and 'sar' can both help you see what swap usage is like.
That's one thing it does not seem to be doing, fortunately.
Writes are always slower on any parity based RAID setup, so I imagine you'd get superior performance on RAID10, especially if you're write heavy.
We are generally read heavy but at certain times of day including 6 to 9am we have huge batches of writes.
But to begin with, it'd be interesting to know exactly what this server is doing. Does it make sense that the disks are being brought to their knees with the given workload?
Oh yes! We are a financial services provider and the DB contains all of the world's stock prices for about the last 10 years :-) And every day new batches of prices come in (at different times depending on markets). This system is just a PostgreSQL DB. Nothing else.
I guess disk prices are cheap enough these days that it might be enough to just throw more spindles and RAID10 everything. Buy some time and continue to optimize things at the software level... :)
Otherwise you could probably figure out if you're hitting an IOP limit on your drive setup, saturating the IO path or dealing with CPU limitations stemming from parity calculations... sometimes those hardware RAID controllers are actually slower than software RAID due to their piddly processors (I don't know what controller you have).
Is write caching enabled (and I hope you have battery backup on the controller if it is)? This helps a LOT w/ RAID4/5/6 as you're probably aware... :)
Is the disk array you bought an N-series? (N3300, N3600)? If so, those are NetApps and should be quite fast thanks to heavy write caching. Even then, you'll be limited by spindles it sounds like...
No, the array is IBM brand. EXP3000 (I've since looked it up)
Ah, should be pretty decent. The EXP we have has write caching as well -- however ours is all FC drives. The larger SATA drives might work out fine if you have a lot of spindles and are doing RAID10... I don't like doing RAID5 much with SATA though, as the rebuild times are huge and the drives tend to fail more often than SAS, resulting in disastrous times for all :P
Interested to know how things turn out for you.
Ray
On Fri, Oct 9, 2009 at 1:34 PM, Alan McKay alan.mckay@gmail.com wrote:
On Fri, Oct 9, 2009 at 1:28 PM, Ray Van Dolson rayvd@bludgeon.org wrote:
Really hard to say what's going on. Does your DB need optimization? Do the applications hitting it? Maybe some indexing? Maybe some more RAM on the machine would help? What exactly is the workload like -- especially during the time when you're peaked out?
Yes, these are all things we are just starting to look at - so no clear answers yet
Is the system swapping? If so, you either need more memory or need to track down a memory leak.... 'free' and 'sar' can both help you see what swap usage is like.
That's one thing it does not seem to be doing, fortunately.
Writes are always slower on any parity based RAID setup, so I imagine you'd get superior performance on RAID10, especially if you're write heavy.
We are generally read heavy but at certain times of day including 6 to 9am we have huge batches of writes.
I would relocate the log files to an LV on the two-disk mirror for now, so the writes can batch to the log and then to the database lazily, without hitting the database with the RAID5 write penalty.
Since your writes only come in during a 3-hour window in the morning, you don't necessarily need RAID10 for the whole thing, but the log files need to be off the RAID5, either on RAID10, RAID1 or SSD.
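If it helps, the relocation itself is just a stop-the-database, move-the-directory, leave-a-symlink job. A rough sketch (the paths here are made-up examples -- adjust for your layout, and rehearse it somewhere non-production first):

#!/usr/bin/env python
# Sketch of moving PostgreSQL's WAL (pg_xlog) onto another volume.
# Run ONLY with PostgreSQL stopped, as the postgres user.  Both paths
# below are illustrative examples, not your actual layout.
import os
import shutil

PGDATA = '/var/lib/pgsql/data'    # example data directory
NEW_XLOG = '/mirror/pg_xlog'      # example LV on the RAID1 pair

old = os.path.join(PGDATA, 'pg_xlog')
assert not os.path.islink(old), "pg_xlog is already a symlink"
assert not os.path.exists(NEW_XLOG), "target already exists"

shutil.move(old, NEW_XLOG)        # carry the WAL segments over
os.symlink(NEW_XLOG, old)         # point the data directory at the new home
print("pg_xlog now lives on %s" % NEW_XLOG)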
Oh, and make sure your VGs/LVs/partitions are aligned on the HW RAID stripe boundary, and if Postgres and the file system you use have support for chunk size/stripe width settings, use them.
-Ross
Hi Alan,
You will get the best performance on /dev/sdb by making that a RAID 10 device. The write penalty for R5 is a killer. You will lose a bit of space compared to R5, but the performance is worth it.
As to the external enclosure, 15k 300G SAS drives are dropping in price if you shop around or ask a salesperson for a break since you are buying more than 2-4. Dell gave us a nice break on our last order, down to ~$300 per disk. I would use R10 there as well. Do make sure to look at the controller attaching the array. It may be best to let it do the RAID on the hardware side. It may be best to do 2 or 3 R10 groups to maximize the channels, or one big R10 to take advantage of I/O spindle distribution. Do spread your data around as much as possible.
If you have to use the SATA drives, then definitely fill up the array and spread the load as wide as possible.
Hope this helps.
Tim
On Fri, 2009-10-09 at 12:45 -0400, Alan McKay wrote:
Hey folks,
CentOS / PostgreSQL shop over here.
I'm hitting 3 of my favorite lists with this, so here's hoping that the BCC trick is the right way to do it :-)
We've just discovered, thanks to a new Munin plugin (http://blogs.amd.co.at/robe/2008/12/graphing-linux-disk-io-statistics-with-munin.html), that our production DB is completely maxing out in I/O for about a 3-hour stretch from 6am til 9am. This is "device utilization" as per the last graph at the above link.
Load went down for a while but is now between 70% and 95% sustained. We've only had this plugin going for less than a day so I don't really have any more data going back further. But we've suspected a disk issue for some time - just have not been able to prove it.
Our system:
IBM 3650 - quad 2Ghz e5405 Xeon
8K SAS RAID Controller
6 x 300G 15K/RPM SAS Drives
/dev/sda - 2 drives configured as a RAID 1 for 300G for the OS
/dev/sdb - 3 drives configured as RAID5 for 600G for the DB
1 drive as a global hot spare
/dev/sdb is the one that is maxing out.
We need to have a very serious look at fixing this situation. But we don't have the money to be experimenting with solutions that won't solve our problem. And our budget is fairly limited.
Is there a public library somewhere of disk subsystems and their performance figures? Done with some semblance of a standard benchmark?
One benchmark I am partial to is this one: http://wiki.postgresql.org/wiki/PgCon_2009/Greg_Smith_Hardware_Benchmarking_notes#dd_test
One thing I am thinking of in the immediate term is taking the RAID5 + hot spare and converting it to RAID10 with the same amount of storage. Will that perform much better?
In general we are planning to move away from RAID5 toward RAID10.
We also have on order an external IBM array (don't have the exact name on hand but model number was 3000) with 12 drive bays. We ordered it with just 4 x SATAII drives, and were going to put it on a different system as a RAID10. These are just 7200 RPM drives - the goal was cheaper storage because the SAS drives are about twice as much per drive, and it is only a 300G drive versus the 1T SATA2 drives. IIRC the SATA2 drives are about $200 each and the SAS 300G drives about $500 each.
So I have 2 thoughts with this 12-disk array. One is to fill it up with 12 x cheap SATA2 drives and hope that even though the spin rate is a lot slower, the fact that it has more drives will make it perform better. But somehow I am doubtful about that. The other thought is to bite the bullet and fill it up with 300G SAS drives.
Any thoughts here? Recommendations on what to do with a tight budget? It could be the answer is that I just have to go back to the bean counters and tell them we have no choice but to start spending some real money. But on what? And how do I prove that this is the only choice?
I would second what another poster stated, to look at DB/RAM optimizations. If possible, you should have enough RAM in the system to hold your entire DB in memory - and your DB software should be set up to take advantage of that memory.
Since you stated that your system is only having trouble with writes, when doing batch inserts, I would look at how you are doing these batch inserts and see if you can optimize them by using techniques to delay key writes or adjust locking mechanisms on the insert. Some databases have special insert-from-file or bulk insert operations that are much faster than running many single-row inserts - be sure you are taking advantage of these methods.
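For example, with PostgreSQL the COPY path is dramatically faster than row-at-a-time INSERTs. A minimal psycopg2 sketch of the difference (the connection string, table and column names are made up for illustration, and the columns keyword on copy_from needs a reasonably recent psycopg2):

#!/usr/bin/env python
# Row-by-row INSERT vs a single COPY for a bulk load.  DSN, table and
# column names are illustrative only.
import psycopg2
from StringIO import StringIO    # io.StringIO on Python 3

conn = psycopg2.connect("dbname=market user=postgres")
cur = conn.cursor()

rows = [("2009-10-09", "IBM", "121.64"), ("2009-10-09", "RHT", "27.05")]

# Slow path: one INSERT (and one round trip) per row.
for r in rows:
    cur.execute("INSERT INTO prices (day, symbol, close) VALUES (%s, %s, %s)", r)

# Fast path: stream the whole batch through COPY in one statement.
buf = StringIO("\n".join("\t".join(r) for r in rows) + "\n")
cur.copy_from(buf, "prices", columns=("day", "symbol", "close"))

conn.commit()

COPY skips the per-statement parse/plan and round-trip overhead, which is exactly the kind of thing that hurts during a big morning load window.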
--Blake
Since you stated that your system is only having trouble with writes, when doing batch inserts, I would look at how you are doing these batch inserts and see if you can optimize them by using techniques to delay key writes or adjust locking mechanisms on the insert. Some databases have special insert-from-file or bulk insert operations that are much faster than running many single-row inserts - be sure you are taking advantage of these methods.
Yeah, our DB geeks have already said a few weeks ago that they think we are doing these bulk updates in a pretty inefficient manner, so that's something else we plan to optimize.
On Fri, Oct 9, 2009 at 1:55 PM, Alan McKay alan.mckay@gmail.com wrote:
Since you stated that your system is only having trouble with writes, when doing batch inserts, I would look at how you are doing these batch inserts and see if you can optimize them by using techniques to delay key writes or adjust locking mechanisms on the insert. Some databases have special insert-from-file or bulk insert operations that are much faster than running many single-row inserts - be sure you are taking advantage of these methods.
Yeah, our DB geeks have already said a few weeks ago that they think we are doing these bulk updates in a pretty inefficient manner, so that's something else we plan to optimize.
In addition to the suggestions above (you should definitely be doing RAID10), take a look at the wiki page on optimizing the disks (http://wiki.centos.org/HowTos/Disk_Optimization). Pay special attention to the sections on "noatime" and "RAID Math"; the latter deals with making sure the filesystem blocks are aligned with the RAID blocks, which can potentially reduce the number of I/O operations needed per write, giving a reported 15-25% performance improvement. I think the wiki page isn't totally up to date, but it's a good starting point on the concept.
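The "RAID Math" part is just arithmetic, for what it's worth. A small sketch under the usual ext3 assumptions (4 KB filesystem blocks; plug in your controller's real chunk size and data-disk count -- the numbers below are examples, and stripe-width needs a new enough e2fsprogs):

#!/usr/bin/env python
# ext3 stride / stripe-width arithmetic for aligning the filesystem with
# the RAID layout.  The chunk size and disk count below are examples --
# substitute what your controller actually uses.

chunk_kb = 64        # RAID chunk (strip) size per disk, in KB
data_disks = 2       # data-bearing disks: RAID10 of 4 disks = 2, RAID5 of n = n-1
fs_block_kb = 4      # ext3 block size; 4 KB is the common default

stride = chunk_kb // fs_block_kb
stripe_width = stride * data_disks

print("stride=%d, stripe-width=%d" % (stride, stripe_width))
print("e.g.: mkfs.ext3 -E stride=%d,stripe-width=%d /dev/sdb1" % (stride, stripe_width))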
-------- Original Message --------
Subject: Re: [CentOS] disk I/O problems and Solutions
From: Alan McKay alan.mckay@gmail.com
To: CentOS mailing list centos@centos.org
Date: Friday, October 09, 2009 12:55:35 PM
Since you stated that your system is only having trouble with writes, when doing batch inserts, I would look at how you are doing these batch inserts and see if you can optimize them by using techniques to delay key writes or adjust locking mechanisms on the insert. Some databases have special insert-from-file or bulk insert operations that are much faster than running many single-row inserts - be sure you are taking advantage of these methods.
Yeah, our DB geeks have already said a few weeks ago that they think we are doing these bulk updates in a pretty inefficient manner, so that's something else we plan to optimize.
I meant to include some anecdotal advice in my last post... In my experience, most of our DB servers rarely access the disk. Though we do configure our DB servers for RAID10 now (previously RAID5). RAM is cheap and you should always have plenty of it.
Given your setup, I would be surprised if you would get more than a 100% improvement from transitioning to RAID10 using your current hardware. However, I wouldn't be surprised if you could get a 1,000% to 10,000% improvement through software optimization. This includes proper db software config, db schema, queries, and how you process/prepare data to go in/out of your db.
Keep in mind that if you are disk limited now and top shows no other signs of stress, then you are likely to continue to be disk limited - though you may notice shorter periods of impact.
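A quick way to sanity-check the "hold the whole DB in RAM" idea is to compare pg_database_size() against MemTotal; rough sketch (the connection string is an example):

#!/usr/bin/env python
# Compare the database's on-disk size with the machine's RAM to see how
# realistic caching the whole thing is.  Connection string is an example.
import psycopg2

conn = psycopg2.connect("dbname=market user=postgres")
cur = conn.cursor()
cur.execute("SELECT pg_database_size(current_database())")
db_bytes = cur.fetchone()[0]

mem_kb = 0
for line in open('/proc/meminfo'):
    if line.startswith('MemTotal:'):
        mem_kb = int(line.split()[1])
        break

print("database: %.1f GB   RAM: %.1f GB" % (db_bytes / 1e9, mem_kb * 1024.0 / 1e9))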
Hi Alan,
Strictly on the disk I/O issue, I would test different RAID/disk configs and use Iozone with the .wks output flag, which generates some nice Excel spreadsheets and graphs.
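Something like this will drive Iozone across a couple of record sizes and keep the spreadsheet output it produces. The flags are per the iozone man page on my box (-s file size, -r record size, -i test selection, -b spreadsheet output file), so double-check them against your version, and make -s comfortably larger than RAM so you measure the disks and not the page cache:

#!/usr/bin/env python
# Run Iozone for a couple of record sizes and collect its spreadsheet
# output files.  Check the flags against your iozone version; the sizes
# here are examples.
import subprocess

for rec_kb in (8, 64):
    out = "iozone-%dk.xls" % rec_kb
    cmd = ["iozone",
           "-s", "16g",                        # test file size, well past RAM
           "-r", "%dk" % rec_kb,               # record size
           "-i", "0", "-i", "1", "-i", "2",    # write/rewrite, read/reread, random
           "-b", out]                          # spreadsheet output file
    print("running: %s" % " ".join(cmd))
    subprocess.call(cmd)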
I was at a job years ago where I used a combo of strace, ptrace and some other MySQL debug utils to determine why the DB/server was unusable at certain times.
I used to employ the "throw pasta against the wall to see which sticks" approach, but found it stressful and now use Iozone for disk throughput analysis.
On Oct 9, 2009, at 10:55 AM, Alan McKay wrote:
Since you stated that your system is only having trouble with writes, when doing batch inserts, I would look at how you are doing these batch inserts and see if you can optimize them by using techniques to delay key writes or adjust locking mechanisms on the insert. Some databases have special insert-from-file or bulk insert operations that are much faster than running many single-row inserts - be sure you are taking advantage of these methods.
Yeah, our DB geeks have already said a few weeks ago that they think we are doing these bulk updates in a pretty inefficient manner, so that's something else we plan to optimize.
On Fri, Oct 09, 2009 at 12:45:14PM -0400, Alan McKay wrote:
Hey folks,
CentOS / PostgreSQL shop over here.
I'm hitting 3 of my favorite lists with this, so here's hoping that the BCC trick is the right way to do it :-)
We've just discovered, thanks to a new Munin plugin (http://blogs.amd.co.at/robe/2008/12/graphing-linux-disk-io-statistics-with-munin.html), that our production DB is completely maxing out in I/O for about a 3-hour stretch from 6am til 9am. This is "device utilization" as per the last graph at the above link.
Load went down for a while but is now between 70% and 95% sustained. We've only had this plugin going for less than a day so I don't really have any more data going back further. But we've suspected a disk issue for some time - just have not been able to prove it.
Our system:
IBM 3650 - quad 2Ghz e5405 Xeon
8K SAS RAID Controller
6 x 300G 15K/RPM SAS Drives
/dev/sda - 2 drives configured as a RAID 1 for 300G for the OS
/dev/sdb - 3 drives configured as RAID5 for 600G for the DB
1 drive as a global hot spare
/dev/sdb is the one that is maxing out.
We need to have a very serious look at fixing this situation. But we don't have the money to be experimenting with solutions that won't solve our problem. And our budget is fairly limited.
Is there a public library somewhere of disk subsystems and their performance figures? Done with some semblance of a standard benchmark?
One benchmark I am partial to is this one: http://wiki.postgresql.org/wiki/PgCon_2009/Greg_Smith_Hardware_Benchmarking_notes#dd_test
One thing I am thinking of in the immediate term is taking the RAID5 + hot spare and converting it to RAID10 with the same amount of storage. Will that perform much better?
Does your RAID controller have battery-backed write-cache? That's needed to get good performance from RAID5.
Changing to RAID-10 will definitely help.
In general we are planning to move away from RAID5 toward RAID10.
That's good. RAID5 isn't very good for database use..
We also have on order an external IBM array (don't have the exact name on hand but model number was 3000) with 12 drive bays. We ordered it with just 4 x SATAII drives, and were going to put it on a different system as a RAID10. These are just 7200 RPM drives - the goal was cheaper storage because the SAS drives are about twice as much per drive, and it is only a 300G drive versus the 1T SATA2 drives. IIRC the SATA2 drives are about $200 each and the SAS 300G drives about $500 each.
So I have 2 thoughts with this 12-disk array. One is to fill it up with 12 x cheap SATA2 drives and hope that even though the spin rate is a lot slower, the fact that it has more drives will make it perform better. But somehow I am doubtful about that. The other thought is to bite the bullet and fill it up with 300G SAS drives.
Any thoughts here? Recommendations on what to do with a tight budget? It could be the answer is that I just have to go back to the bean counters and tell them we have no choice but to start spending some real money. But on what? And how do I prove that this is the only choice?
I'd use ltp disktest to measure the performance using various workloads.. sequential and random, using different numbers of threads and different block sizes.
There's only so many random IOPS each disk can do.. 15k SAS drive will be 2-3x faster than 7200 rpm SATA disk, but still 4 disks total isn't THAT much..
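To hang some rough numbers on that (the per-disk figures and write penalties below are rule-of-thumb assumptions, not measurements of your hardware):

#!/usr/bin/env python
# Back-of-the-envelope random IOPS for the arrays being discussed.
# Assumed per-disk figures: ~75 IOPS for 7200 rpm SATA, ~175 for 15k SAS.
# Each random write costs roughly 4 disk I/Os on RAID5 and 2 on RAID10.

def host_iops(disks, per_disk, write_frac, write_penalty):
    raw = disks * per_disk
    return raw / ((1 - write_frac) + write_frac * write_penalty)

print("3x 15k SAS RAID5,     50%% writes: ~%d IOPS" % host_iops(3, 175, 0.5, 4))
print("4x 15k SAS RAID10,    50%% writes: ~%d IOPS" % host_iops(4, 175, 0.5, 2))
print("12x 7200 SATA RAID10, 50%% writes: ~%d IOPS" % host_iops(12, 75, 0.5, 2))

Even with generous assumptions, the 3-disk RAID5 comes out well behind a modest RAID10, which matches what you're seeing.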
-- Pasi
RAID10 should be better on your writes. Random reads and writes are the most important for a DB. For random I/O, the more spindles (disks) you have the better. I would use SAS over SATA if possible. How big is your database? If it is small, you may be able to put it on a few solid state drives; they're really good for random I/O. I don't know of such a website, but one would be nice.
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Alan McKay
Sent: Friday, October 09, 2009 9:45 AM
To: Alan McKay
Subject: [CentOS] disk I/O problems and Solutions
Hey folks,
CentOS / PostgreSQL shop over here.
I'm hitting 3 of my favorite lists with this, so here's hoping that the BCC trick is the right way to do it :-)
We've just discovered, thanks to a new Munin plugin (http://blogs.amd.co.at/robe/2008/12/graphing-linux-disk-io-statistics-with-munin.html), that our production DB is completely maxing out in I/O for about a 3-hour stretch from 6am til 9am. This is "device utilization" as per the last graph at the above link.
Load went down for a while but is now between 70% and 95% sustained. We've only had this plugin going for less than a day so I don't really have any more data going back further. But we've suspected a disk issue for some time - just have not been able to prove it.
Our system:
IBM 3650 - quad 2Ghz e5405 Xeon
8K SAS RAID Controller
6 x 300G 15K/RPM SAS Drives
/dev/sda - 2 drives configured as a RAID 1 for 300G for the OS
/dev/sdb - 3 drives configured as RAID5 for 600G for the DB
1 drive as a global hot spare
/dev/sdb is the one that is maxing out.
We need to have a very serious look at fixing this situation. But we don't have the money to be experimenting with solutions that won't solve our problem. And our budget is fairly limited.
Is there a public library somewhere of disk subsystems and their performance figures? Done with some semblance of a standard benchmark?
One benchmark I am partial to is this one: http://wiki.postgresql.org/wiki/PgCon_2009/Greg_Smith_Hardware_Benchmarking_notes#dd_test
One thing I am thinking of in the immediate term is taking the RAID5 + hot spare and converting it to RAID10 with the same amount of storage. Will that perform much better?
In general we are planning to move away from RAID5 toward RAID10.
We also have on order an external IBM array (don't have the exact name on hand but model number was 3000) with 12 drive bays. We ordered it with just 4 x SATAII drives, and were going to put it on a different system as a RAID10. These are just 7200 RPM drives - the goal was cheaper storage because the SAS drives are about twice as much per drive, and it is only a 300G drive versus the 1T SATA2 drives. IIRC the SATA2 drives are about $200 each and the SAS 300G drives about $500 each.
So I have 2 thoughts with this 12-disk array. One is to fill it up with 12 x cheap SATA2 drives and hope that even though the spin rate is a lot slower, the fact that it has more drives will make it perform better. But somehow I am doubtful about that. The other thought is to bite the bullet and fill it up with 300G SAS drives.
Any thoughts here? Recommendations on what to do with a tight budget? It could be the answer is that I just have to go back to the bean counters and tell them we have no choice but to start spending some real money. But on what? And how do I prove that this is the only choice?
On Fri, 9 Oct 2009, Alan McKay wrote:
Hey folks,
CentOS / PostgreSQL shop over here.
I'm hitting 3 of my favorite lists with this, so here's hoping that the BCC trick is the right way to do it :-)
We've just discovered, thanks to a new Munin plugin (http://blogs.amd.co.at/robe/2008/12/graphing-linux-disk-io-statistics-with-munin.html), that our production DB is completely maxing out in I/O for about a 3-hour stretch from 6am til 9am. This is "device utilization" as per the last graph at the above link.
You make no mention of the file system that you are currently using here either. If you look and see I/O wait rather high, then your file system could also be the culprit. EXT3/4 journalling is terrible under heavy load because kjournald is single-threaded. We have a user here who completely saturates a 4-core, 8GB machine with only a few clients because it can't keep up with the journal writes.
Are you seeing any high I/O waits and lots of kjournald's running?
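A quick way to watch that without setting up sar is to sample /proc/stat; iowait is the fifth CPU field, the same number top and vmstat report as "wa". Rough sketch:

#!/usr/bin/env python
# Sample aggregate CPU iowait from /proc/stat over a short interval.
import time

def cpu_times():
    # First line looks like: "cpu  user nice system idle iowait irq softirq ..."
    fields = open('/proc/stat').readline().split()
    return [int(x) for x in fields[1:]]

a = cpu_times()
time.sleep(5)
b = cpu_times()
delta = [after - before for before, after in zip(a, b)]
print("iowait over the last 5s: %.1f%%" % (100.0 * delta[4] / sum(delta)))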
On Fri, 9 Oct 2009, James A. Peltier wrote:
Are you seeing any high I/O waits and lots of kjournald's running?
Sorry, what I meant to say here is: are you seeing lots of long-running kjournald's?