On Fri, Oct 09, 2009 at 12:45:14PM -0400, Alan McKay wrote:
Hey folks,
CentOS / PostgreSQL shop over here.
I'm hitting 3 of my favorite lists with this, so here's hoping that the BCC trick is the right way to do it :-)
We've just discovered, thanks to a new Munin plugin http://blogs.amd.co.at/robe/2008/12/graphing-linux-disk-io-statistics-with-m..., that our production DB is completely maxing out in I/O for about a 3 hour stretch from 6am til 9am. This is "device utilization" as per the last graph at the above link.
Load went down for a while but is now between 70% and 95% sustained. We've only had this plugin going for less than a day so I don't really have any more data going back further. But we've suspected a disk issue for some time - just have not been able to prove it.
Really hard to say what's going on. Does your DB need optimization? Do the applications hitting it need tuning? Maybe some indexing? Maybe some more RAM on the machine would help? What exactly is the workload like -- especially during the time when you're peaked out?
Is the system swapping? If so, you either need more memory or need to track down a memory leak.... 'free' and 'sar' can both help you see what swap usage is like.
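If you'd rather script the swap check than eyeball 'free', here is a rough sketch. It assumes a Linux box where /proc/vmstat exposes the pswpin and pswpout counters (pages swapped in and out since boot); the function names are made up for illustration.

```python
# Sketch only: compare two samples of the kernel's swap counters.
# Assumes /proc/vmstat contains "pswpin" and "pswpout" lines.
import time


def parse_swap_counters(vmstat_text):
    """Return (pswpin, pswpout) parsed from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_delta(before, after):
    """Pages swapped (in, out) between two (pswpin, pswpout) samples."""
    return after[0] - before[0], after[1] - before[1]


def sample_live(interval_seconds=5, path="/proc/vmstat"):
    """Read the counters twice, interval_seconds apart, and return the delta."""
    with open(path) as handle:
        first = parse_swap_counters(handle.read())
    time.sleep(interval_seconds)
    with open(path) as handle:
        second = parse_swap_counters(handle.read())
    return swap_delta(first, second)
```

Calling sample_live() during the 6am to 9am window and seeing steadily non-zero deltas would suggest the box really is swapping; zeros would point back at the database workload itself.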
It would be interesting to know which processes are running and consuming IO during this peak period as well. top could probably give you an "OK" picture, but something like iotop or SystemTap could tell you a lot more (unfortunately you'll have to wait for 5.4 to get that functionality I believe).
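As a stopgap, something along these lines could rank processes by the I/O the kernel has attributed to them. Treat it as a sketch: it assumes the kernel exposes /proc/<pid>/io with read_bytes and write_bytes fields, which depends on per-task I/O accounting being enabled, so the same caveat about kernel support probably applies here. The helper names are hypothetical.

```python
# Sketch only: rank processes by cumulative bytes read + written.
# Assumes /proc/<pid>/io exists (needs kernel task I/O accounting).
import os


def parse_proc_io(io_text):
    """Turn the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    fields = {}
    for line in io_text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            fields[key.strip()] = int(value)
    return fields


def top_io(snapshots, limit=5):
    """snapshots maps pid -> parsed fields; return the heaviest pids first."""
    def total_bytes(item):
        stats = item[1]
        return stats.get("read_bytes", 0) + stats.get("write_bytes", 0)
    return sorted(snapshots.items(), key=total_bytes, reverse=True)[:limit]


def collect(proc_root="/proc"):
    """Read the io file of every pid we are allowed to see."""
    result = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "io")) as handle:
                result[int(name)] = parse_proc_io(handle.read())
        except (IOError, OSError):
            continue  # process exited, or we lack permission
    return result
```

The counters are cumulative since each process started, so for a picture of the peak period you would want to take two snapshots a few seconds apart and compare them. Run it as root, otherwise other users' processes are skipped.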
Writes are always slower on any parity based RAID setup, so I imagine you'd get superior performance on RAID10, especially if you're write heavy.
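To put rough numbers on that, here is a back-of-the-envelope sketch using the common rule-of-thumb write penalties (2 back-end I/Os per write for RAID10, 4 for RAID5, 6 for RAID6). The disk count and per-disk IOPS in the usage note are made-up figures, and the model ignores controller caches and full-stripe writes, so take it as an estimate only.

```python
# Sketch only: rule-of-thumb estimate of front-end IOPS for a RAID set.
# Penalties are the usual textbook values, not measurements.
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def effective_iops(disks, iops_per_disk, write_fraction, level):
    """Estimate host-visible IOPS given a read/write mix.

    Each read costs one back-end I/O; each write costs WRITE_PENALTY[level].
    """
    raw_iops = disks * iops_per_disk
    read_fraction = 1.0 - write_fraction
    cost_per_io = read_fraction + write_fraction * WRITE_PENALTY[level]
    return raw_iops / cost_per_io
```

For a hypothetical 8 disks at 150 IOPS each with a 50/50 read/write mix, this works out to 800 IOPS on RAID10 against 480 on RAID5, and the gap widens as the write share grows.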
But to begin with, it'd be interesting to know exactly what this server is doing. Does it make sense that the disks are being brought to their knees with the given workload?
Is the disk array you bought an N-series (N3300, N3600)? If so, those are NetApps and should be quite fast thanks to heavy write caching. Even then, it sounds like you'll be limited by spindles...
Ray