[CentOS] disk I/O problems and Solutions

Fri Oct 9 17:28:31 UTC 2009
Ray Van Dolson <rayvd at bludgeon.org>

On Fri, Oct 09, 2009 at 12:45:14PM -0400, Alan McKay wrote:
> Hey folks,
> 
> CentOS / PostgreSQL shop over here.
> 
> I'm hitting 3 of my favorite lists with this, so here's hoping that
> the BCC trick is the right way to do it :-)
> 
> We've just discovered thanks to a new Munin plugin
> http://blogs.amd.co.at/robe/2008/12/graphing-linux-disk-io-statistics-with-munin.html
> that our production DB is completely maxing out in I/O for about a 3
> hour stretch from 6am til 9am.
> This is "device utilization" as per the last graph at the above link.
> 
> Load went down for a while but is now between 70% and 95% sustained.
> We've only had this plugin going for less than a day so I don't really
> have any more data going back further.  But we've suspected a disk
> issue for some time - just have not been able to prove it.

It's really hard to say what's going on from this alone.  Does your DB
need optimization?  Do the applications hitting it need tuning?  Maybe
some indexing would help?  Maybe more RAM on the machine?  What exactly
is the workload like -- especially during the time when you're peaked out?
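
If you want a rough picture of the read/write mix during the peak, here
is a minimal sketch that compares two samples of /proc/diskstats.  The
function names are mine, and I'm assuming the usual field layout (reads
completed, writes completed and milliseconds spent doing I/O in the 4th,
8th and 13th columns), so check it against your kernel before trusting
the numbers.

```python
def parse_diskstats(text):
    """Parse /proc/diskstats text into {device: (reads, writes, io_ms)}."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # partitions on older kernels report fewer fields
        # f[3] = reads completed, f[7] = writes completed,
        # f[12] = milliseconds the device spent doing I/O
        stats[f[2]] = (int(f[3]), int(f[7]), int(f[12]))
    return stats


def workload(before, after, seconds):
    """Per-device (reads/s, writes/s, % utilization) between two samples."""
    report = {}
    for dev, (r1, w1, t1) in after.items():
        if dev not in before:
            continue  # device appeared between the samples; skip it
        r0, w0, t0 = before[dev]
        report[dev] = ((r1 - r0) / float(seconds),
                       (w1 - w0) / float(seconds),
                       100.0 * (t1 - t0) / (seconds * 1000.0))
    return report
```

Read /proc/diskstats twice, N seconds apart, feed both strings through
parse_diskstats() and pass the results to workload() with that N.  A
device that sits near 100% with mostly writes tells a different story
than one that is busy with reads.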

Is the system swapping?  If so, you either need more memory or need to
track down a memory leak.  'free' and 'sar' can both help you see what
swap usage is like.
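
If you'd rather script the swap check, a small sketch along these lines
should do.  It assumes the SwapTotal/SwapFree fields in /proc/meminfo
and the pswpin/pswpout counters in /proc/vmstat; the helper names are
just ones I made up for illustration.

```python
def parse_counters(text):
    """Parse 'name value' or 'name: value kB' lines into {name: int}."""
    counters = {}
    for line in text.splitlines():
        fields = line.replace(":", " ").split()
        if len(fields) >= 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def read_counters(path):
    """Read and parse a file such as /proc/meminfo or /proc/vmstat."""
    with open(path) as f:
        return parse_counters(f.read())


def swap_used_kb(meminfo):
    """Swap currently in use, in kB (SwapTotal minus SwapFree)."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


def swap_pages_per_sec(before, after, seconds):
    """Pages swapped (in, out) per second between two /proc/vmstat samples."""
    return ((after["pswpin"] - before["pswpin"]) / float(seconds),
            (after["pswpout"] - before["pswpout"]) / float(seconds))
```

Some swap in use is not a problem by itself.  What hurts is a steady
nonzero rate from swap_pages_per_sec() during the busy window, so take
the two read_counters("/proc/vmstat") samples while the box is peaked.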

It would also be interesting to know which processes are running and
consuming I/O during this peak period.  top could probably give you an
"OK" picture, but something like iotop or SystemTap could tell you a
lot more (unfortunately, I believe you'll have to wait for CentOS 5.4
to get that functionality).
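
For what it's worth, on kernels that have per-task I/O accounting the
numbers iotop relies on are also readable from /proc/<pid>/io.  Here is
a rough sketch of ranking processes by bytes read and written between
two snapshots.  I'm assuming that file exists with read_bytes and
write_bytes lines, which the stock 5.3 kernel may well not provide, and
the function names are hypothetical.

```python
import os


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value)
    return stats


def snapshot(proc_root="/proc"):
    """Return {pid: (read_bytes, write_bytes)} for every readable process."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                stats = parse_proc_io(f.read())
        except (IOError, OSError):
            continue  # process exited, no permission, or no I/O accounting
        result[int(entry)] = (stats.get("read_bytes", 0),
                              stats.get("write_bytes", 0))
    return result


def top_io(before, after, count=10):
    """Rank pids by bytes read plus written between two snapshots."""
    deltas = []
    for pid, (r1, w1) in after.items():
        # A pid missing from 'before' is charged its total since start.
        r0, w0 = before.get(pid, (0, 0))
        deltas.append((r1 - r0 + w1 - w0, pid))
    deltas.sort(reverse=True)
    return deltas[:count]
```

Run it as root (otherwise you only see your own processes): take one
snapshot(), sleep a few seconds, take another, and hand both to
top_io().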

Writes are slower on any parity-based RAID setup, since each small
write also has to update parity, so I imagine you'd get better write
performance on RAID10, especially if you're write heavy.
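
To put rough numbers on that, here is a back-of-the-envelope sketch
using the textbook write penalties: 2 disk I/Os per small random write
on RAID10, 4 on RAID5, 6 on RAID6.  It ignores controller cache
entirely, and the disk count and per-disk IOPS you feed it are your own
guesses, not anything I know about your array.

```python
# Back-end disk I/Os per small random host write (textbook values).
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def host_iops(disks, iops_per_disk, write_fraction, level):
    """Random host IOPS an array can sustain, ignoring any cache.

    write_fraction is the share of host I/Os that are writes (0.0 to 1.0).
    """
    penalty = WRITE_PENALTY[level]
    raw = disks * iops_per_disk
    # Each read costs one disk I/O, each write costs 'penalty' disk I/Os.
    return raw / ((1.0 - write_fraction) + write_fraction * penalty)
```

With a made-up 8 disks at 150 IOPS each and half the I/O being writes,
host_iops(8, 150, 0.5, "raid10") comes out at 800 while the same disks
as "raid5" manage 480.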

But to begin with, it'd be interesting to know exactly what this server
is doing.  Does it make sense that the disks are being brought to
their knees with the given workload?

Is the disk array you bought an N-series (N3300, N3600)?  If so, those
are NetApps and should be quite fast thanks to heavy write caching.
Even then, it sounds like you'll be limited by the number of spindles.

Ray