[CentOS] heavy IO load when working with sparse files (centos 6.4)

Thu Sep 11 04:28:48 UTC 2014
Dave Johansen <davejohansen at gmail.com>

On Mon, Sep 2, 2013 at 12:40 PM, Ron E <ron at questavolta.com> wrote:

> Dear List,
>
> We have noticed a variety of reproducible conditions working with sparse
> files on multiple servers under load with CentOS 6.4.
>
> The short story is that processes that read / write sparse files with
> large "holes" can generate an IO storm. Oddly, this only happens with holes
> and not with the sections of the files that contain data.
>
> We have seen extremely high IO load when, for example, copying a 40 or 80 GB
> sparse file that has only a few GB of data in it. Attempts to lower the
> IO priority and CPU priority of these processes (with ionice and nice) do
> not make any measurable difference. This has been observed with processes
> such as:
>
> cp
> rsync
> sha1sum
>
> The server does have to be under some load to reproduce the necessary
> conditions. The cases we have seen involve servers running 10-30 guests
> under KVM. Load is within acceptable norms when the processes are run, such as
> load avg 5-15 on a 24 core (12 core with HT enabled) server. We also verify
> before starting such a process that the spindle with the file we're working
> on is not being unduly hammered by another process.
>
> These servers have one hardware raid controller each (Dell H700 controller
> with write cache enabled) and multiple raid arrays (separate sets of
> physical spindles). Interestingly, the IO storm is not limited to the array
> / spindles where the sparse file resides but affects all IO on that server.
>
> We have looked extensively and not found any account of a similar issue.
> We have seen this on configurations that are 'plain vanilla' enough to
> think that this is not something specific to our environment.
>
> We are wondering whether anyone else has seen this and has any suggestions
> for gathering more data or troubleshooting. We wonder whether we have found
> a raid controller driver issue, an OS issue, or something else. What seems
> to point in this direction is that even with ionice -c3, which should
> prevent the process from using IO unless the storage is idle, an IO storm
> that appears to saturate the entire raid bus on a given server can occur.
>

Did you ever figure anything out from this? I've noticed a similar sort of
issue on some of our machines, so I was curious if you found the cause of
the issue or any way to improve the situation.
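
In case it helps anyone compare notes, here is a rough Python sketch (my own,
not something from the original report) for checking how sparse a file really
is and where its data sits. The helper names allocation() and data_extents()
are made up for illustration. It assumes Python 3.3+ and a kernel and
filesystem that support SEEK_DATA / SEEK_HOLE; I have not confirmed that the
stock CentOS 6.4 kernel does, and where support is missing the whole file may
simply be reported as one data extent.

```python
import errno
import os


def allocation(path):
    """Return (apparent_bytes, allocated_bytes) for path."""
    st = os.stat(path)
    # On Linux, st_blocks is counted in 512-byte units regardless of the
    # filesystem block size.
    return st.st_size, st.st_blocks * 512


def data_extents(path):
    """Return a list of (start, end) byte ranges reported as data.

    Assumes os.SEEK_DATA / os.SEEK_HOLE are available on this platform.
    """
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Jump to the next byte that is not inside a hole.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # nothing but a hole from here to end of file
                raise
            # Find where that run of data ends (EOF counts as a hole).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return extents
```

Comparing the two numbers from allocation() shows how much of the file is
holes, and a reader that only touches the ranges from data_extents() would
skip the holes entirely. Whether cp, rsync or sha1sum on a given system do
anything like this is a separate question that I have not checked.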

Thanks,
Dave