[CentOS] heavy IO load when working with sparse files (CentOS 6.4)

Tue Mar 31 15:34:34 UTC 2015
Dave Johansen <davejohansen at gmail.com>

On Wed, Sep 10, 2014 at 9:28 PM, Dave Johansen <davejohansen at gmail.com>
wrote:

> On Mon, Sep 2, 2013 at 12:40 PM, Ron E <ron at questavolta.com> wrote:
>
>> Dear List,
>>
>> We have noticed a variety of reproducible problem conditions when
>> working with sparse files on multiple servers under load running
>> CentOS 6.4.
>>
>> The short story is that processes that read or write sparse files with
>> large "holes" can generate an IO storm. Oddly, this happens only with
>> the holes and not with the sections of the files that contain data.
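>>
>> For illustration, a sparse file like the ones described below can be
>> created as follows (sizes and file names are illustrative):
>>
>>   # 1 GB of real data followed by a ~39 GB hole at the end
>>   dd if=/dev/urandom of=sparse.img bs=1M count=1024
>>   truncate -s 40G sparse.img
>>   # du reports allocated blocks; ls reports the apparent size
>>   du -h sparse.img; ls -lh sparse.img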
>>
>> We have seen extremely high IO load when, for example, copying a 40 or
>> 80 GB sparse file that contains only a few GB of data. Attempts to
>> lower the IO and CPU priority of these processes (with ionice and
>> nice; see the example after this list) make no measurable difference.
>> This has been observed with processes such as:
>>
>> cp
>> rsync
>> sha1sum
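>>
>> For example, an invocation of this kind might look like the following
>> (exact flags and paths are illustrative):
>>
>>   # idle IO class plus lowest CPU priority; neither helps here
>>   ionice -c3 nice -n 19 cp --sparse=always /vm/disk.img /backup/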
>>
>> The server does have to be under some load to reproduce the necessary
>> conditions. The cases we have seen involve servers running 10-30 guests
>> under KVM. Load is within acceptable norms when the processes are run,
>> e.g. a load average of 5-15 on a 24-core server (12 physical cores with
>> HT enabled). Before starting such a process, we also verify that the
>> spindle holding the file we're working on is not being unduly hammered
>> by another process.
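>>
>> For example (the device name is illustrative):
>>
>>   # confirm the target spindle is quiet before starting
>>   iostat -dx sdb 1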
>>
>> These servers each have one hardware RAID controller (Dell H700 with
>> write cache enabled) and multiple RAID arrays (separate sets of
>> physical spindles). Interestingly, the IO storm is not limited to the
>> array/spindles where the sparse file resides but affects all IO on
>> that server.
>>
>> We have looked extensively and have not found any account of a similar
>> issue. We have seen this on configurations that are 'plain vanilla'
>> enough to think that this is not something specific to our environment.
>>
>> Wondering if anyone else has seen this, and whether anyone has
>> suggestions for gathering more data / troubleshooting. We wonder if
>> we've found a RAID controller driver issue, an OS issue, or some other
>> such thing. What points in this direction is that even with ionice -c3,
>> which should prevent the process from issuing IO unless the storage is
>> otherwise idle, an IO storm can occur that appears to saturate the
>> entire RAID bus on a given server.
>>
>
> Did you ever figure anything out from this? I've noticed a similar
> issue on some of our machines, so I was curious whether you found the
> cause or any way to improve the situation.
>

I made a simple reproducer of the problem I had observed, and the
responses on the Fedora mailing list (
https://lists.fedoraproject.org/pipermail/devel/2015-March/209506.html )
were very helpful.
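
A minimal sketch of that sort of reproducer (sizes and paths are
illustrative; the actual script is in the linked thread):

  # create a large, mostly-hole sparse file, then read it back while
  # the machine is under other IO load; reading the hole is what
  # triggered the behavior described above
  truncate -s 80G /tmp/sparse-test.img
  sha1sum /tmp/sparse-test.img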