Dear List,
We have noticed a reproducible problem when working with sparse files on multiple servers under load running CentOS 6.4.
The short story is that processes that read or write sparse files with large "holes" can generate an IO storm. Oddly, this happens only with the holes, not with the sections of the files that contain data.
We have seen extremely high IO load when, for example, copying a 40 or 80 GB sparse file that contains only a few GB of data. Attempts to lower the IO and CPU priority of these processes (with ionice and nice) make no measurable difference. This has been observed with processes such as the following (a minimal reproduction sketch follows the list):
cp
rsync
sha1sum
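For reference, here is roughly how to set up a comparable sparse file and trigger the behaviour; the paths and sizes are illustrative:

    # allocate an 80 GB sparse file, then write ~2 GB of real data at the
    # start; everything past that remains a hole
    truncate -s 80G /srv/test/sparse.img
    dd if=/dev/urandom of=/srv/test/sparse.img bs=1M count=2048 conv=notrunc

    # simply reading the file back (cp, rsync, sha1sum) is enough to
    # trigger the IO storm on an otherwise loaded server
    cp /srv/test/sparse.img /srv/test/sparse-copy.img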
The server does have to be under some load to reproduce the necessary conditions. The cases we have seen involve servers running 10-30 guests under KVM. Load is within acceptable norms when the processes are run, e.g. a load average of 5-15 on a 24-core server (12 cores with HT enabled). We also verify, before starting such a process, that the spindle holding the file we are working on is not being unduly hammered by another process.
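For the record, we spot-check per-device utilization before starting with something like the following (sdb here standing in for the target array):

    # extended per-device stats once per second; %util near 100 on the
    # target array would mean it is already saturated
    iostat -x sdb 1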
These servers each have one hardware RAID controller (a Dell H700 with write cache enabled) and multiple RAID arrays (separate sets of physical spindles). Interestingly, the IO storm is not limited to the array / spindles where the sparse file resides but affects all IO on that server.
We have searched extensively and found no account of a similar issue. We have seen this on configurations 'plain vanilla' enough to believe the problem is not specific to our environment.
We are wondering whether anyone else has seen this, and whether anyone has suggestions for gathering more data or troubleshooting. We suspect we may have found a RAID controller driver issue, an OS issue, or something similar. What points in this direction is that even with ionice -c3, which should prevent the process from doing IO unless the storage is otherwise idle, an IO storm can occur that appears to saturate the entire RAID bus on a given server.
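For completeness, this is the sort of invocation we have tried (paths illustrative):

    # idle IO class plus lowest CPU priority; the idle class is honored
    # by the CFQ scheduler, which is the CentOS 6 default
    ionice -c3 nice -n 19 cp /srv/vm/guest.img /backup/guest.img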
Thanks in advance.