On Wed, Sep 10, 2014 at 9:28 PM, Dave Johansen <davejohansen at gmail.com> wrote:
> On Mon, Sep 2, 2013 at 12:40 PM, Ron E <ron at questavolta.com> wrote:
>
>> Dear List,
>>
>> We have noticed a variety of reproducible conditions working with
>> sparse files on multiple servers under load with CentOS 6.4.
>>
>> The short story is that processes that read or write sparse files with
>> large "holes" can generate an IO storm. Oddly, this happens only with
>> the holes and not with the sections of the files that contain data.
>>
>> We have seen extremely high IO load, for example, when copying a 40 or
>> 80 GB sparse file that contains only a few GB of data. Attempts to
>> lower the IO priority and CPU priority of these processes (ionice,
>> nice) make no measurable difference. This has been observed with
>> processes such as:
>>
>> cp
>> rsync
>> sha1sum
>>
>> The server does have to be under some load to reproduce the necessary
>> conditions. The cases we have seen involve servers running 10-30 guests
>> under KVM. Load is within acceptable norms when the processes are run,
>> such as a load average of 5-15 on a 24-core (12 physical cores with HT
>> enabled) server. Before starting such a process, we also verify that
>> the spindle holding the file we are working on is not being unduly
>> hammered by another process.
>>
>> These servers each have one hardware RAID controller (Dell H700 with
>> write cache enabled) and multiple RAID arrays (separate sets of
>> physical spindles). Interestingly, the IO storm is not limited to the
>> array / spindles where the sparse file resides but affects all IO on
>> that server.
>>
>> We have looked extensively and not found any account of a similar
>> issue. We have seen this on configurations that are 'plain vanilla'
>> enough to think that this is not something specific to our environment.
>>
>> We are wondering if anyone else has seen this, and whether anyone has
>> suggestions on gathering more data / troubleshooting. We wonder if we
>> have found a RAID controller driver issue, an OS issue, or some other
>> such thing. What points in this direction is that even with ionice -c3,
>> which should prevent the process from using IO unless the storage is
>> idle, an IO storm that appears to saturate the entire RAID bus on a
>> given server can still occur.
>>
>
> Did you ever figure anything out from this? I've noticed a similar sort
> of issue on some of our machines, so I was curious if you found the
> cause of the issue or any way to improve the situation.
>

I made a simple reproducer of the problem I had observed, and the
responses on the Fedora mailing list
( https://lists.fedoraproject.org/pipermail/devel/2015-March/209506.html )
were very helpful.
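
For the archives, a minimal test along these lines looks roughly like
the following. This is a sketch of the general approach, not the exact
reproducer posted in that thread, and the file path is just an example;
point it at a filesystem on the array you want to test:

    # Create a large file that is entirely a hole: the apparent size is
    # 40 GB but no blocks are allocated on disk.
    truncate -s 40G /data/sparse-test

    # Confirm it is sparse: du reports ~0 actual usage, while ls reports
    # the 40 GB apparent size.
    du -h /data/sparse-test
    ls -lh /data/sparse-test

    # Reading the hole end to end is the problematic workload; per the
    # reports above, even the idle IO class (-c3) may not prevent the
    # storm. Watch iostat -x 1 on all arrays while this runs.
    ionice -c3 sha1sum /data/sparse-test

On an affected system, reading the hole is where the trouble shows up,
even though no data blocks are being fetched from the spindles.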