Hello everyone,

Has anybody had the chance to test out this setup and reproduce the problem? I assumed it would be something that is used often these days, so a solution would benefit a lot of users. If I can be of any assistance, please contact me.

-- 
Kind regards,

Richard Landsman
http://rimote.nl

T: +31 (0)50 - 763 04 07
(Mon-Fri 9:00 to 18:00)

24/7 in case of outages:
+31 (0)6 - 4388 7949
@RimoteSaS (Twitter service notices/security updates)

On 04/10/2017 10:08 AM, Sandro Bonazzola wrote:
> Adding Paolo and Miroslav.
>
> On Sat, Apr 8, 2017 at 4:49 PM, Richard Landsman - Rimote
> <richard at rimote.nl> wrote:
>
> Hello,
>
> I would really appreciate some help/guidance with this problem. First of all, sorry for the long message. I would file a bug, but I do not know whether it is my fault, dm-cache, qemu, or (probably) a combination of them. And I can imagine some of you have this setup up and running without problems (or maybe you think it works, just like I did, but it does not).
>
> PROBLEM
>
> LVM cache writeback stops working as expected after a while with a qemu-kvm VM. A 100% working setup would be the holy grail in my opinion... and the performance of KVM/qemu is great in the beginning, I must say.
>
> DESCRIPTION
>
> When using software RAID 1 (2x HDD) + software RAID 1 (2x SSD) and creating a cached LV out of them, the VM initially performs great (at least 40,000 IOPS on 4k random read/write)! But after a while (and a lot of random IO, ca. 10-20 GB) it effectively turns into a writethrough cache, although there is much space left on the cached LV.
>
> When working as expected, on the KVM host all writes go to the SSDs (sdc/sdd):
>
> iostat -x -m 2
>
> Device:  rrqm/s   wrqm/s     r/s      w/s  rMB/s   wMB/s  avgrq-sz avgqu-sz  await r_await w_await  svctm   %util
> sda        0.00   324.50    0.00    22.00   0.00   14.94   1390.57     1.90  86.39    0.00   86.39   5.32   11.70
> sdb        0.00   324.50    0.00    22.00   0.00   14.94   1390.57     2.03  92.45    0.00   92.45   5.48   12.05
> sdc        0.00  3932.00    0.00  2191.50   0.00  270.07    252.39    37.83  17.55    0.00   17.55   0.36   78.05
> sdd        0.00  3932.00    0.00  2197.50   0.00  271.01    252.57    38.96  18.14    0.00   18.14   0.36   78.95
>
> When not working as expected, on the KVM host all writes go through the SSDs on to the HDDs (sda/sdb), effectively disabling writeback so it becomes a writethrough:
>
> Device:  rrqm/s   wrqm/s     r/s      w/s  rMB/s   wMB/s  avgrq-sz avgqu-sz  await r_await w_await  svctm   %util
> sda        0.00     7.00  234.50   173.50   0.92    1.95     14.38    29.27  71.27  111.89   16.37   2.45  100.00
> sdb        0.00     3.50  212.00   177.50   0.83    1.95     14.60    35.58  91.24  143.00   29.42   2.57  100.10
> sdc        2.50     0.00  566.00   199.00   2.69    0.78      9.28     0.08   0.11    0.13    0.04   0.10    7.70
> sdd        1.50     0.00   76.00   199.00   0.65    0.78     10.66     0.02   0.07    0.16    0.04   0.07    1.85
>
> Stuff I've checked/tried:
>
> - The data in the cached LV has not even exceeded half of the available space, so this should not happen. It even happens when only 20% of cachedata is used.
>
> - It seems to be triggered most of the time when the Cpy%Sync column of `lvs -a` is at about 30%. But this is not always the case!
>
> - Changing the cache policy from smq to cleaner, waiting (check that the flush is done with lvs -a) and then switching back to smq seems to help sometimes! But not always... A scripted version of this workaround is sketched after this list.
>
>   lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv
>   lvs -a
>   lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv
>
> - When mounting the LV inside the host this does not seem to happen! So it looks like a qemu-kvm / dm-cache combination issue. The only difference is that inside the host I do mkfs instead of LVM inside the VM (so it could also be a problem of LVM inside the VM on top of LVM on the KVM host? Small chance, probably, because for the first 10-20 GB it works great!).
>
> - Tried disabling SELinux, upgrading to the newest kernels (elrepo ml and lt), playing around with dirty-page settings like /proc/sys/vm/dirty_writeback_centisecs, /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_ratio, the migration threshold of dmsetup, and other probably unimportant knobs like vm.dirty_bytes.
>
> - When in the "slow state" the system's kworkers are excessively using IO (10-20 MB per kworker process). This seems to be the writeback process (Cpy%Sync), because the cache wants to flush to the HDDs. But the strange thing is that after a complete sync (0% left), the disk may become slow again after a few MBs of data. A reboot sometimes helps.
>
> - Have tried iothreads, virtio-scsi, the vcpu driver setting on the virtio-scsi controller, cache settings, disk schedulers etc. Nothing helped.
>
> - The new Samsung 950 PRO SSDs have HPA enabled (30%!). I have an AMD FX(tm)-8350 and 16 GB RAM.
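>
> A sketch of that policy-switch workaround as a small script (assumptions: the VG/LV is cl/cachedlv as in the reproduce section below, and your lvm2 version reports the cache_dirty_blocks field; check `lvs -o help` if the column name differs):
>
>   #!/bin/bash
>   # Force a full flush of the dm-cache: switch to the cleaner policy,
>   # wait until no dirty blocks remain, then switch back to smq.
>   LV=cl/cachedlv                     # adjust to your VG/LV
>
>   lvchange --cachepolicy cleaner "$LV"
>   # poll the number of dirty cache blocks until the flush has finished
>   while true; do
>       dirty=$(lvs --noheadings -o cache_dirty_blocks "$LV" | tr -d ' ')
>       echo "dirty cache blocks left: ${dirty:-unknown}"
>       [ "${dirty:-1}" = "0" ] && break
>       sleep 5
>   done
>   lvchange --cachepolicy smq "$LV"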
> It feels like the LVM cache has a threshold (about 20 GB of dirty data) at which it stops allowing the qemu-kvm process to use writeback caching (the root user inside the host does not seem to have this limitation). It starts flushing, but only up to a certain point. After a few MBs of data it is right back in the slow spot again. The only solution is waiting for a long time (independent of Cpy%Sync) or sometimes changing the cache policy and forcing a flush. This prevents production use of this system for me. But it is so promising, so I hope somebody can help.
>
> Desired state: doing the FIO test (described in the REPRODUCE section) repeatedly should keep being fast until the cached LV is more or less full. If resyncing back to disk causes this degradation, it should actually flush completely within a reasonable time and give the opportunity to write fast again up to a given threshold. It now seems like a one-time-use cache that only uses a fraction of the SSD and is useless/very unstable afterwards.
>
> REPRODUCE
>
> 1. Install the newest CentOS 7 on software RAID 1 HDDs with LVM. Keep a lot of space for the LVM cache (no /home)! So make the VG as large as possible during anaconda partitioning.
>
> 2. Once installed and booted into the system, install qemu-kvm:
>
>   yum install -y centos-release-qemu-ev
>   yum install -y qemu-kvm-ev libvirt bridge-utils net-tools
>   # disable ksm (probably not important/needed)
>   systemctl disable ksm
>   systemctl disable ksmtuned
>
> 3. Create the LVM cache:
>
>   # set some variables (fill in your own VG name and devices), then
>   # create a RAID 1 array out of the two SSDs
>   VGBASE=                  # your VG name (cl with the installer defaults)
>   ssddevice1=/dev/sdX1
>   ssddevice2=/dev/sdY1
>   hddraiddevice=/dev/mdXXX
>   ssdraiddevice=/dev/mdXXX
>   mdadm --create --verbose ${ssdraiddevice} --level=mirror --bitmap=none \
>     --raid-devices=2 ${ssddevice1} ${ssddevice2}
>
>   # create the PV and extend the VG
>   pvcreate ${ssdraiddevice} && vgextend ${VGBASE} ${ssdraiddevice}
>
>   # create the slow LV on the HDDs (use the max space left if you want)
>   pvdisplay ${hddraiddevice}
>   lvcreate -l XXXX -n cachedlv ${VGBASE} ${hddraiddevice}
>
>   # create the meta and data LVs: for testing purposes I keep about 20G of
>   # the SSD for an uncached LV, to rule out that it is the SSD itself
>   lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}
>
>   # the rest can be used as cachedata/cachemeta
>   pvdisplay ${ssdraiddevice}
>   # use about 1/1000 of the space you have left on the SSD for the meta
>   # (minimum of 4)
>   lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}
>   # the rest can be used as cachedata
>   lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}
>
>   # convert/combine the pools so cachedlv is actually cached
>   lvconvert --type cache-pool --cachemode writeback \
>     --poolmetadata ${VGBASE}/cachemeta ${VGBASE}/cachedata
>   lvconvert --type cache --cachepool ${VGBASE}/cachedata ${VGBASE}/cachedlv
>
> My system now looks like this (the VG is called cl, the default of the installer):
>
> [root at localhost ~]# lvs -a
>   LV                VG Attr       LSize   Pool        Origin
>   [cachedata]       cl Cwi---C---  97.66g
>   [cachedata_cdata] cl Cwi-ao----  97.66g
>   [cachedata_cmeta] cl ewi-ao---- 100.00m
>   cachedlv          cl Cwi-aoC---   1.75t [cachedata] [cachedlv_corig]
>   [cachedlv_corig]  cl owi-aoC---   1.75t
>   [lvol0_pmspare]   cl ewi------- 100.00m
>   root              cl -wi-ao----  46.56g
>   swap              cl -wi-ao----  14.96g
>   testssd           cl -wi-a-----  45.47g
>
> [root at localhost ~]# lsblk
> NAME                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> sdd                        8:48   0   163G  0 disk
> └─sdd1                     8:49   0   163G  0 part
>   └─md128                  9:128  0 162.9G  0 raid1
>     ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
>     │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
>     ├─cl-testssd         253:2    0  45.5G  0 lvm
>     └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
>       └─cl-cachedlv      253:6    0   1.8T  0 lvm
> sdb                        8:16   0   1.8T  0 disk
> ├─sdb2                     8:18   0   1.8T  0 part
> │ └─md127                  9:127  0   1.8T  0 raid1
> │   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
> │   ├─cl-root            253:0    0  46.6G  0 lvm   /
> │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
> │     └─cl-cachedlv      253:6    0   1.8T  0 lvm
> └─sdb1                     8:17   0   954M  0 part
>   └─md126                  9:126  0   954M  0 raid1 /boot
> sdc                        8:32   0   163G  0 disk
> └─sdc1                     8:33   0   163G  0 part
>   └─md128                  9:128  0 162.9G  0 raid1
>     ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
>     │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
>     ├─cl-testssd         253:2    0  45.5G  0 lvm
>     └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
>       └─cl-cachedlv      253:6    0   1.8T  0 lvm
> sda                        8:0    0   1.8T  0 disk
> ├─sda2                     8:2    0   1.8T  0 part
> │ └─md127                  9:127  0   1.8T  0 raid1
> │   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
> │   ├─cl-root            253:0    0  46.6G  0 lvm   /
> │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
> │     └─cl-cachedlv      253:6    0   1.8T  0 lvm
> └─sda1                     8:1    0   954M  0 part
>   └─md126                  9:126  0   954M  0 raid1 /boot
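>
> To see which mode the cache is actually operating in while testing, something like the following can be run on the host (a sketch; it assumes the VG is called cl as above and that your lvm2 version exposes the cache_* report fields, see `lvs -o help`):
>
>   # cache policy, occupancy and dirty blocks as seen by LVM
>   lvs -a -o lv_name,lv_attr,lv_size,pool_lv,cache_policy,copy_percent,cache_used_blocks,cache_dirty_blocks,cache_total_blocks cl
>
>   # raw dm-cache status: the counters include read/write hits and misses,
>   # the number of dirty blocks, and the feature list, which should say
>   # "writeback" rather than "writethrough"
>   dmsetup status cl-cachedlv
>
>   # and watch where the writes actually land (SSD mirror vs HDD mirror)
>   iostat -x -m 2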
> # now create the VM
>
>   wget http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso -P /home/
>   DISK=/dev/mapper/XXXX-cachedlv
>
>   # watch out: my setup uses a custom bridge/network in the following
>   # command. Please replace it with what you normally use.
>   virt-install -n CentOS1 -r 12000 --os-variant=centos6.7 --vcpus 7 \
>     --disk path=${DISK},cache=none,bus=virtio \
>     --network bridge=pubbr,model=virtio \
>     --cdrom /home/CentOS-6.9-x86_64-minimal.iso \
>     --graphics vnc,port=5998,listen=0.0.0.0 --cpu host
>
>   # now connect from a client PC to qemu
>   virt-viewer --connect=qemu+ssh://root@192.168.0.XXX/system --name CentOS1
>
> Install everything on the single vda disk with LVM (I use the defaults in anaconda, but remove the large /home to prevent the SSD from being overused).
>
> After install and reboot, log in to the VM and run
>
>   yum install epel-release -y && yum install screen fio htop -y
>
> and then run the disk test:
>
>   fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
>     --name=test --filename=test --bs=4k --iodepth=64 --size=4G \
>     --readwrite=randrw --rwmixread=75
>
> Then keep repeating the test, but change the --filename attribute each time so it does not use the same blocks over and over again (see the loop sketched below).
>
> In the beginning the performance is great! Wow, very impressive: 150 MB/s of 4k random r/w (close to bare metal, about 20%-30% loss). But after a few runs (usually about 4 or 5, always changing the filename but not overfilling the FS), it drops to about 10 MB/s.
>
> Normal/in the beginning:
>
>   read : io=3073.2MB, bw=183085KB/s, iops=45771, runt= 17188msec
>   write: io=1022.1MB, bw=60940KB/s, iops=15235, runt= 17188msec
>
> But then:
>
>   read : io=3073.2MB, bw=183085KB/s, iops=2904, runt= 17188msec
>   write: io=1022.1MB, bw=60940KB/s, iops=1751, runt= 17188msec
>
> or even worse, up to the point where it is actually the HDDs that are written to (about 500 IOPS).
>
> P.S. When a test is/was slow, that means it is on the HDDs. So even after fixing the problem (sometimes just by waiting), that specific file will keep being slow when redoing the test until it is promoted to the LVM cache (which takes a lot of reads, I think). And once on the SSD it sometimes keeps being fast, although a new test file will be slow. So I really recommend changing the test file every time when trying to see whether a change in speed has occurred.
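>
> A sketch of repeating the test with a fresh file each run (the file names and run count here are arbitrary; keep the total size well below the size of the guest filesystem):
>
>   # run the same fio job several times, each time against a new file,
>   # so the IO is never served from already-promoted cache blocks
>   for i in $(seq 1 8); do
>       echo "=== run $i ==="
>       fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
>           --name=test --filename=test_run${i} --bs=4k --iodepth=64 \
>           --size=4G --readwrite=randrw --rwmixread=75
>   done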
> -- 
> Kind regards,
>
> Richard Landsman
> http://rimote.nl
>
> T: +31 (0)50 - 763 04 07
> (Mon-Fri 9:00 to 18:00)
>
> 24/7 in case of outages:
> +31 (0)6 - 4388 7949
> @RimoteSaS (Twitter service notices/security updates)
>
> --
> SANDRO BONAZZOLA
> ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D
> Red Hat EMEA <https://www.redhat.com/>
> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
>
> _______________________________________________
> CentOS-virt mailing list
> CentOS-virt at centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt