[CentOS-virt] lvm cache + qemu-kvm stops working after about 20GB of writes
Sandro Bonazzola
sbonazzo at redhat.com
Mon Apr 10 08:08:21 UTC 2017
Adding Paolo and Miroslav.
On Sat, Apr 8, 2017 at 4:49 PM, Richard Landsman - Rimote <richard at rimote.nl
> wrote:
> Hello,
>
> I would really appreciate some help/guidance with this problem. First of
> all, sorry for the long message. I would file a bug, but I do not know whether
> the fault lies with me, dm-cache, qemu, or (probably) a combination of both. And I can
> imagine some of you have this setup up and running without problems (or
> maybe you think it works, just like I did, but it does not):
>
> PROBLEM
> LVM cache in writeback mode stops working as expected after a while under a
> qemu-kvm VM. A 100% working setup would be the holy grail in my opinion,
> and I must say the initial performance of KVM/qemu is great.
>
> DESCRIPTION
>
> When using software RAID 1 (2x HDD) + software RAID 1 (2x SSD) and creating a
> cached LV out of them, the VM initially performs great (at least 40,000
> IOPS on 4k random read/write)! But after a while (and a lot of random
> IO, ca. 10 - 20 GB) it effectively turns into a writethrough cache, although
> there is plenty of space left on the cachedlv.
>
>
> When working as expected, on the KVM host all writes go to the SSDs:
>
> iostat -x -m 2
>
> Device:  rrqm/s  wrqm/s    r/s      w/s   rMB/s   wMB/s  avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sda        0.00  324.50   0.00    22.00   0.00   14.94   1390.57     1.90  86.39    0.00   86.39   5.32  11.70
> sdb        0.00  324.50   0.00    22.00   0.00   14.94   1390.57     2.03  92.45    0.00   92.45   5.48  12.05
> sdc        0.00 3932.00   0.00  2191.50   0.00  270.07    252.39    37.83  17.55    0.00   17.55   0.36  78.05
> sdd        0.00 3932.00   0.00  2197.50   0.00  271.01    252.57    38.96  18.14    0.00   18.14   0.36  78.95
>
>
> When not working as expected, on the KVM host all writes go through the SSDs on
> to the HDDs (effectively disabling writeback, so it becomes a writethrough cache):
>
> Device:  rrqm/s  wrqm/s    r/s      w/s   rMB/s   wMB/s  avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sda        0.00    7.00 234.50   173.50   0.92    1.95     14.38    29.27  71.27  111.89   16.37   2.45 100.00
> sdb        0.00    3.50 212.00   177.50   0.83    1.95     14.60    35.58  91.24  143.00   29.42   2.57 100.10
> sdc        2.50    0.00 566.00   199.00   2.69    0.78      9.28     0.08   0.11    0.13    0.04   0.10   7.70
> sdd        1.50    0.00  76.00   199.00   0.65    0.78     10.66     0.02   0.07    0.16    0.04   0.07   1.85
>
>
> Stuff I've checked/tried:
>
> - The data in the cached LV has not even exceeded half of the available space,
> so this should not happen. It even happens when only 20% of cachedata is
> used.
> - It seems to be triggered most of the time when the Cpy%Sync column of `lvs
> -a` is at about 30%. But this is not always the case!
> - Changing the cache policy from smq to cleaner, waiting (check flush progress
> with lvs -a) and then switching back to smq seems to help *sometimes*! But not
> always...
>
> lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv
>
> lvs -a
>
> lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv
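The cleaner/smq workaround above can be scripted so the flush is actually waited on; a minimal sketch, assuming the cached LV is cl/cachedlv (adjust to your VG/LV names) and root privileges:

```shell
#!/bin/bash
# Sketch of the cleaner -> wait-for-flush -> smq workaround described above.
# The LV name "cl/cachedlv" is an assumption; lvchange/lvs need root.
flush_cache() {
    local lv="${1:-cl/cachedlv}"
    lvchange --cachepolicy cleaner "$lv"
    # Poll the dirty-block counter until the cache is fully written back.
    while true; do
        dirty=$(lvs --noheadings -o cache_dirty_blocks "$lv" | tr -d ' ')
        [ "${dirty:-0}" -eq 0 ] && break
        sleep 5
    done
    lvchange --cachepolicy smq "$lv"
}
# Usage (as root): flush_cache cl/cachedlv
```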
>
> - *When mounting the LV inside the host, this does not seem to happen!!*
> So it looks like a qemu-kvm / dm-cache combination issue. The only difference
> is that inside the host I run mkfs directly instead of using LVM inside the VM
> (so it could also be an LVM-inside-VM on top of LVM-on-KVM-host problem?
> Small chance, probably, because for the first 10 - 20 GB it works great!)
>
> - Tried disabling SELinux, upgrading to the newest kernels (elrepo ml and lt),
> played around with the dirty-page settings such as /proc/sys/vm/dirty_writeback_centisecs,
> /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_ratio, the
> migration threshold of dmsetup, and other probably unimportant knobs
> like vm.dirty_bytes.
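For reference, the writeback knobs mentioned here can be dumped in one go before and after a slowdown; a read-only sketch (the dmsetup line is commented out because the dm device name is an assumption and it needs root):

```shell
#!/bin/bash
# Print the VM dirty-page settings that influence writeback behaviour,
# so the values in play can be recorded before/after a slowdown.
for f in dirty_writeback_centisecs dirty_expire_centisecs \
         dirty_ratio dirty_background_ratio dirty_bytes; do
    printf '%-28s %s\n' "$f" "$(cat /proc/sys/vm/$f)"
done
# The dm-cache migration_threshold (in sectors) shows up in the status of
# the cache target; needs root and the real dm name of the cached LV:
# dmsetup status cl-cachedlv
```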
>
> - When in the "slow state", the system's kworkers use IO excessively (10 -
> 20 MB per kworker process). This seems to be the writeback process
> (Cpy%Sync), because the cache wants to flush to the HDDs. But the strange thing
> is that after a full sync (0% left), the disk may become slow again after a
> few MBs of data. A reboot sometimes helps.
>
> - Have tried iothreads, virtio-scsi, the vcpu driver setting on the virtio-scsi
> controller, cache settings, disk schedulers etc. Nothing helped.
>
> - The new Samsung 950 PRO SSDs have HPA enabled (30%!!); I have an AMD
> FX(tm)-8350 and 16 GB RAM.
>
> It feels like the LVM cache has a threshold (about 20 GB of dirty data),
> after which it stops allowing the qemu-kvm process to use writeback
> caching (using the LV directly from the host does not seem to have this
> limitation). It starts flushing, but only up to a certain point. After a few
> MBs of data it is right back in the slow spot again. The only solution is
> waiting for a long time (independent of Cpy%Sync) or sometimes changing the
> cache policy and forcing a flush. This prevents production use of this
> system for me. But it's so promising, so I hope somebody can help.
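One way to see whether the cache itself thinks it has hit a dirty-block ceiling is to watch the cache counters lvs can report while the test runs; a sketch, again assuming the LV is cl/cachedlv and root privileges:

```shell
#!/bin/bash
# Report the dm-cache block counters for a cached LV. If cache_dirty_blocks
# plateaus while iostat shows writes landing on the HDDs, writeback has
# stopped absorbing new IO. LV name "cl/cachedlv" is an assumption.
watch_cache() {
    lvs --noheadings \
        -o lv_name,cache_total_blocks,cache_used_blocks,cache_dirty_blocks \
        "${1:-cl/cachedlv}"
}
# Usage (as root), e.g. every 2s during the fio run:
#   watch -n2 "lvs -o lv_name,cache_dirty_blocks cl/cachedlv"
```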
>
> Desired state: doing the FIO test (described in the reproduce section)
> repeatedly should keep being fast until the cachedlv is more or less full. If
> resyncing back to disk causes this degradation, it should flush
> fully within a reasonable time and allow fast writes again up
> to a given threshold. It now seems like a one-time-use cache that only uses
> a fraction of the SSD and is useless/very unstable afterwards.
>
> REPRODUCE
> 1. Install the newest CentOS 7 on software RAID 1 HDDs with LVM. Keep a lot of
> space free for the LVM cache (no /home)! So make the VG as large as possible
> during anaconda partitioning.
>
> 2. Once installed and booted into the system, install qemu-kvm:
>
> yum install -y centos-release-qemu-ev
> yum install -y qemu-kvm-ev libvirt bridge-utils net-tools
> # disable ksm (probably not important / needed)
> systemctl disable ksm
> systemctl disable ksmtuned
>
> 3. create LVM cache
>
> #set some variables and create a raid1 array with the two SSDs
>
> VGBASE= && ssddevice1=/dev/sdX1 && ssddevice2=/dev/sdX1 &&
> hddraiddevice=/dev/mdXXX && ssdraiddevice=/dev/mdXXX
> mdadm --create --verbose ${ssdraiddevice} --level=mirror --bitmap=none \
>   --raid-devices=2 ${ssddevice1} ${ssddevice2}
>
> # create PV and extend VG
>
> pvcreate ${ssdraiddevice} && vgextend ${VGBASE} ${ssdraiddevice}
>
> # create slow LV on HDDs (use max space left if you want)
>
> pvdisplay ${hddraiddevice}
> lvcreate -lXXXX -n cachedlv ${VGBASE} ${hddraiddevice}
>
> # create the meta and data LVs: for testing purposes I keep about 20G of the
> SSD for an uncached LV, to rule out the SSD itself.
>
> lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}
>
> # The rest can be used as cachedata/metadata.
>
> pvdisplay ${ssdraiddevice}
> # about 1/1000 of the space you have left on the SSD for the meta (minimum
> of 4)
> lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}
> # the rest can be used as cachedata
> lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}
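As an aside, the 1/1000 rule of thumb above can be turned into extent counts up front; a sketch with made-up numbers (4 MiB is the LVM default extent size, and lvm2 enforces a minimum of 8 MiB for cache-pool metadata):

```shell
#!/bin/bash
# Rough arithmetic for the "metadata ~= data/1000" rule of thumb.
# Takes the intended cache-data size in MiB and the VG extent size in MiB,
# and prints suggested -l extent counts for the lvcreate calls above.
suggest_cache_extents() {
    local data_mib=$1 extent_mib=${2:-4}
    local meta_mib=$(( data_mib / 1000 ))
    # lvm2 wants at least 8 MiB of cache-pool metadata, so enforce a floor.
    [ "$meta_mib" -lt 8 ] && meta_mib=8
    echo "data extents: $(( data_mib / extent_mib ))"
    echo "meta extents: $(( (meta_mib + extent_mib - 1) / extent_mib ))"
}
suggest_cache_extents 100000   # ~100 GiB of cache data
# -> data extents: 25000, meta extents: 25 (i.e. a 100 MiB metadata LV,
#    which matches the 97.66g/100.00m split shown below)
```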
>
> # convert/combine pools so cachedlv is actually cached
>
> lvconvert --type cache-pool --cachemode writeback \
>   --poolmetadata ${VGBASE}/cachemeta ${VGBASE}/cachedata
>
> lvconvert --type cache --cachepool ${VGBASE}/cachedata ${VGBASE}/cachedlv
>
>
> # my system now looks like this (VG is called cl, the installer default)
> [root at localhost ~]# lvs -a
>   LV                VG Attr       LSize   Pool        Origin
>   [cachedata]       cl Cwi---C---  97.66g
>   [cachedata_cdata] cl Cwi-ao----  97.66g
>   [cachedata_cmeta] cl ewi-ao---- 100.00m
>   cachedlv          cl Cwi-aoC---   1.75t [cachedata] [cachedlv_corig]
>   [cachedlv_corig]  cl owi-aoC---   1.75t
>   [lvol0_pmspare]   cl ewi------- 100.00m
>   root              cl -wi-ao----  46.56g
>   swap              cl -wi-ao----  14.96g
>   testssd           cl -wi-a-----  45.47g
>
> [root at localhost ~]# lsblk
>
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sdd 8:48 0 163G 0 disk
> └─sdd1 8:49 0 163G 0 part
> └─md128 9:128 0 162.9G 0 raid1
> ├─cl-cachedata_cmeta 253:4 0 100M 0 lvm
> │ └─cl-cachedlv 253:6 0 1.8T 0 lvm
> ├─cl-testssd 253:2 0 45.5G 0 lvm
> └─cl-cachedata_cdata 253:3 0 97.7G 0 lvm
> └─cl-cachedlv 253:6 0 1.8T 0 lvm
> sdb 8:16 0 1.8T 0 disk
> ├─sdb2 8:18 0 1.8T 0 part
> │ └─md127 9:127 0 1.8T 0 raid1
> │ ├─cl-swap 253:1 0 15G 0 lvm [SWAP]
> │ ├─cl-root 253:0 0 46.6G 0 lvm /
> │ └─cl-cachedlv_corig 253:5 0 1.8T 0 lvm
> │ └─cl-cachedlv 253:6 0 1.8T 0 lvm
> └─sdb1 8:17 0 954M 0 part
> └─md126 9:126 0 954M 0 raid1 /boot
> sdc 8:32 0 163G 0 disk
> └─sdc1 8:33 0 163G 0 part
> └─md128 9:128 0 162.9G 0 raid1
> ├─cl-cachedata_cmeta 253:4 0 100M 0 lvm
> │ └─cl-cachedlv 253:6 0 1.8T 0 lvm
> ├─cl-testssd 253:2 0 45.5G 0 lvm
> └─cl-cachedata_cdata 253:3 0 97.7G 0 lvm
> └─cl-cachedlv 253:6 0 1.8T 0 lvm
> sda 8:0 0 1.8T 0 disk
> ├─sda2 8:2 0 1.8T 0 part
> │ └─md127 9:127 0 1.8T 0 raid1
> │ ├─cl-swap 253:1 0 15G 0 lvm [SWAP]
> │ ├─cl-root 253:0 0 46.6G 0 lvm /
> │ └─cl-cachedlv_corig 253:5 0 1.8T 0 lvm
> │ └─cl-cachedlv 253:6 0 1.8T 0 lvm
> └─sda1 8:1 0 954M 0 part
> └─md126 9:126 0 954M 0 raid1 /boot
>
> # now create vm
> wget http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso -P /home/
> DISK=/dev/mapper/XXXX-cachedlv
>
> # watch out, my network setup uses a custom bridge in the following
> command. Please replace it with what you normally use.
> virt-install -n CentOS1 -r 12000 --os-variant=centos6.7 --vcpus 7 \
>   --disk path=${DISK},cache=none,bus=virtio \
>   --network bridge=pubbr,model=virtio \
>   --cdrom /home/CentOS-6.9-x86_64-minimal.iso \
>   --graphics vnc,port=5998,listen=0.0.0.0 --cpu host
>
> # now connect with client PC to qemu
> virt-viewer --connect=qemu+ssh://root@192.168.0.XXX/system --name CentOS1
>
> And install everything on the single vda disk with LVM (I use the defaults in
> anaconda, but remove the large /home to prevent the SSD being overused).
>
> After install and reboot log in to VM and
>
> yum install epel-release -y && yum install screen fio htop -y
>
> and then run disk test:
>
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
>   --name=test --filename=test --bs=4k --iodepth=64 --size=4G \
>   --readwrite=randrw --rwmixread=75
>
> then *keep repeating* it, but *change the filename* attribute so it does not
> use the same blocks over and over again.
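The repeat-with-a-new-filename step can be scripted; a sketch that just prints the five invocations (drop the echo, or pipe the output to bash, to actually run them inside the guest):

```shell
#!/bin/bash
# Generate one fio command per pass, each with a unique --filename so a
# fresh set of blocks is written instead of re-hitting already-cached ones.
gen_fio_cmds() {
    local i
    for i in 1 2 3 4 5; do
        echo fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
            --name="test$i" --filename="test$i" --bs=4k --iodepth=64 \
            --size=4G --readwrite=randrw --rwmixread=75
    done
}
gen_fio_cmds
```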
>
> In the beginning the performance is great!! Wow, very impressive: 150 MB/s
> 4k random r/w (close to bare metal, about 20% - 30% loss). But after a few
> runs (usually about 4 or 5; always changing the filename, but not
> overfilling the FS), it drops to about 10 MB/s.
>
> normal/in the beginning
>
> read : io=3073.2MB, bw=183085KB/s, iops=45771, runt= 17188msec
> write: io=1022.1MB, bw=60940KB/s, iops=15235, runt= 17188msec
>
> but then
>
> read : io=3073.2MB, bw=183085KB/s, iops=2904, runt= 17188msec
> write: io=1022.1MB, bw=60940KB/s, iops=1751, runt= 17188msec
>
> or even worse up to the point that it is actually the HDD that is written
> to (about 500 iops).
>
> P.S. When a test is/was slow, that means the file is on the HDDs. So even
> after the problem clears (sometimes just by waiting), that specific file will
> keep being slow when redoing the test until it is promoted to the LVM cache
> (which takes a lot of reads, I think). And once on the SSD it sometimes keeps
> being fast, although a new testfile will be slow. So I really recommend
> changing the testfile every time when trying to see whether the speed
> has changed.
>
> --
> Met vriendelijke groet,
>
> Richard Landsman
> http://rimote.nl
>
> T: +31 (0)50 - 763 04 07
> (ma-vr 9:00 tot 18:00)
>
> 24/7 bij storingen:
> +31 (0)6 - 4388 7949
> @RimoteSaS (Twitter Serviceberichten/security updates)
>
>
> _______________________________________________
> CentOS-virt mailing list
> CentOS-virt at centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt
>
>
--
SANDRO BONAZZOLA
ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D
Red Hat EMEA <https://www.redhat.com/>
<https://red.ht/sig>
TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>