[CentOS-virt] lvm cache + qemu-kvm stops working after about 20GB of writes

Mon Apr 10 08:08:21 UTC 2017
Sandro Bonazzola <sbonazzo at redhat.com>

Adding Paolo and Miroslav.

On Sat, Apr 8, 2017 at 4:49 PM, Richard Landsman - Rimote <richard at rimote.nl
> wrote:

> Hello,
>
> I would really appreciate some help/guidance with this problem. First of
> all, sorry for the long message. I would file a bug, but I do not know if it
> is my fault, dm-cache, qemu or (probably) a combination of both. And I can
> imagine some of you have this setup up and running without problems (or
> maybe you think it works, just like I did, but it does not):
>
> PROBLEM
> LVM cache writeback stops working as expected after a while with a
> qemu-kvm VM. A 100% working setup would be the holy grail in my opinion,
> and the performance of KVM/qemu is great in the beginning, I must say.
>
> DESCRIPTION
>
> When using software RAID 1 (2x HDD) + software RAID 1 (2x SSD) and creating
> a cached LV out of them, the VM initially performs great (at least 40,000
> IOPS on 4k random read/write)! But after a while (and a lot of random IO,
> ca. 10-20 GB) it effectively turns into a writethrough cache, although
> there is plenty of space left on the cached LV.
>
>
> When working as expected, all writes on the KVM host go to the SSDs:
>
> iostat -x -m 2
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00   324.50    0.00   22.00     0.00    14.94  1390.57     1.90   86.39    0.00   86.39   5.32  11.70
> sdb               0.00   324.50    0.00   22.00     0.00    14.94  1390.57     2.03   92.45    0.00   92.45   5.48  12.05
> sdc               0.00  3932.00    0.00 2191.50     0.00   270.07   252.39    37.83   17.55    0.00   17.55   0.36  78.05
> sdd               0.00  3932.00    0.00 2197.50     0.00   271.01   252.57    38.96   18.14    0.00   18.14   0.36  78.95
>
>
> When not working as expected, all writes on the KVM host go through the SSDs
> on to the HDDs (effectively disabling writeback, so it becomes a writethrough):
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     7.00  234.50  173.50     0.92     1.95    14.38    29.27   71.27  111.89   16.37   2.45 100.00
> sdb               0.00     3.50  212.00  177.50     0.83     1.95    14.60    35.58   91.24  143.00   29.42   2.57 100.10
> sdc               2.50     0.00  566.00  199.00     2.69     0.78     9.28     0.08    0.11    0.13    0.04   0.10   7.70
> sdd               1.50     0.00   76.00  199.00     0.65     0.78    10.66     0.02    0.07    0.16    0.04   0.07   1.85
>
>
> Stuff I've checked/tried:
>
> - The data in the cached LV has by then not even exceeded half of the cache
> space, so this should not happen. It even happens when only 20% of cachedata
> is used.
> - It seems to be triggered most of the time when the Cpy%Sync column of `lvs
> -a` is at about 30%. But this is not always the case!
> - Changing the cachepolicy from smq to cleaner, waiting (check with lvs -a
> that the flush is done) and then switching back to smq seems to help
> *sometimes*! But not always... (a scripted sketch of this workaround follows
> after this list)
>
> lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv
>
> lvs -a
>
> lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv
>
> - *When mounting the LV inside the host this does not seem to happen!*
> So it looks like a qemu-kvm / dm-cache combination issue. The only difference
> is that inside the host I do mkfs directly instead of LVM inside the VM (so
> it could also be an LVM-inside-VM on top of LVM-on-KVM-host problem? Probably
> a small chance, because for the first 10-20 GB it works great!)
>
> - Tried disabling SELinux, upgrading to the newest kernels (elrepo ml and
> lt), played around with dirty cache settings like
> /proc/sys/vm/dirty_writeback_centisecs, /proc/sys/vm/dirty_expire_centisecs
> and /proc/sys/vm/dirty_ratio, the migration threshold of dmsetup, and other
> probably unimportant stuff like vm.dirty_bytes
>
> - When in the "slow state" the system's kworkers are using IO excessively
> (10 - 20 MB per kworker process). This seems to be the writeback process
> (Cpy%Sync) because the cache wants to flush to the HDDs. But the strange
> thing is that after a full sync (0% left), the disk may become slow again
> after a few MB of data. A reboot sometimes helps.
>
> - Have tried iothreads, virtio-scsi, the vcpu driver setting on the
> virtio-scsi controller, cache settings, disk schedulers etc. Nothing helped.
>
> - The new Samsung 950 PRO SSDs have HPA enabled (30%!); I have an AMD
> FX(tm)-8350 and 16 GB RAM.
>
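> A scripted version of that cachepolicy workaround could look roughly like
> this (untested sketch; cl/cachedlv is the name on my box, and the
> cache_dirty_blocks report field needs a reasonably recent lvm2, CentOS 7.3
> should have it):
>
> # switch to the cleaner policy so dm-cache writes all dirty blocks back
> lvchange --cachepolicy cleaner cl/cachedlv
>
> # wait until no dirty cache blocks are left (Cpy%Sync in `lvs -a` should
> # settle as well)
> while true; do
>     dirty=$(lvs --noheadings -o cache_dirty_blocks cl/cachedlv | tr -d ' ')
>     echo "dirty cache blocks left: ${dirty}"
>     [ "${dirty}" = "0" ] && break
>     sleep 5
> done
>
> # switch back to smq so writeback caching is used again
> lvchange --cachepolicy smq cl/cachedlv
>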
> It feels like the LVM cache has a threshold (about 20 GB of dirty data)
> after which it stops allowing the qemu-kvm process to use writeback caching
> (usage from inside the host itself does not seem to have this limitation).
> It starts flushing, but only up to a certain point. After a few MB of data
> it is right back in the slow spot again. The only solution is waiting for a
> long time (independent of Cpy%Sync) or sometimes changing the cachepolicy
> and forcing a flush. For me this prevents production use of this system. But
> it's so promising, so I hope somebody can help.
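>
> To see what state the cache is actually in when the slowdown hits, something
> like this can be watched on the host while the fio test below runs (a sketch
> with my VG/LV names; the cache_* report fields need a reasonably recent
> lvm2, CentOS 7.3 should have them):
>
> # how much of the cache is used and how much of it is still dirty
> watch -n 2 'lvs -a -o lv_name,copy_percent,cache_dirty_blocks,cache_used_blocks,cache_total_blocks cl'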
>
> Desired state: doing the fio test (described in the REPRODUCE section)
> repeatedly should stay fast until the cached LV is more or less full. If
> resyncing back to disk causes this degradation, it should flush fully within
> a reasonable time and give the opportunity to write fast again up to a given
> threshold. Right now it seems like a one-time-use cache that only uses a
> fraction of the SSD and is useless/very unstable afterwards.
>
> REPRODUCE
> 1. Install the newest CentOS 7 on software RAID 1 HDDs with LVM. Keep a lot
> of free space for the LVM cache (no /home)! So make the VG as large as
> possible during anaconda partitioning.
>
> 2. once installed and booted into the system, install qemu-kvm
>
> yum install -y centos-release-qemu-ev
> yum install -y qemu-kvm-ev libvirt bridge-utils net-tools
> # disable ksm (probably not important / needed)
> systemctl disable ksm
> systemctl disable ksmtuned
>
> 3. create LVM cache
>
> # set some variables and create a raid1 array with the two SSDs
>
> VGBASE= && ssddevice1=/dev/sdX1 && ssddevice2=/dev/sdX1 &&
> hddraiddevice=/dev/mdXXX && ssdraiddevice=/dev/mdXXX &&
> mdadm --create --verbose ${ssdraiddevice} --level=mirror --bitmap=none \
>   --raid-devices=2 ${ssddevice1} ${ssddevice2}
>
> # create PV and extend VG
>
>  pvcreate ${ssdraiddevice} && vgextend ${VGBASE} ${ssdraiddevice}
>
> # create slow LV on HDDs (use max space left if you want)
>
>  pvdisplay ${hddraiddevice}
>  lvcreate -lXXXX -n cachedlv ${VGBASE} ${hddraiddevice}
>
> # create the meta and data LVs: for testing purposes I keep about 20G of the
> SSD for an uncached LV, to rule out that the SSD itself is the problem.
>
> lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}
>
> #The rest can be used as cachedata/metadata.
>
>  pvdisplay ${ssdraiddevice}
> # about 1/1000 of the space you have left on the SSD for the meta (minimum
> of 4)
>  lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}
> # the rest can be used as cachedata
>  lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}
>
> # convert/combine pools so cachedlv is actually cached
>
>  lvconvert --type cache-pool --cachemode writeback \
>    --poolmetadata ${VGBASE}/cachemeta ${VGBASE}/cachedata
>
>  lvconvert --type cache --cachepool ${VGBASE}/cachedata ${VGBASE}/cachedlv
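>
> # extra sanity check (my suggestion, not required): verify that the cache
> # really ended up in writeback mode with the smq policy. cl-cachedlv is the
> # dm name on my box, in general it is <vg>-<lv>.
>  dmsetup status cl-cachedlv
> # the feature args on that line should mention writeback (not writethrough)
> # and smq should appear as the policy; the field layout is documented in the
> # kernel's Documentation/device-mapper/cache.txt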
>
>
> # my system now looks like (VG is called cl, default of the installer)
> [root@localhost ~]# lvs -a
>   LV                VG Attr       LSize   Pool        Origin
>   [cachedata]       cl Cwi---C---  97.66g
>   [cachedata_cdata] cl Cwi-ao----  97.66g
>   [cachedata_cmeta] cl ewi-ao---- 100.00m
>   cachedlv          cl Cwi-aoC---   1.75t [cachedata] [cachedlv_corig]
>   [cachedlv_corig]  cl owi-aoC---   1.75t
>   [lvol0_pmspare]   cl ewi------- 100.00m
>   root              cl -wi-ao----  46.56g
>   swap              cl -wi-ao----  14.96g
>   testssd           cl -wi-a-----  45.47g
>
> [root@localhost ~]# lsblk
>
> NAME                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> sdd                        8:48   0   163G  0 disk
> └─sdd1                     8:49   0   163G  0 part
>   └─md128                  9:128  0 162.9G  0 raid1
>     ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
>     │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
>     ├─cl-testssd         253:2    0  45.5G  0 lvm
>     └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
>       └─cl-cachedlv      253:6    0   1.8T  0 lvm
> sdb                        8:16   0   1.8T  0 disk
> ├─sdb2                     8:18   0   1.8T  0 part
> │ └─md127                  9:127  0   1.8T  0 raid1
> │   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
> │   ├─cl-root            253:0    0  46.6G  0 lvm   /
> │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
> │     └─cl-cachedlv      253:6    0   1.8T  0 lvm
> └─sdb1                     8:17   0   954M  0 part
>   └─md126                  9:126  0   954M  0 raid1 /boot
> sdc                        8:32   0   163G  0 disk
> └─sdc1                     8:33   0   163G  0 part
>   └─md128                  9:128  0 162.9G  0 raid1
>     ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
>     │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
>     ├─cl-testssd         253:2    0  45.5G  0 lvm
>     └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
>       └─cl-cachedlv      253:6    0   1.8T  0 lvm
> sda                        8:0    0   1.8T  0 disk
> ├─sda2                     8:2    0   1.8T  0 part
> │ └─md127                  9:127  0   1.8T  0 raid1
> │   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
> │   ├─cl-root            253:0    0  46.6G  0 lvm   /
> │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
> │     └─cl-cachedlv      253:6    0   1.8T  0 lvm
> └─sda1                     8:1    0   954M  0 part
>   └─md126                  9:126  0   954M  0 raid1 /boot
>
> # now create vm
> wget http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso -P /home/
> DISK=/dev/mapper/XXXX-cachedlv
>
> # watch out, my net setup uses a custom bridge/network in the following
> # command. Please replace it with what you normally use.
> virt-install -n CentOS1 -r 12000 --os-variant=centos6.7 --vcpus 7 \
>   --disk path=${DISK},cache=none,bus=virtio \
>   --network bridge=pubbr,model=virtio \
>   --cdrom /home/CentOS-6.9-x86_64-minimal.iso \
>   --graphics vnc,port=5998,listen=0.0.0.0 --cpu host
>
> # now connect with client PC to qemu
> virt-viewer --connect=qemu+ssh://root@192.168.0.XXX/system --name CentOS1
>
> And install everything on the single vda disk with LVM (I use the defaults
> in anaconda, but remove the large /home to prevent the SSD from being
> overused).
>
> After install and reboot, log in to the VM and
>
> yum install epel-release -y && yum install screen fio htop -y
>
> and then run disk test:
>
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
>     --name=test --filename=test --bs=4k --iodepth=64 --size=4G \
>     --readwrite=randrw --rwmixread=75
>
> then *keep repeating* but *change the filename* attribute so it does not
> use the same blocks over and over again.
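>
> A loop like this is an easy way to do that (just a sketch; every run writes
> a new 4G test file, so keep an eye on the free space in the VM so the FS
> does not fill up):
>
> for i in $(seq 1 10); do
>     fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
>         --name=test${i} --filename=test${i} --bs=4k --iodepth=64 --size=4G \
>         --readwrite=randrw --rwmixread=75
> done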
>
> In the beginning the performance is great! Wow, very impressive: 150 MB/s of
> 4k random r/w (close to bare metal, about 20% - 30% loss). But after a few
> runs (usually about 4 or 5, always changing the filename but not overfilling
> the FS), it drops to about 10 MB/s.
>
> normal/in the beginning
>
>  read : io=3073.2MB, bw=183085KB/s, iops=45771, runt= 17188msec
>   write: io=1022.1MB, bw=60940KB/s, iops=15235, runt= 17188msec
>
> but then
>
>  read : io=3073.2MB, bw=183085KB/s, iops=2904, runt= 17188msec
>   write: io=1022.1MB, bw=60940KB/s, iops=1751, runt= 17188msec
>
> or even worse, up to the point that it is actually the HDDs that are being
> written to (about 500 IOPS).
>
> P.S. when a test is/was slow, that means the file is on the HDDs. So even
> after fixing the problem (sometimes just by waiting), that specific file
> will keep being slow when redoing the test until it is promoted to the LVM
> cache (which takes a lot of reads, I think). And once it is on the SSD it
> sometimes keeps being fast, although a new test file will be slow. So I
> really recommend changing the test file every time when trying to see
> whether the speed has changed.
>
> --
> Met vriendelijke groet,
>
> Richard Landsman
> http://rimote.nl
>
> T: +31 (0)50 - 763 04 07
> (Mon-Fri 9:00 to 18:00)
>
> 24/7 in case of outages:
> +31 (0)6 - 4388 7949
> @RimoteSaS (Twitter service notices/security updates)
>
>
> _______________________________________________
> CentOS-virt mailing list
> CentOS-virt at centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt
>
>


-- 

SANDRO BONAZZOLA

ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D

Red Hat EMEA <https://www.redhat.com/>
<https://red.ht/sig>
TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>