[CentOS-virt] lvm cache + qemu-kvm stops working after about 20GB of writes

Thu Apr 20 10:32:13 UTC 2017
Richard Landsman - Rimote <richard at rimote.nl>

Hello everyone,

Has anybody had the chance to test out this setup and reproduce the 
problem? I assume it is a configuration that is used often these days, 
so a solution would benefit a lot of users. If I can be of any 
assistance, please contact me.

-- 
Met vriendelijke groet,

Richard Landsman
http://rimote.nl

T: +31 (0)50 - 763 04 07
(ma-vr 9:00 tot 18:00)

24/7 bij storingen:
+31 (0)6 - 4388 7949
@RimoteSaS (Twitter Serviceberichten/security updates)

On 04/10/2017 10:08 AM, Sandro Bonazzola wrote:
> Adding Paolo and Miroslav.
>
> On Sat, Apr 8, 2017 at 4:49 PM, Richard Landsman - Rimote 
> <richard at rimote.nl <mailto:richard at rimote.nl>> wrote:
>
>     Hello,
>
>     I would really appreciate some help/guidance with this problem.
>     First of all, sorry for the long message. I would file a bug, but I
>     do not know whether the fault lies with me, dm-cache, qemu, or
>     (probably) a combination of both. And I can imagine some of you have
>     this setup up and running without problems (or maybe you think it
>     works, just like I did, but it does not):
>
>     PROBLEM
>     LVM cache writeback stops working as expected after a while with a
>     qemu-kvm VM. A 100% working setup would be the holy grail in my
>     opinion... and I must say the performance of KVM/qemu is great in
>     the beginning.
>
>     DESCRIPTION
>
>     When using software RAID 1 (2x HDD) + software RAID 1 (2x SSD) and
>     creating a cached LV out of them, the VM initially performs great
>     (at least 40,000 IOPS on 4k random read/write)! But after a while
>     (and a lot of random I/O, ca. 10-20 GB) it effectively turns into a
>     writethrough cache, although there is plenty of space left on the
>     cached LV.
>
>
>     When working as expected, on the KVM host all writes go to the SSDs:
>
>     iostat -x -m 2
>
>     Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>     sda               0.00   324.50    0.00   22.00     0.00    14.94  1390.57     1.90   86.39    0.00   86.39   5.32  11.70
>     sdb               0.00   324.50    0.00   22.00     0.00    14.94  1390.57     2.03   92.45    0.00   92.45   5.48  12.05
>     sdc               0.00  3932.00    0.00 2191.50     0.00   270.07   252.39    37.83   17.55    0.00   17.55   0.36  78.05
>     sdd               0.00  3932.00    0.00 2197.50     0.00   271.01   252.57    38.96   18.14    0.00   18.14   0.36  78.95
>
>
>     When not working as expected, on the KVM host all writes go through
>     the SSDs on to the HDDs (effectively disabling writeback, so it
>     becomes a writethrough):
>
>     Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>     sda               0.00     7.00  234.50  173.50     0.92     1.95    14.38    29.27   71.27  111.89   16.37   2.45 100.00
>     sdb               0.00     3.50  212.00  177.50     0.83     1.95    14.60    35.58   91.24  143.00   29.42   2.57 100.10
>     sdc               2.50     0.00  566.00  199.00     2.69     0.78     9.28     0.08    0.11    0.13    0.04   0.10   7.70
>     sdd               1.50     0.00   76.00  199.00     0.65     0.78    10.66     0.02    0.07    0.16    0.04   0.07   1.85
>
>
>     Stuff I've checked/tried:
>
>     - The data in the cached LV has at that point not even exceeded half
>     of the cache space, so this should not happen. It even happens when
>     only 20% of the cache data is used.
>     - It seems to be triggered most of the time when the Cpy%Sync column
>     of `lvs -a` is around 30%. But this is not always the case!
>     - Changing the cache policy from smq to cleaner, waiting for the
>     flush to finish (check with lvs -a) and then switching back to smq
>     seems to help /sometimes/! But not always... (a loop version of this
>     is sketched below, after this list)
>
>     lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv
>
>     lvs -a
>
>     lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv
>
>     - *when mounting the LV inside the host this does not seem to
>     happen!!* So it looks like a qemu-kvm / dm-cache combination issue.
>     The only difference is that inside the host I run mkfs directly
>     instead of using LVM inside the VM (so it could also be an
>     LVM-inside-VM on top of LVM-on-the-KVM-host problem? probably a
>     small chance, because for the first 10-20 GB it works great!)
>
>     - Tried disabling SELinux, upgrading to the newest kernels (elrepo
>     ml and lt), played around with the dirty-page settings such as
>     /proc/sys/vm/dirty_writeback_centisecs,
>     /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_ratio,
>     the migration threshold of dmsetup, and other probably unimportant
>     knobs like vm.dirty_bytes.
>
>     - When in the "slow state", the system's kworkers use IO excessively
>     (10-20 MB per kworker process). This seems to be the writeback
>     process (Cpy%Sync) because the cache wants to flush to the HDDs. But
>     the strange thing is that after a complete sync (0% left), the disk
>     may become slow again after just a few MB of data. A reboot
>     sometimes helps.
>
>     - Have tried iothreads, virtio-scsi, the vcpu driver setting on the
>     virtio-scsi controller, cache settings, disk schedulers, etc.
>     Nothing helped.
>
>     - The new Samsung 950 PRO SSDs have HPA enabled (30%!!); I have an
>     AMD FX(tm)-8350 and 16 GB of RAM.
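>
>     To be explicit about the cleaner/smq trick above, this is roughly
>     the loop I use (the VG/LV name is an example for my setup, and the
>     cache_* reporting fields need a reasonably recent lvm2):
>
>     LV=cl/cachedlv    # example VG/LV name, adjust to your setup
>     # switch to the cleaner policy so dm-cache writes back all dirty blocks
>     lvchange --cachepolicy cleaner ${LV}
>     # wait until no dirty blocks are left in the cache
>     while [ "$(lvs --noheadings -o cache_dirty_blocks ${LV} | tr -d ' ')" != "0" ]; do
>         lvs -o lv_name,cache_policy,cache_dirty_blocks,cache_used_blocks ${LV}
>         sleep 5
>     done
>     # switch back to the default smq policy
>     lvchange --cachepolicy smq ${LV}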
>
>     It feels like the LVM cache has a threshold (about 20 GB of dirty
>     data) after which it stops allowing the qemu-kvm process to use
>     writeback caching (the same usage from inside the host does not seem
>     to hit this limitation). It starts flushing, but only up to a
>     certain point. After a few MB of data it is right back in the slow
>     spot again. The only solution is waiting for a long time
>     (independent of Cpy%Sync) or sometimes changing the cache policy and
>     forcing a flush. This prevents me from using this system in
>     production. But it is so promising, so I hope somebody can help.
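>
>     For reference, this is roughly how I inspect the dm-cache state and
>     the migration threshold mentioned above (the device/LV names match
>     my setup shown below; the threshold value is only an illustration,
>     in 512-byte sectors):
>
>     # the dm-cache status line includes the used/dirty block counts and
>     # the current migration_threshold (see the dm-cache kernel docs)
>     dmsetup status cl-cachedlv
>
>     # raise the writeback/migration throttle through LVM
>     lvchange --cachesettings 'migration_threshold=16384' cl/cachedlv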
>
>     Desired state: doing the FIO test (described in the REPRODUCE
>     section) repeatedly should stay fast until the cached LV is more or
>     less full. If syncing back to disk causes this degradation, it
>     should flush fully within a reasonable time and then allow fast
>     writes again up to a given threshold. Right now it behaves like a
>     one-time-use cache that only uses a fraction of the SSD and is
>     useless/very unstable afterwards.
>
>     REPRODUCE
>     1. Install the newest CentOS 7 on software RAID 1 HDDs with LVM.
>     Keep a lot of space free for the LVM cache (no /home)! So make the
>     VG as large as possible during anaconda partitioning.
>
>     2. Once installed and booted into the system, install qemu-kvm:
>
>     yum install -y centos-release-qemu-ev
>     yum install -y qemu-kvm-ev libvirt bridge-utils net-tools
>     # disable ksm (probably not important / needed)
>     systemctl disable ksm
>     systemctl disable ksmtuned
>
>     3. Create the LVM cache
>
>     # set some variables and create a RAID 1 array with the two SSDs
>
>     VGBASE=
>     ssddevice1=/dev/sdX1
>     ssddevice2=/dev/sdX1
>     hddraiddevice=/dev/mdXXX
>     ssdraiddevice=/dev/mdXXX
>     mdadm --create --verbose ${ssdraiddevice} --level=mirror \
>         --bitmap=none --raid-devices=2 ${ssddevice1} ${ssddevice2}
>
>     # create PV and extend VG
>
>      pvcreate ${ssdraiddevice} && vgextend ${VGBASE} ${ssdraiddevice}
>
>     # create slow LV on HDDs (use max space left if you want)
>
>      pvdisplay ${hddraiddevice}
>      lvcreate -lXXXX -n cachedlv ${VGBASE} ${hddraiddevice}
>
>     # create the meta and data LVs: for testing purposes I keep about
>     20G of the SSD as an uncached LV, to rule out the SSD itself.
>
>     lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}
>
>     #The rest can be used as cachedata/metadata.
>
>      pvdisplay ${ssdraiddevice}
>     # about 1/1000 of the space you have left on the SSD for the meta
>     (minimum of 4)
>      lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}
>     # the rest can be used as cachedata
>      lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}
>
>     # convert/combine pools so cachedlv is actually cached
>
>      lvconvert --type cache-pool --cachemode writeback --poolmetadata
>     ${VGBASE}/cachemeta ${VGBASE}/cachedata
>
>      lvconvert --type cache --cachepool ${VGBASE}/cachedata
>     ${VGBASE}/cachedlv
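>
>     # optional: verify afterwards that the cache pool really is in
>     # writeback mode with the smq policy (these lvs reporting fields
>     # need a fairly recent lvm2)
>
>      lvs -a -o lv_name,segtype,cache_mode,cache_policy ${VGBASE}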
>
>
>     # my system now looks like this (VG is called cl, the installer default)
>     [root@localhost ~]# lvs -a
>       LV                VG Attr       LSize   Pool        Origin
>       [cachedata]       cl Cwi---C---  97.66g
>       [cachedata_cdata] cl Cwi-ao----  97.66g
>       [cachedata_cmeta] cl ewi-ao---- 100.00m
>       cachedlv          cl Cwi-aoC---   1.75t [cachedata] [cachedlv_corig]
>       [cachedlv_corig]  cl owi-aoC---   1.75t
>       [lvol0_pmspare]   cl ewi------- 100.00m
>       root              cl -wi-ao----  46.56g
>       swap              cl -wi-ao----  14.96g
>       testssd           cl -wi-a-----  45.47g
>
>     [root@localhost ~]# lsblk
>     NAME                     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
>     sdd                        8:48   0   163G  0 disk
>     └─sdd1                     8:49   0   163G  0 part
>       └─md128                  9:128  0 162.9G  0 raid1
>         ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
>         │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
>         ├─cl-testssd         253:2    0  45.5G  0 lvm
>         └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
>           └─cl-cachedlv      253:6    0   1.8T  0 lvm
>     sdb                        8:16   0   1.8T  0 disk
>     ├─sdb2                     8:18   0   1.8T  0 part
>     │ └─md127                  9:127  0   1.8T  0 raid1
>     │   ├─cl-swap            253:1    0    15G  0 lvm [SWAP]
>     │   ├─cl-root            253:0    0  46.6G  0 lvm   /
>     │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
>     │     └─cl-cachedlv      253:6    0   1.8T  0 lvm
>     └─sdb1                     8:17   0   954M  0 part
>       └─md126                  9:126  0   954M  0 raid1 /boot
>     sdc                        8:32   0   163G  0 disk
>     └─sdc1                     8:33   0   163G  0 part
>       └─md128                  9:128  0 162.9G  0 raid1
>         ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
>         │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
>         ├─cl-testssd         253:2    0  45.5G  0 lvm
>         └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
>           └─cl-cachedlv      253:6    0   1.8T  0 lvm
>     sda                        8:0    0   1.8T  0 disk
>     ├─sda2                     8:2    0   1.8T  0 part
>     │ └─md127                  9:127  0   1.8T  0 raid1
>     │   ├─cl-swap            253:1    0    15G  0 lvm [SWAP]
>     │   ├─cl-root            253:0    0  46.6G  0 lvm   /
>     │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
>     │     └─cl-cachedlv      253:6    0   1.8T  0 lvm
>     └─sda1                     8:1    0   954M  0 part
>       └─md126                  9:126  0   954M  0 raid1 /boot
>
>     # now create vm
>     wget http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso -P /home/
>     DISK=/dev/mapper/XXXX-cachedlv
>
>     # watch out: my network setup uses a custom bridge in the following
>     command. Please replace it with what you normally use.
>     virt-install -n CentOS1 -r 12000 --os-variant=centos6.7 --vcpus 7
>     --disk path=${DISK},cache=none,bus=virtio --network
>     bridge=pubbr,model=virtio --cdrom
>     /home/CentOS-6.9-x86_64-minimal.iso --graphics
>     vnc,port=5998,listen=0.0.0.0 --cpu host
>
>     # now connect with client PC to qemu
>     virt-viewer --connect=qemu+ssh://root@192.168.0.XXX/system --name
>     CentOS1
>
>     Then install everything on the single vda disk with LVM (I use the
>     defaults in anaconda, but remove the large /home to prevent the SSD
>     from being overused).
>
>     After install and reboot, log in to the VM and
>
>     yum install epel-release -y && yum install screen fio htop -y
>
>     and then run disk test:
>
>     fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>     --name=test --filename=test --bs=4k --iodepth=64 --size=4G
>     --readwrite=randrw --rwmixread=75
>
>     then *keep repeating* the test, but *change the filename* attribute
>     each time so it does not use the same blocks over and over again
>     (a small loop version is sketched below).
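>
>     Something like this is what I mean (the file names are arbitrary;
>     each run writes a fresh 4G test file so the same blocks are not hit
>     again):
>
>     for i in $(seq 1 10); do
>         fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
>             --name=test --filename=test${i} --bs=4k --iodepth=64 \
>             --size=4G --readwrite=randrw --rwmixread=75
>     done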
>
>     In the beginning the performance is great!! Wow, very impressive:
>     150 MB/s 4k random r/w (close to bare metal, about 20%-30% loss).
>     But after a few runs (usually about 4 or 5, always changing the
>     filename but not overfilling the FS), it drops to about 10 MB/s.
>
>     normal / in the beginning:
>
>      read : io=3073.2MB, bw=183085KB/s, iops=45771, runt= 17188msec
>      write: io=1022.1MB, bw=60940KB/s, iops=15235, runt= 17188msec
>
>     but then:
>
>      read : io=3073.2MB, bw=183085KB/s, iops=2904, runt= 17188msec
>      write: io=1022.1MB, bw=60940KB/s, iops=1751, runt= 17188msec
>
>     or even worse, up to the point where it is actually the HDDs that
>     are being written to (about 500 IOPS).
>
>     P.S. When a test is/was slow, that means the file is on the HDDs. So
>     even after the problem is fixed (sometimes just by waiting), that
>     specific file will keep being slow when redoing the test until it is
>     promoted into the LVM cache (which takes a lot of reads, I think).
>     And once it is on the SSD it sometimes stays fast, although a new
>     test file will be slow. So I really recommend changing the test file
>     every time when trying to see whether a change in speed has
>     occurred.
>
>     -- 
>     Met vriendelijke groet,
>
>     Richard Landsman
>     http://rimote.nl
>
>     T: +31 (0)50 - 763 04 07
>     (ma-vr 9:00 tot 18:00)
>
>     24/7 bij storingen:
>     +31 (0)6 - 4388 7949
>     @RimoteSaS (Twitter Serviceberichten/security updates)
>
>
>     _______________________________________________
>     CentOS-virt mailing list
>     CentOS-virt at centos.org <mailto:CentOS-virt at centos.org>
>     https://lists.centos.org/mailman/listinfo/centos-virt
>     <https://lists.centos.org/mailman/listinfo/centos-virt>
>
>
>
>
> -- 
>
> SANDRO BONAZZOLA
>
> ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D
>
> Red Hat EMEA <https://www.redhat.com/>
>
> <https://red.ht/sig> 	
> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
>
>
>
> _______________________________________________
> CentOS-virt mailing list
> CentOS-virt at centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt
