Has anybody had the chance to test out this setup and reproduce the problem?
I assumed it would be a setup that is used often these days, so a
solution would benefit a lot of users. If I can be of any assistance,
please contact me.
--
Kind regards,
Richard Landsman
http://rimote.nl
T: +31 (0)50 - 763 04 07
(Mon-Fri 9:00 to 18:00)
24/7 in case of outages:
+31 (0)6 - 4388 7949
@RimoteSaS (Twitter service messages/security updates)
On 04/10/2017 10:08 AM, Sandro Bonazzola wrote:
> Adding Paolo and Miroslav.
>
> On Sat, Apr 8, 2017 at 4:49 PM, Richard Landsman - Rimote
> <richard@rimote.nl> wrote:
>
> Hello,
>
> I would really appreciate some help/guidance with this problem.
> First of all, sorry for the long message. I would file a bug, but I
> do not know whether the fault lies with me, dm-cache, qemu or (probably) a
> combination of them. And I can imagine some of you have this setup
> up and running without problems (or maybe you think it works, just
> like I did, but it does not):
>
> PROBLEM
> LVM cache writeback stops working as expected after a while with a
> qemu-kvm VM. A 100% working setup would be the holy grail in my
> opinion... and I must say the performance of KVM/qemu is great in
> the beginning.
>
> DESCRIPTION
>
> When using software RAID 1 (2x HDD) + software RAID 1 (2x SSD) and
> creating a cached LV out of them, the VM initially performs great
> (at least 40,000 IOPS on 4k random read/write)! But then after a
> while (and a lot of random IO, ca. 10 - 20 G) it effectively turns
> into a writethrough cache, although there is plenty of space left on the
> cachedlv.
>
>
> When working as expected, on the KVM host all writes go to the SSDs:
>
> iostat -x -m 2
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sda        0.00  324.50    0.00    22.00   0.00   14.94  1390.57     1.90  86.39    0.00   86.39   5.32  11.70
> sdb        0.00  324.50    0.00    22.00   0.00   14.94  1390.57     2.03  92.45    0.00   92.45   5.48  12.05
> sdc        0.00 3932.00    0.00  2191.50   0.00  270.07   252.39    37.83  17.55    0.00   17.55   0.36  78.05
> sdd        0.00 3932.00    0.00  2197.50   0.00  271.01   252.57    38.96  18.14    0.00   18.14   0.36  78.95
>
>
> When not working as expected, on the KVM host all writes go through the
> SSDs straight on to the HDDs (effectively disabling writeback, so it
> behaves like a writethrough cache):
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sda        0.00    7.00  234.50   173.50   0.92    1.95    14.38    29.27  71.27  111.89   16.37   2.45 100.00
> sdb        0.00    3.50  212.00   177.50   0.83    1.95    14.60    35.58  91.24  143.00   29.42   2.57 100.10
> sdc        2.50    0.00  566.00   199.00   2.69    0.78     9.28     0.08   0.11    0.13    0.04   0.10   7.70
> sdd        1.50    0.00   76.00   199.00   0.65    0.78    10.66     0.02   0.07    0.16    0.04   0.07   1.85
>
>
> Stuff I've checked/tried:
>
> - The data in the cached LV has by then not even exceeded half of the
> space, so this should not happen. It even happens when only 20% of
> cachedata is used.
> - It seems to be triggered most of the time when the Cpy%Sync column
> of `lvs -a` is at about 30%. But this is not always the case!
> - Changing the cachepolicy from smq to cleaner, waiting (check flush
> progress with lvs -a) and then switching back to smq seems to help
> /sometimes/! But not always... (a scripted version follows the
> commands below)
>
> lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv
>
> lvs -a
>
> lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv
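>
> A minimal scripted sketch of that workaround (assuming the VG/LV names
> from the output further below and that your lvm2 build has the
> cache_dirty_blocks report field; adjust both as needed):
>
> VG=cl
> LV=cachedlv
> # switch to the cleaner policy so dm-cache writes all dirty blocks back
> lvchange --cachepolicy cleaner ${VG}/${LV}
> # wait until no dirty blocks are left in the cache
> while [ "$(lvs --noheadings -o cache_dirty_blocks ${VG}/${LV} | tr -d ' ')" != "0" ]; do
>     sleep 5
> done
> # switch back to the default smq policy
> lvchange --cachepolicy smq ${VG}/${LV}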
>
> - *When mounting the LV inside the host this does not seem to
> happen!!* So it looks like a qemu-kvm / dm-cache combination
> issue. The only difference is that inside the host I run mkfs directly
> instead of LVM inside the VM (so it could also be a problem with LVM
> inside the VM on top of LVM on the KVM host? Probably a small chance,
> because for the first 10 - 20 GB it works great!)
>
> - Tried disabling SELinux, upgrading to the newest kernels (elrepo ml
> and lt), played around with the dirty-writeback knobs such as
> /proc/sys/vm/dirty_writeback_centisecs,
> /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_ratio,
> and the migration threshold of dmsetup, and other probably
> unimportant settings like vm.dirty_bytes. (An example of inspecting
> these knobs follows below.)
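>
> For reference, a sketch of how these knobs can be read and changed (the
> migration_threshold value of 2048 sectors is only an illustrative
> number, not a recommendation):
>
> # current writeback-related sysctls
> sysctl vm.dirty_ratio vm.dirty_background_ratio \
>        vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
> # raw dm-cache counters of the cached LV
> dmsetup status cl-cachedlv
> # change the dm-cache migration threshold at runtime (in 512-byte sectors)
> dmsetup message cl-cachedlv 0 migration_threshold 2048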
>
> - When in the "slow state" the system's kworkers are excessively doing
> IO (10 - 20 MB per kworker process). This seems to be the
> writeback process (Cpy%Sync) because the cache wants to flush to
> the HDDs. But the strange thing is that after a good sync (0% left),
> the disk may become slow again after a few MBs of data. A reboot
> sometimes helps. (See the iotop sketch below.)
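>
> A quick way to confirm that the IO really comes from kworker writeback
> (standard iotop options: only active processes, batch mode, per
> process, accumulated totals):
>
> iotop -o -b -P -a -d 5 | grep -E 'kworker|DISK'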
>
> - Have tried iothreads, virtio-scsi, the vcpu driver setting on the
> virtio-scsi controller, cache settings, disk schedulers etc. Nothing
> helped.
>
> - The new Samsung 950 PRO SSDs have HPA enabled (30%!!); I have an
> AMD FX(tm)-8350 and 16 G of RAM.
>
> It feels like the lvm cache has a threshold (about 20 G of dirty
> data) after which it stops allowing the qemu-kvm process to
> use writeback caching (root usage inside the host does not seem to
> have this limitation). It starts flushing, but only to a certain
> point. After a few MBs of data it is right back in the slow spot
> again. The only solution is waiting for a long time (independent of
> Cpy%Sync) or sometimes changing the cachepolicy and forcing a flush.
> This prevents production use of this system for me. But it is so
> promising, so I hope somebody can help. (A monitoring one-liner
> follows below.)
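>
> To watch cache occupancy and dirty blocks while the fio test from the
> REPRODUCE section is running, something like this can be used (the
> cache_* report fields assume a reasonably recent lvm2; cl-cachedlv
> matches the lsblk output below):
>
> watch -n 2 'lvs -a -o lv_name,data_percent,copy_percent,cache_used_blocks,cache_dirty_blocks cl; dmsetup status cl-cachedlv'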
>
> Desired state: doing the fio test (described in the REPRODUCE
> section) repeatedly should keep being fast until the cachedlv is more
> or less full. If resyncing back to disk causes this degradation,
> it should actually flush fully within a reasonable time and
> give the opportunity to write fast again up to a given threshold. It
> now behaves like a one-time-use cache that only uses a fraction of
> the SSD and is useless/very unstable afterwards.
>
> REPRODUCE
> 1. Install the newest CentOS 7 on software RAID 1 HDDs with LVM. Keep
> a lot of space free for the LVM cache setup (no /home)! So make the VG
> as large as possible during anaconda partitioning.
>
> 2. Once installed and booted into the system, install qemu-kvm:
>
> yum install -y centos-release-qemu-ev
> yum install -y qemu-kvm-ev libvirt bridge-utils net-tools
> # disable ksm (probably not important / needed)
> systemctl disable ksm
> systemctl disable ksmtuned
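>
> If libvirtd is not already running after the install, enable and start
> it (standard systemctl commands, added here for completeness):
>
> systemctl enable libvirtd && systemctl start libvirtd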
>
> 3. Create the LVM cache
>
> # set some variables and create a RAID 1 array with the two SSDs
>
> VGBASE=                    # the VG created by the installer (cl by default)
> ssddevice1=/dev/sdX1
> ssddevice2=/dev/sdY1       # the second SSD partition (a different device)
> hddraiddevice=/dev/mdXXX
> ssdraiddevice=/dev/mdXXX
> mdadm --create --verbose ${ssdraiddevice} --level=mirror \
>       --bitmap=none --raid-devices=2 ${ssddevice1} ${ssddevice2}
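>
> Optionally, verify that the new SSD mirror is healthy before going on
> (plain mdadm commands, nothing specific to this setup):
>
> cat /proc/mdstat
> mdadm --detail ${ssdraiddevice}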
>
> # create PV and extend VG
>
> pvcreate ${ssdraiddevice} && vgextend ${VGBASE} ${ssdraiddevice}
>
> # create slow LV on HDDs (use max space left if you want)
>
> pvdisplay ${hddraiddevice}
> lvcreate -lXXXX -n cachedlv ${VGBASE} ${hddraiddevice}
>
> # create the meta and data LVs: for testing purposes I keep about 20G
> of the SSD for an uncached LV, to rule out the SSD itself as the cause.
>
> lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}
>
> # The rest can be used as cachedata/metadata.
>
> pvdisplay ${ssdraiddevice}
> # use about 1/1000 of the space you have left on the SSD for the
> # metadata (minimum of 4)
> lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}
> # the rest can be used as cachedata
> lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}
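>
> As a concrete illustration only (these sizes are an example; they
> roughly match the layout shown further below, with 100M of metadata
> and the remaining free SSD space as cache data):
>
> lvcreate -L 100M -n cachemeta ${VGBASE} ${ssdraiddevice}
> lvcreate -l 100%FREE -n cachedata ${VGBASE} ${ssdraiddevice}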
>
> # convert/combine the pools so cachedlv is actually cached
>
> lvconvert --type cache-pool --cachemode writeback \
>     --poolmetadata ${VGBASE}/cachemeta ${VGBASE}/cachedata
>
> lvconvert --type cache --cachepool ${VGBASE}/cachedata \
>     ${VGBASE}/cachedlv
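>
> To double-check that the cache pool really ended up in writeback mode,
> the dm-cache table can be inspected; the "writeback" feature flag
> should show up in the cache line (device name assumed to follow the
> usual <vg>-<lv> pattern):
>
> lvs -a -o lv_name,segtype,devices ${VGBASE}
> dmsetup table ${VGBASE}-cachedlv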
>
>
> # my system now looks like (VG is called cl, default of installer)
> [root@localhost ~]# lvs -a
>   LV                 VG Attr       LSize   Pool        Origin
>   [cachedata]        cl Cwi---C---  97.66g
>   [cachedata_cdata]  cl Cwi-ao----  97.66g
>   [cachedata_cmeta]  cl ewi-ao---- 100.00m
>   cachedlv           cl Cwi-aoC---   1.75t [cachedata] [cachedlv_corig]
>   [cachedlv_corig]   cl owi-aoC---   1.75t
>   [lvol0_pmspare]    cl ewi------- 100.00m
>   root               cl -wi-ao----  46.56g
>   swap               cl -wi-ao----  14.96g
>   testssd            cl -wi-a-----  45.47g
>
> [root@localhost ~]# lsblk
>
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sdd 8:48 0 163G 0 disk
> └─sdd1 8:49 0 163G 0 part
> └─md128 9:128 0 162.9G 0 raid1
> ├─cl-cachedata_cmeta 253:4 0 100M 0 lvm
> │ └─cl-cachedlv 253:6 0 1.8T 0 lvm
> ├─cl-testssd 253:2 0 45.5G 0 lvm
> └─cl-cachedata_cdata 253:3 0 97.7G 0 lvm
> └─cl-cachedlv 253:6 0 1.8T 0 lvm
> sdb 8:16 0 1.8T 0 disk
> ├─sdb2 8:18 0 1.8T 0 part
> │ └─md127 9:127 0 1.8T 0 raid1
> │ ├─cl-swap 253:1 0 15G 0 lvm [SWAP]
> │ ├─cl-root 253:0 0 46.6G 0 lvm /
> │ └─cl-cachedlv_corig 253:5 0 1.8T 0 lvm
> │ └─cl-cachedlv 253:6 0 1.8T 0 lvm
> └─sdb1 8:17 0 954M 0 part
> └─md126 9:126 0 954M 0 raid1 /boot
> sdc 8:32 0 163G 0 disk
> └─sdc1 8:33 0 163G 0 part
> └─md128 9:128 0 162.9G 0 raid1
> ├─cl-cachedata_cmeta 253:4 0 100M 0 lvm
> │ └─cl-cachedlv 253:6 0 1.8T 0 lvm
> ├─cl-testssd 253:2 0 45.5G 0 lvm
> └─cl-cachedata_cdata 253:3 0 97.7G 0 lvm
> └─cl-cachedlv 253:6 0 1.8T 0 lvm
> sda 8:0 0 1.8T 0 disk
> ├─sda2 8:2 0 1.8T 0 part
> │ └─md127 9:127 0 1.8T 0 raid1
> │ ├─cl-swap 253:1 0 15G 0 lvm [SWAP]
> │ ├─cl-root 253:0 0 46.6G 0 lvm /
> │ └─cl-cachedlv_corig 253:5 0 1.8T 0 lvm
> │ └─cl-cachedlv 253:6 0 1.8T 0 lvm
> └─sda1 8:1 0 954M 0 part
> └─md126 9:126 0 954M 0 raid1 /boot
>
> # now create the VM
> wget http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso -P /home/
> DISK=/dev/mapper/XXXX-cachedlv
>
> # watch out: my network setup uses a custom bridge/network in the
> # following command; replace it with what you normally use
> virt-install -n CentOS1 -r 12000 --os-variant=centos6.7 --vcpus 7
> --disk path=${DISK},cache=none,bus=virtio --network
> bridge=pubbr,model=virtio --cdrom
> /home/CentOS-6.9-x86_64-minimal.iso --graphics
> vnc,port=5998,listen=0.0.0.0 --cpu host
>
> # now connect with client PC to qemu
> virt-viewer --connect=qemu+ssh://root@192.168.0.XXX/system --name
> CentOS1
>
> And install everything on the single vda disk with LVM (I use the
> defaults in anaconda, but remove the large /home to prevent the SSD
> from being overused).
>
> After install and reboot, log in to the VM and
>
> yum install epel-release -y && yum install screen fio htop -y
>
> and then run the disk test:
>
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
>     --name=test --filename=test --bs=4k --iodepth=64 --size=4G \
>     --readwrite=randrw --rwmixread=75
>
> then *keep repeating* but *change the --filename* attribute so it
> does not use the same blocks over and over again, for example with
> the loop below.
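>
> A simple way of doing that (each iteration writes to a fresh file, so
> fio cannot reuse already-cached blocks; the count of 10 files is an
> arbitrary choice):
>
> for i in $(seq 1 10); do
>     fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
>         --name=test --filename=test${i} --bs=4k --iodepth=64 --size=4G \
>         --readwrite=randrw --rwmixread=75
> done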
>
> In the beginning the performance is great!! Wow, very impressive:
> 150 MB/s 4k random r/w (close to bare metal, about 20% - 30% loss).
> But after a few runs (usually about 4 or 5, always changing the
> filename but not overfilling the FS), it drops to about 10 MB/s.
>
> normal/in the beginning:
>
> read : io=3073.2MB, bw=183085KB/s, iops=45771, runt= 17188msec
> write: io=1022.1MB, bw=60940KB/s, iops=15235, runt= 17188msec
>
> but then:
>
> read : io=3073.2MB, bw=183085KB/s, iops=2904, runt= 17188msec
> write: io=1022.1MB, bw=60940KB/s, iops=1751, runt= 17188msec
>
> or even worse, up to the point that it is actually the HDDs that are
> being written to (about 500 IOPS).
>
> P.S. When a test is/was slow, that means the file is on the HDDs. So
> even after fixing the problem (sometimes just by waiting), that
> specific file will keep being slow when redoing the test until it is
> promoted to the lvm cache (which takes a lot of reads, I think). And
> once on the SSD it sometimes keeps being fast, although a new test
> file will be slow. So I really recommend changing the test file every
> time when trying to see whether a change in speed has occurred.
>
>
>
>
> --
> SANDRO BONAZZOLA
> ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D
> Red Hat EMEA
> https://www.redhat.com/
>
>
>
> _______________________________________________
> CentOS-virt mailing list
> CentOS-virt@centos.org
> https://lists.centos.org/mailman/listinfo/centos-virt