<div dir="ltr">Adding Paolo and Miroslav.</div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Apr 8, 2017 at 4:49 PM, Richard Landsman - Rimote <span dir="ltr">&lt;<a href="mailto:richard@rimote.nl" target="_blank">richard@rimote.nl</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div bgcolor="#FFFFFF" text="#000000">

    <p>Hello,</p>

    <p>I would really appreciate some help/guidance with this problem.
      First of all, sorry for the long message. I would file a bug, but
      I do not know whether the fault lies with me, dm-cache, qemu or
      (probably) a combination of these. I can imagine some of you have
      this setup up and running without problems (or maybe you think it
      works, just like I did, but it does not):</p>

    <p>PROBLEM<br>
      LVM cache writeback stops working as expected after a while with a
      qemu-kvm VM. A 100% working setup would be the holy grail in my
      opinion... and I must say the performance of KVM/qemu is great in
      the beginning.<br>

    </p>

    <p>DESCRIPTION</p>

    <p>When using software RAID 1 (2x HDD) + software RAID 1 (2x SSD) and
      creating a cached LV out of them, the VM initially performs great
      (at least 40,000 IOPS on 4k random read/write)! But after a while
      (and a lot of random IO, ca. 10 - 20 GB) it effectively turns into
      a writethrough cache, although there is plenty of space left on
      the cached LV.</p>

    <p><br>

    </p>

    <p>When working as expected, all writes on the KVM host go to the SSDs:</p>

    <p>iostat -x -m 2<br>

      <br>

      Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s

      avgrq-sz avgqu-sz   await r_await w_await  svctm  %util<br>

      sda               0.00   324.50    0.00   22.00     0.00    14.94 

      1390.57     1.90   86.39    0.00   86.39   5.32  11.70<br>

      sdb               0.00   324.50    0.00   22.00     0.00    14.94 

      1390.57     2.03   92.45    0.00   92.45   5.48  12.05<br>

      sdc               0.00  3932.00    0.00 <b>2191.50</b>     0.00  

      <b>270.07</b>   252.39    37.83   17.55    0.00   17.55   0.36 <b>

        78.05</b><br>

      sdd               0.00  3932.00    0.00 <b>2197.50 </b>    0.00  

      <b>271.01 </b>  252.57    38.96   18.14    0.00   18.14   0.36  <b>78.95</b><br>

    </p>

    <p><br>

    </p>

    <p>When not working as expected, all writes on the KVM host go through
      the SSDs on to the HDDs (effectively disabling writeback, so the
      cache behaves as writethrough):<br>

    </p>

    <p>Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s

      avgrq-sz avgqu-sz   await r_await w_await  svctm  %util<br>

      sda               0.00     7.00  234.50  <b>173.50 </b>   

      0.92    <b> 1.95</b>    14.38    29.27   71.27  111.89   16.37  

      2.45 <b>100.00</b><br>

      sdb               0.00     3.50  212.00  <b>177.50 </b>   

      0.83    <b> 1.95</b>    14.60    35.58   91.24  143.00   29.42  

      2.57<b> 100.10</b><br>

      sdc               2.50     0.00  566.00  <b>199.00 </b>   

      2.69     0.78     9.28     0.08    0.11    0.13    0.04   0.10   <b>7.70</b><br>

      sdd               1.50     0.00   76.00  <b>199.00</b>    

      0.65     0.78    10.66     0.02    0.07    0.16    0.04   0.07   <b>1.85</b></p>
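    <p>To spot the slow state without reading the full table, a small filter
      over the iostat output can help. This is only a sketch: the function
      name <code>busy_devices</code> and the default threshold of 90 are
      made up for illustration, and it assumes %util is the last column and
      that the disks are named sdX, as in the output above.</p>
<pre>
```shell
# Print the names of block devices whose %util is at or above a threshold.
# Assumes %util is the last column, as in the iostat -x output above.
busy_devices() {
  threshold="${1:-90}"
  awk -v t="$threshold" '$1 ~ /^sd[a-z]+$/ && $NF + 0 >= t { print $1 }'
}

# Usage (hypothetical):
#   iostat -x -m 2 2 | busy_devices 90
# Note: the first iostat report covers the time since boot, so a device
# is only printed for that report if its long-term average is that high.
```
</pre>
    <p>In the slow state shown above this would print sda and sdb (the HDDs);
      in the healthy state it would print nothing at a threshold of 90.</p>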

    <p><br>

    </p>

    <p>Stuff I've checked/tried:</p>

    <p>- The data in the cached LV has at that point not even exceeded half
      of the space, so this should not happen. It even happens when only
      20% of cachedata is used.<br>
      - It seems to be triggered most of the time when the Cpy%Sync column
      of `lvs -a` is about 30%. But this is not always the case!<br>
      - changing the cachepolicy from smq to cleaner, waiting (checking
      with lvs -a whether the flush is done) and then switching back to
      smq seems to help <i>sometimes</i>! But not always...<br>

    </p>

    <p>lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv</p>

    <p>lvs -a<br>

    </p>

    <p>lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv<br>

      <br>

      - <b>when mounting the LV inside the host this does not seem to
        happen!!</b> So it looks like a qemu-kvm / dm-cache combination
      issue. The only difference is that inside the host I run mkfs
      directly on the LV instead of using LVM inside the VM (so it could
      also be a problem of LVM inside the VM on top of LVM on the KVM
      host; probably a small chance, because the first 10 - 20 GB work
      great!)<br>

    </p>
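    <p>The policy switch described above (cleaner, wait for the flush, back
      to smq) could be scripted roughly as follows. This is a sketch, not a
      tested procedure: the function names are made up, the retry count and
      delay are arbitrary, and it assumes an lvm2 version whose
      <code>lvs</code> supports the <code>cache_dirty_blocks</code> report
      field.</p>
<pre>
```shell
# Print the number of dirty cache blocks for a cached LV (e.g. cl/cachedlv).
# Assumes an lvm2 version that supports the cache_dirty_blocks report field.
get_dirty_blocks() {
  lvs --noheadings -o cache_dirty_blocks "$1" | tr -d ' '
}

# Poll until the dirty count reaches 0; give up after a number of tries.
wait_clean() {
  lv="$1"; tries="${2:-120}"; delay="${3:-5}"
  while [ "$tries" -gt 0 ]; do
    if [ "$(get_dirty_blocks "$lv")" = "0" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}

# Switch to the cleaner policy, wait for the flush, then restore smq.
# Returns non-zero if the cache was still dirty when the wait gave up.
flush_cache() {
  lvchange --cachepolicy cleaner "$1" || return 1
  rc=0
  wait_clean "$1" "${2:-120}" "${3:-5}" || rc=$?
  lvchange --cachepolicy smq "$1" || return 1
  return "$rc"
}

# Usage (hypothetical VG/LV name): flush_cache cl/cachedlv
```
</pre>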

    <p>- tried disabling SELinux, upgrading to the newest kernels (elrepo ml
      and lt), played around with dirty-cache settings like
      /proc/sys/vm/dirty_writeback_centisecs,
      /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_ratio,
      the migration threshold via dmsetup, and other probably
      unimportant settings like vm.dirty_bytes<br>

    </p>

    <p>- when in the "slow state" the system's kworkers are using IO
      excessively (10 - 20 MB per kworker process). This seems to be the
      writeback process (Cpy%Sync), because the cache wants to flush to
      the HDDs. But the strange thing is that after a good sync (0% left),
      the disk may become slow again after a few MB of data. A reboot
      sometimes helps.</p>
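    <p>To watch the dirty-block count directly rather than via Cpy%Sync, the
      dm-cache status line can be parsed. The sketch below assumes the field
      order given in the kernel's dm-cache documentation (dirty blocks as the
      14th field when a single device is queried); the function name
      <code>cache_usage</code> and the device name in the usage line are
      hypothetical.</p>
<pre>
```shell
# Print "used total dirty" cache blocks from one `dmsetup status` line of a
# dm-cache target. Assumed field order (kernel dm-cache documentation):
#   start len "cache" md_block_size md_used/md_total block_size used/total
#   read_hits read_misses write_hits write_misses demotions promotions dirty
cache_usage() {
  awk '$3 == "cache" {
    split($7, blocks, "/")
    print blocks[1], blocks[2], $14
  }'
}

# Usage (hypothetical device name):
#   dmsetup status cl-cachedlv | cache_usage
```
</pre>
    <p>Running this in a loop during the fio runs would show whether the
      slowdown coincides with a particular number of dirty blocks.</p>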

    <p>- have tried iothreads, virtio-scsi, the vcpu driver setting on the
      virtio-scsi controller, cache settings, disk schedulers etc. Nothing
      helped.</p>

    <p>- the new Samsung 950 PRO SSDs have HPA enabled (30%!!); I have an
      AMD FX(tm)-8350 and 16 GB RAM<br>

    </p>

    <p>It feels like the LVM cache has a threshold (about 20 GB of dirty
      data) beyond which it stops allowing the qemu-kvm process to use
      writeback caching (the root user inside the host does not seem to
      have this limitation). It starts flushing, but only up to a certain
      point. After a few MB of data it is right back in the slow state
      again. The only solution is waiting for a long time (independent of
      Cpy%Sync) or sometimes changing the cachepolicy and forcing a flush.
      For me this prevents production use of this system. But it is so
      promising, so I hope somebody can help.<br>

    </p>

    <p>Desired state: running the FIO test (described in the REPRODUCE
      section) repeatedly should stay fast until the cached LV is more or
      less full. If syncing back to disk causes this degradation, the
      cache should actually flush fully within a reasonable time and then
      allow fast writes again up to a given threshold. It now seems like a
      one-time-use cache that only uses a fraction of the SSD and is
      useless/very unstable afterwards.</p>

    <p>REPRODUCE<br>

      1. Install newest CentOS 7 on software RAID 1 HDDs with LVM. Keep

      a lot of space for the LVM cache (no /home)! So make the VG as

      large as possible during anaconda partitioning. <br>

    </p>

    <p>2. once installed and booted in to the system, install qemu-kvm<br>

    </p>

    <p>yum install -y centos-release-qemu-ev<br>

      yum install -y qemu-kvm-ev libvirt bridge-utils net-tools<br>

      # disable ksm (probably not important / needed)<br>

      systemctl disable ksm<br>

      systemctl disable ksmtuned<br>

    </p>

    <p>3. create LVM cache</p>

    <p>#set some variables and create a raid1 array with the two SSDs<br>

    </p>

    <p>VGBASE= && ssddevice1=/dev/sdX1 &&
      ssddevice2=/dev/sdY1 && hddraiddevice=/dev/mdXXX
      && ssdraiddevice=/dev/mdYYY && mdadm --create
      --verbose ${ssdraiddevice} --level=mirror --bitmap=none
      --raid-devices=2 ${ssddevice1} ${ssddevice2}</p>

    <p># create PV and extend VG<br>

    </p>

    <p> pvcreate ${ssdraiddevice} && vgextend ${VGBASE}

      ${ssdraiddevice}<br>

    </p>

    <p># create slow LV on HDDs (use max space left if you want)<br>

    </p>

    <p> pvdisplay ${hddraiddevice}<br>

       lvcreate -lXXXX -n cachedlv ${VGBASE} ${hddraiddevice}</p>

    <p># create the meta and data: for testing purposes I keep about 20G
      of the SSD for an uncached LV, to rule out the SSD itself as the cause.<br>

    </p>

    <p>lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}</p>

    <p>#The rest can be used as cachedata/metadata.<br>

    </p>

    <p> pvdisplay ${ssdraiddevice}<br>

      # about 1/1000 of the space you have left on the SSD for the meta

      (minimum of 4)<br>

       lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}<br>

      # the rest can be used as cachedata      <br>

       lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}</p>
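    <p>The sizing rule in the comments above can be written out as a small
      calculation. This is a sketch under assumptions: the function name
      <code>cache_extents</code> is made up, the "minimum of 4" is read as
      4 extents, and the input is the number of free extents reported by
      pvdisplay after testssd has been created.</p>
<pre>
```shell
# Print "meta data" extent counts for a given number of free SSD extents,
# using the 1/1000 rule from the comment above with a floor of 4.
# Assumption: the "minimum of 4" is taken to mean 4 extents.
cache_extents() {
  free="$1"
  meta=$(( (free + 999) / 1000 ))   # round up to whole extents
  if [ "$meta" -lt 4 ]; then
    meta=4
  fi
  echo "$meta $(( free - meta ))"
}

# Usage: cache_extents 25000   (prints the -l values for cachemeta and cachedata)
```
</pre>
    <p>Note that the listing further down also shows a [lvol0_pmspare] LV of
      the same size as the metadata LV, so it may be necessary to leave a
      few extents unallocated instead of giving all the rest to cachedata.</p>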

    <p># convert/combine pools so cachedlv is actually cached<br>

    </p>

    <p> lvconvert --type cache-pool --cachemode writeback --poolmetadata

      ${VGBASE}/cachemeta ${VGBASE}/cachedata</p>

    <p> lvconvert --type cache --cachepool ${VGBASE}/cachedata

      ${VGBASE}/cachedlv</p>

    <p><br>

    </p>

    # my system now looks like (VG is called cl, default of installer)<br>

    <font size="-1">[root@localhost ~]# lvs -a<br>

        LV                VG Attr       LSize   Pool       

      Origin           <br>

        [cachedata]       cl Cwi---C--- 

      97.66g                        <wbr>              <br>

      <b>  [cachedata_cdata] cl Cwi-ao---- 

97.66g                        <wbr>                              <wbr>              

      </b><b><br>

      </b><b>  [cachedata_cmeta] cl ewi-ao---- 100.00m     </b>                  <wbr>                              <wbr>               

      <br>

      <b>  cachedlv          cl Cwi-aoC---   1.75t [cachedata]

        [cachedlv_corig]     </b><br>

        [cachedlv_corig]  cl owi-aoC---  

1.75t                         <wbr>                              <wbr>             

      <br>

        [lvol0_pmspare]   cl ewi-------

100.00m                       <wbr>                              <wbr>               

      <br>

        root              cl -wi-ao---- 

46.56g                        <wbr>                              <wbr>              

      <br>

        swap              cl -wi-ao---- 

14.96g                        <wbr>                              <wbr>              

      <br>

      <b>  testssd           cl -wi-a-----  45.47g<br>

        <br>

      </b>[root@localhost ~]#lsblk<b><br>

      </b><br>

      NAME                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT<br>

      sdd                        8:48   0   163G  0 disk  <br>

      └─sdd1                     8:49   0   163G  0 part  <br>

        └─md128                  9:128  0 162.9G  0 raid1 <br>

          ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm   <br>

          │ └─cl-cachedlv      253:6    0   1.8T  0 lvm   <br>

          ├─cl-testssd         253:2    0  45.5G  0 lvm   <br>

          └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm   <br>

            └─cl-cachedlv      253:6    0   1.8T  0 lvm   <br>

      sdb                        8:16   0   1.8T  0 disk  <br>

      ├─sdb2                     8:18   0   1.8T  0 part  <br>

      │ └─md127                  9:127  0   1.8T  0 raid1 <br>

      │   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]<br>

      │   ├─cl-root            253:0    0  46.6G  0 lvm   /<br>

      │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm   <br>

      │     └─cl-cachedlv      253:6    0   1.8T  0 lvm   <br>

      └─sdb1                     8:17   0   954M  0 part  <br>

        └─md126                  9:126  0   954M  0 raid1 /boot<br>

      sdc                        8:32   0   163G  0 disk  <br>

      └─sdc1                     8:33   0   163G  0 part  <br>

        └─md128                  9:128  0 162.9G  0 raid1 <br>

          ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm   <br>

          │ └─cl-cachedlv      253:6    0   1.8T  0 lvm   <br>

          ├─cl-testssd         253:2    0  45.5G  0 lvm   <br>

          └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm   <br>

            └─cl-cachedlv      253:6    0   1.8T  0 lvm   <br>

      sda                        8:0    0   1.8T  0 disk  <br>

      ├─sda2                     8:2    0   1.8T  0 part  <br>

      │ └─md127                  9:127  0   1.8T  0 raid1 <br>

      │   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]<br>

      │   ├─cl-root            253:0    0  46.6G  0 lvm   /<br>

      │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm   <br>

      │     └─cl-cachedlv      253:6    0   1.8T  0 lvm   <br>

      └─sda1                     8:1    0   954M  0 part  <br>

        └─md126                  9:126  0   954M  0 raid1 /boot</font><br>

    <br>

    # now create vm<br>

    wget

<a class="m_-2193999298270607371moz-txt-link-freetext" href="http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso" target="_blank">http://ftp.tudelft.nl/centos.<wbr>org/6/isos/x86_64/CentOS-6.9-<wbr>x86_64-minimal.iso</a>

    -P /home/<br>

    DISK=/dev/mapper/XXXX-cachedlv<br>

    <br>

    # watch out, my netsetup uses a custom bridge/network in the

    following command. Please replace with what you normally use.<br>

    virt-install -n CentOS1 -r 12000 --os-variant=centos6.7 --vcpus 7

    --disk path=${DISK},cache=none,bus=<wbr>virtio --network

    bridge=pubbr,model=virtio --cdrom

    /home/CentOS-6.9-x86_64-<wbr>minimal.iso --graphics

    vnc,port=5998,listen=0.0.0.0 --cpu host <br>

    <br>

    # now connect with client PC to qemu<br>

    virt-viewer --connect=qemu+ssh://root@192.<wbr>168.0.XXX/system --name

    CentOS1<br>

    <br>

    And install everything on the single vda disk with LVM (I use the
    defaults in anaconda, but remove the large /home to prevent the SSD
    being overused). <br>

    <br>

    After install and reboot log in to VM and<br>

    <br>

    yum install epel-release -y && yum install screen fio htop

    -y<br>

    <br>

    and then run disk test:<br>

    <br>

    fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1

    --name=test <b>--filename=test</b> --bs=4k --iodepth=64 --size=4G

    --readwrite=randrw --rwmixread=75<br>

    <br>

    then <b>keep repeating </b>but <b>change the filename</b>

    attribute so it does not use the same blocks over and over again. <br>
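    <p>Repeating the test with a new filename each time can be wrapped in a
      small loop. This is a sketch: the function names, the
      <code>FIO</code> override variable and the default of 5 runs are made
      up for illustration; the fio options are the ones from the command
      above. Each run creates a new 4G file, so the filesystem needs enough
      free space for all runs.</p>
<pre>
```shell
# Run the fio test from above once, writing to a distinct file per run.
# FIO can be set to another command (e.g. echo) for a dry run.
run_fio() {
  "${FIO:-fio}" --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test --filename="test$1" --bs=4k --iodepth=64 --size=4G \
    --readwrite=randrw --rwmixread=75
}

# Run the test several times in a row (default 5): test1, test2, ...
run_fio_series() {
  runs="${1:-5}"
  i=1
  while [ "$i" -le "$runs" ]; do
    run_fio "$i"
    i=$((i + 1))
  done
}

# Usage: run_fio_series 6
```
</pre>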

    <br>

    In the beginning the performance is great!! Wow, very impressive:
    150 MB/s 4k random r/w (close to bare metal, about 20% - 30% loss).
    But after a few runs (usually about 4 or 5, always changing the
    filename, but not overfilling the FS), it drops to about 10 MB/s.

    <br>

    <br>

    normal/in the beginning<br>

    <br>

     read : io=3073.2MB, bw=183085KB/s, <b>iops=45771</b> , runt=

    17188msec<br>

      write: io=1022.1MB, bw=60940KB/s, <b>iops=15235</b> , runt=

    17188msec<br>

    <br>

    but then<br>

    <br>

     read : io=3073.2MB, bw=183085KB/s, <b>iops=</b><b>2904</b> , runt=

    17188msec<br>

      write: io=1022.1MB, bw=60940KB/s, <b>iops=1751</b> , runt=

    17188msec<br>

    <br>

    or even worse, to the point where writes actually go to the HDD
    (about 500 IOPS).<br>
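    <p>To compare runs automatically, the iops values can be pulled out of
      the fio result lines shown above. This is a sketch: it assumes the
      older fio output format with lowercase <code>iops=</code> as in this
      post, and the function names and the 25% cut-off are arbitrary choices
      for illustration.</p>
<pre>
```shell
# Print the iops value from each fio result line read on stdin.
# Assumes the "iops=12345" format of the fio output quoted above.
extract_iops() {
  sed -n 's/.*iops=\([0-9][0-9]*\).*/\1/p'
}

# Succeed if the current value is below a percentage of the baseline.
# The 25% default is an arbitrary, hypothetical cut-off.
is_degraded() {
  baseline="$1"; current="$2"; pct="${3:-25}"
  [ $(( current * 100 )) -lt $(( baseline * pct )) ]
}

# Usage (hypothetical log file names):
#   base=$(grep 'read :' run1.log | extract_iops)
#   cur=$(grep 'read :' run5.log | extract_iops)
#   is_degraded "$base" "$cur" && echo "cache looks degraded"
```
</pre>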

    <br>

    P.S. When a test is/was slow, that means the file is on the HDDs. So
    even after fixing the problem (sometimes just by waiting), that
    specific file will stay slow when redoing the test until it is
    promoted to the LVM cache (which takes a lot of reads, I think). And
    once on the SSD it sometimes stays fast, although a new test file will
    be slow. So I really recommend changing the test file every time when
    checking whether a change in speed has occurred. <br><span class="HOEnZb"><font color="#888888">

    <br>

    <pre class="m_-2193999298270607371moz-signature" cols="72">-- 

Kind regards,

Richard Landsman

<a class="m_-2193999298270607371moz-txt-link-freetext" href="http://rimote.nl" target="_blank">http://rimote.nl</a>

T: +31 (0)50 - 763 04 07

(Mon-Fri 9:00 to 18:00)

24/7 for outages:

+31 (0)6 - 4388 7949

@RimoteSaS (Twitter service notices/security updates) </pre>

  </font></span></div>

<br>______________________________<wbr>_________________<br>

CentOS-virt mailing list<br>

<a href="mailto:CentOS-virt@centos.org">CentOS-virt@centos.org</a><br>

<a href="https://lists.centos.org/mailman/listinfo/centos-virt" rel="noreferrer" target="_blank">https://lists.centos.org/<wbr>mailman/listinfo/centos-virt</a><br>

<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><p style="color:rgb(0,0,0);font-family:overpass,sans-serif;font-weight:bold;margin:0px;padding:0px;font-size:14px;text-transform:uppercase"><span>SANDRO</span> <span>BONAZZOLA</span></p><p style="color:rgb(0,0,0);font-family:overpass,sans-serif;font-size:10px;margin:0px 0px 4px;text-transform:uppercase"><span>ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D</span></p><p style="font-family:overpass,sans-serif;margin:0px;font-size:10px;color:rgb(153,153,153)"><a href="https://www.redhat.com/" style="color:rgb(0,136,206);margin:0px" target="_blank">Red Hat <span>EMEA</span></a></p><table border="0" style="color:rgb(0,0,0);font-family:overpass,sans-serif;font-size:medium"><tbody><tr><td width="100px"><a href="https://red.ht/sig" target="_blank"><img src="https://www.redhat.com/profiles/rh/themes/redhatdotcom/img/logo-red-hat-black.png" width="90" height="auto"></a></td><td style="font-size:10px"><div><a href="https://redhat.com/trusted" style="color:rgb(204,0,0);font-weight:bold" target="_blank">TRIED. TESTED. TRUSTED.</a></div></td></tr></tbody></table></div></div></div></div></div></div></div>

</div>