<html>
  <head>
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p>Hello everyone,</p>
    <p>Has anybody had the chance to test out this setup and reproduce the
      problem? I assume it's a setup that is used often these days, so a
      solution would benefit a lot of users. If I can be of any assistance,
      please contact me. <br>
    </p>
    <pre class="moz-signature" cols="72">-- 
Met vriendelijke groet,

Richard Landsman
<a class="moz-txt-link-freetext" href="http://rimote.nl">http://rimote.nl</a>

T: +31 (0)50 - 763 04 07
(ma-vr 9:00 tot 18:00)

24/7 bij storingen:
+31 (0)6 - 4388 7949
@RimoteSaS (Twitter Serviceberichten/security updates) </pre>
    <div class="moz-cite-prefix">On 04/10/2017 10:08 AM, Sandro
      Bonazzola wrote:<br>
    </div>
    <blockquote
cite="mid:CAPQRNTnM_ZGtWayLyg1DOzZNks3_bmXOjXUg=JREGWDhQ0Ha6g@mail.gmail.com"
      type="cite">
      <div dir="ltr">Adding Paolo and Miroslav.</div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Sat, Apr 8, 2017 at 4:49 PM, Richard
          Landsman - Rimote <span dir="ltr"><<a
              moz-do-not-send="true" href="mailto:richard@rimote.nl"
              target="_blank">richard@rimote.nl</a>></span> wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000">
              <p>Hello,</p>
              <p>I would really appreciate some help/guidance with this
                problem. First of all, sorry for the long message. I would
                file a bug, but I do not know whether it is my fault,
                dm-cache, qemu, or (probably) a combination of them. And I
                can imagine some of you have this setup up and running
                without problems (or maybe you think it works, just like I
                did, but it does not):</p>
              <p>PROBLEM<br>
                LVM cache writeback stops working as expected after a while
                with a qemu-kvm VM. A 100% working setup would be the holy
                grail in my opinion... and I must say the performance of
                KVM/qemu is great in the beginning.<br>
              </p>
              <p>DESCRIPTION</p>
              <p>When using software RAID 1 (2x HDD) + software RAID 1
                (2x SSD) and creating a cached LV out of them, the VM
                initially performs great (at least 40,000 IOPS on 4k random
                read/write)! But after a while (and a lot of random I/O,
                ca. 10 - 20 GB) it effectively turns into a writethrough
                cache, although there is plenty of space left on the
                cachedlv.</p>
              <p><br>
              </p>
              <p>When working as expected, all writes on the KVM host go to
                the SSDs:</p>
              <pre>iostat -x -m 2

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await  w_await  svctm  %util
sda               0.00   324.50    0.00   22.00     0.00    14.94  1390.57     1.90   86.39    0.00    86.39   5.32  11.70
sdb               0.00   324.50    0.00   22.00     0.00    14.94  1390.57     2.03   92.45    0.00    92.45   5.48  12.05
sdc               0.00  3932.00    0.00 <b>2191.50</b>     0.00   <b>270.07</b>   252.39    37.83   17.55    0.00    17.55   0.36  <b>78.05</b>
sdd               0.00  3932.00    0.00 <b>2197.50</b>     0.00   <b>271.01</b>   252.57    38.96   18.14    0.00    18.14   0.36  <b>78.95</b>
</pre>
              <p><br>
              </p>
              <p>When not working as expected, all writes on the KVM host go
                through the SSDs on to the HDDs (effectively disabling
                writeback, so it behaves like writethrough):<br>
              </p>
              <pre>Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await  w_await  svctm  %util
sda               0.00     7.00  234.50  <b>173.50</b>     0.92    <b> 1.95</b>    14.38    29.27   71.27  111.89    16.37   2.45 <b>100.00</b>
sdb               0.00     3.50  212.00  <b>177.50</b>     0.83    <b> 1.95</b>    14.60    35.58   91.24  143.00    29.42   2.57 <b>100.10</b>
sdc               2.50     0.00  566.00  <b>199.00</b>     2.69     0.78     9.28     0.08    0.11    0.13     0.04   0.10   <b>7.70</b>
sdd               1.50     0.00   76.00  <b>199.00</b>     0.65     0.78    10.66     0.02    0.07    0.16     0.04   0.07   <b>1.85</b>
</pre>
              <p><br>
              </p>
              <p>Stuff I've checked/tried:</p>
              <p>- The data in the cached LV has by then not even exceeded
                half of the available space, so this should not happen. It
                even happens when only 20% of cachedata is used.<br>
                - It seems to be triggered most of the time when the
                Cpy%Sync column of `lvs -a` is at about 30%. But this is not
                always the case!<br>
                - Changing the cache policy from smq to cleaner, waiting
                until the flush is done (check with lvs -a) and then
                switching back to smq seems to help <i>sometimes</i>! But
                not always...<br>
              </p>
              <pre>lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv

lvs -a          # wait until Cpy%Sync reaches 0

lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv</pre>
              <p>- <b>When mounting the LV inside the host this does not
                  seem to happen!!</b> So it looks like a qemu-kvm /
                dm-cache combination issue. The only difference is that
                inside the host I run mkfs directly instead of LVM inside
                the VM (so it could also be a problem with LVM inside the VM
                on top of LVM on the KVM host? Probably a small chance,
                because for the first 10 - 20 GB it works great!)<br>
              </p>
              <p>- Tried disabling SELinux, upgrading to the newest kernels
                (elrepo ml and lt), played around with the dirty-writeback
                settings (/proc/sys/vm/dirty_writeback_centisecs,
                /proc/sys/vm/dirty_expire_centisecs,
                /proc/sys/vm/dirty_ratio), the migration threshold of
                dmsetup, and other probably unimportant knobs like
                vm.dirty_bytes.<br>
              </p>
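              <p>For completeness, these are roughly the knobs I mean (the
                values below are only illustrative examples, not
                recommendations):</p>
              <pre># host writeback tuning (example values)
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
cat /proc/sys/vm/dirty_writeback_centisecs /proc/sys/vm/dirty_expire_centisecs

# the dm-cache migration threshold can also be set through LVM
# (the value is in 512-byte sectors)
lvchange --cachesettings 'migration_threshold=16384' ${VGBASE}/cachedlv</pre>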
              <p>- when in "slow state" the systems kworkers are
                exessively using IO (10 - 20 MB per kworker process).
                This seems to be the writeback process (CPY%Sync)
                because the cache wants to flush to HDD. But the strange
                thing is that after a good sync (0% left), the disk may
                become slow again after a few MBs of data. A reboot
                sometimes helps.</p>
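              <p>A quick way to watch what the cache is doing while this
                happens (just a sketch; cl-cachedlv is the device-mapper
                name from my setup below, adjust to yours):</p>
              <pre># the dm-cache status line includes used/total cache blocks and the dirty-block count
watch -n 2 'dmsetup status cl-cachedlv'

# lvs can report the same if your lvm2 version has the cache_* fields (see lvs -o help)
lvs -a -o name,data_percent,cache_dirty_blocks,cache_used_blocks cl</pre>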
              <p>- Have tried iothreads, virtio-scsi, the vcpu driver
                setting on the virtio-scsi controller, cache settings, disk
                schedulers, etc. Nothing helped.</p>
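              <p>For reference, something along these lines can be used to
                check what is actually in effect (device names are just
                examples):</p>
              <pre># confirm which cache/aio options the running qemu process really got
ps -o args= -C qemu-kvm | tr ',' '\n' | grep -E 'cache=|aio='

# check / change the I/O scheduler of a host disk (example: sda)
cat /sys/block/sda/queue/scheduler
echo deadline > /sys/block/sda/queue/scheduler</pre>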
              <p>- The new Samsung 950 PRO SSDs have HPA enabled (30%!!);
                the host is an AMD FX(tm)-8350 with 16 GB RAM.<br>
              </p>
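              <p>For anyone reproducing this: the HPA on the SSDs can be
                verified with something like the following (assuming they
                show up as SATA devices, as sdc/sdd do here):</p>
              <pre># show visible vs. native max sectors (Host Protected Area)
hdparm -N /dev/sdc</pre>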
              <p>It feels like the LVM cache has a threshold (about 20 GB of
                dirty data) after which it stops allowing the qemu-kvm
                process to use writeback caching (use from the host itself
                does not seem to hit this limitation). It starts flushing,
                but only up to a certain point. After a few MBs of data it
                is right back in the slow spot again. The only solution is
                waiting a long time (independent of Cpy%Sync) or sometimes
                changing the cache policy and forcing a flush. This keeps me
                from using this system in production. But it's so promising,
                so I hope somebody can help.<br>
              </p>
              <p>Desired state: doing the fio test (described in the
                reproduce section) repeatedly should keep being fast until
                the cachedlv is more or less full. If syncing back to disk
                causes this degradation, it should flush fully within a
                reasonable time and then allow fast writes again up to a
                given threshold. It now behaves like a one-time-use cache
                that only uses a fraction of the SSD and is useless/very
                unstable afterwards.</p>
              <p>REPRODUCE<br>
                1. Install the newest CentOS 7 on software RAID 1 HDDs with
                LVM. Keep a lot of space for the LVM cache (no /home)! So
                make the VG as large as possible during anaconda
                partitioning. <br>
              </p>
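              <p>A quick sanity check after the install, to confirm enough
                free space is left in the VG for the cache (just the
                standard reporting commands):</p>
              <pre>vgs     # check the VFree column
pvs
lvs -a</pre>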
              <p>2. Once installed and booted into the system, install
                qemu-kvm:<br>
              </p>
              <p>yum install -y centos-release-qemu-ev<br>
                yum install -y qemu-kvm-ev libvirt bridge-utils
                net-tools<br>
                # disable ksm (probably not important / needed)<br>
                systemctl disable ksm<br>
                systemctl disable ksmtuned<br>
              </p>
              <p>3. Create the LVM cache</p>
              <p># set some variables and create a RAID 1 array with the two
                SSDs<br>
              </p>
              <pre>VGBASE=                        # your VG name
ssddevice1=/dev/sdX1
ssddevice2=/dev/sdY1
hddraiddevice=/dev/mdXXX
ssdraiddevice=/dev/mdXXX

mdadm --create --verbose ${ssdraiddevice} --level=mirror --bitmap=none \
      --raid-devices=2 ${ssddevice1} ${ssddevice2}</pre>
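              <p>Before continuing, it is worth confirming that the new
                array is up and syncing (a quick check, nothing more):</p>
              <pre>cat /proc/mdstat
mdadm --detail ${ssdraiddevice}</pre>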
              <p># create PV and extend VG<br>
              </p>
              <p> pvcreate ${ssdraiddevice} && vgextend
                ${VGBASE} ${ssdraiddevice}<br>
              </p>
              <p># create slow LV on HDDs (use max space left if you
                want)<br>
              </p>
              <p> pvdisplay ${hddraiddevice}<br>
                 lvcreate -lXXXX -n cachedlv ${VGBASE} ${hddraiddevice}</p>
              <p># create the meta and data LVs: for testing purposes I keep
                about 20G of the SSD for an uncached LV, to rule out the SSD
                itself as the cause.<br>
              </p>
              <p>lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}</p>
              <p># The rest can be used as cachedata/metadata.<br>
              </p>
              <p> pvdisplay ${ssdraiddevice}<br>
                # use about 1/1000 of the space left on the SSD for the
                metadata (minimum of 4 extents)<br>
                 lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}<br>
                # the rest can be used as cachedata<br>
                 lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}</p>
              <p># convert/combine pools so cachedlv is actually cached<br>
              </p>
              <p> lvconvert --type cache-pool --cachemode writeback
                --poolmetadata ${VGBASE}/cachemeta ${VGBASE}/cachedata</p>
              <p> lvconvert --type cache --cachepool ${VGBASE}/cachedata
                ${VGBASE}/cachedlv</p>
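              <p>To double-check the result before putting a VM on it (exact
                reporting field names differ between lvm2 versions; lvs -o
                help lists the ones yours supports):</p>
              <pre>lvs -a ${VGBASE}      # cachedlv should now show a C in its attributes and [cachedata] as pool
lvs -o+cache_policy,cache_settings ${VGBASE}/cachedlv</pre>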
              <p><br>
              </p>
              # my system now looks like (VG is called cl, default of
              installer)<br>
              <pre>[root@localhost ~]# lvs -a
  LV                VG Attr       LSize   Pool        Origin
  [cachedata]       cl Cwi---C---  97.66g
<b>  [cachedata_cdata] cl Cwi-ao----  97.66g</b>
<b>  [cachedata_cmeta] cl ewi-ao---- 100.00m</b>
<b>  cachedlv          cl Cwi-aoC---   1.75t [cachedata] [cachedlv_corig]</b>
  [cachedlv_corig]  cl owi-aoC---   1.75t
  [lvol0_pmspare]   cl ewi------- 100.00m
  root              cl -wi-ao----  46.56g
  swap              cl -wi-ao----  14.96g
<b>  testssd           cl -wi-a-----  45.47g</b>

[root@localhost ~]# lsblk
NAME                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdd                        8:48   0   163G  0 disk
└─sdd1                     8:49   0   163G  0 part
  └─md128                  9:128  0 162.9G  0 raid1
    ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
    │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
    ├─cl-testssd         253:2    0  45.5G  0 lvm
    └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
      └─cl-cachedlv      253:6    0   1.8T  0 lvm
sdb                        8:16   0   1.8T  0 disk
├─sdb2                     8:18   0   1.8T  0 part
│ └─md127                  9:127  0   1.8T  0 raid1
│   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
│   ├─cl-root            253:0    0  46.6G  0 lvm   /
│   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
│     └─cl-cachedlv      253:6    0   1.8T  0 lvm
└─sdb1                     8:17   0   954M  0 part
  └─md126                  9:126  0   954M  0 raid1 /boot
sdc                        8:32   0   163G  0 disk
└─sdc1                     8:33   0   163G  0 part
  └─md128                  9:128  0 162.9G  0 raid1
    ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
    │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
    ├─cl-testssd         253:2    0  45.5G  0 lvm
    └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
      └─cl-cachedlv      253:6    0   1.8T  0 lvm
sda                        8:0    0   1.8T  0 disk
├─sda2                     8:2    0   1.8T  0 part
│ └─md127                  9:127  0   1.8T  0 raid1
│   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
│   ├─cl-root            253:0    0  46.6G  0 lvm   /
│   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
│     └─cl-cachedlv      253:6    0   1.8T  0 lvm
└─sda1                     8:1    0   954M  0 part
  └─md126                  9:126  0   954M  0 raid1 /boot
</pre>
              <br>
              # now create the VM<br>
              wget
              <a moz-do-not-send="true"
                href="http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso"
                target="_blank">http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso</a>
              -P /home/<br>
              DISK=/dev/mapper/XXXX-cachedlv<br>
              <br>
              # watch out, my network setup uses a custom bridge/network in
              the following command. Please replace it with what you
              normally use.<br>
              virt-install -n CentOS1 -r 12000 --os-variant=centos6.7
              --vcpus 7 --disk path=${DISK},cache=none,bus=virtio
              --network bridge=pubbr,model=virtio --cdrom
              /home/CentOS-6.9-x86_64-minimal.iso --graphics
              vnc,port=5998,listen=0.0.0.0 --cpu host <br>
              <br>
              # now connect from a client PC to qemu<br>
              virt-viewer --connect=qemu+ssh://root@192.168.0.XXX/system
              --name CentOS1<br>
              <br>
              And install everything on the single vda disk with LVM (I use
              the defaults in anaconda, but remove the large /home to
              prevent the SSD from being overused). <br>
              <br>
              After install and reboot, log in to the VM and<br>
              <br>
              yum install epel-release -y && yum install screen
              fio htop -y<br>
              <br>
              and then run the disk test:<br>
              <br>
              fio --randrepeat=1 --ioengine=libaio --direct=1
              --gtod_reduce=1 --name=test <b>--filename=test</b>
              --bs=4k --iodepth=64 --size=4G --readwrite=randrw
              --rwmixread=75<br>
              <br>
              then <b>keep repeating </b>but <b>change the filename</b>
              attribute so it does not use the same blocks over and over
              again. <br>
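              <br>
              For example with a small loop like this (just a sketch; adjust
              the number of runs and the size so you stay well below the
              filesystem size):<br>
              <pre>for i in $(seq 1 10); do
    fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
        --name=test${i} --filename=test${i} --bs=4k --iodepth=64 \
        --size=4G --readwrite=randrw --rwmixread=75
done</pre>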
              <br>
              In the beginning the performance is great!! Wow, very
              impressive: 150 MB/s of 4k random r/w (close to bare metal,
              about 20% - 30% loss). But after a few runs (usually about 4
              or 5, always changing the filename but not overfilling the
              FS), it drops to about 10 MB/s. <br>
              <br>
              normal/in the beginning<br>
              <br>
               read : io=3073.2MB, bw=183085KB/s, <b>iops=45771</b> ,
              runt= 17188msec<br>
                write: io=1022.1MB, bw=60940KB/s, <b>iops=15235</b> ,
              runt= 17188msec<br>
              <br>
              but then<br>
              <br>
               read : io=3073.2MB, bw=183085KB/s, <b>iops=</b><b>2904</b>
              , runt= 17188msec<br>
                write: io=1022.1MB, bw=60940KB/s, <b>iops=1751</b> ,
              runt= 17188msec<br>
              <br>
              or even worse up to the point that it is actually the HDD
              that is written to (about 500 iops).<br>
              <br>
              P.S. when a test is/was slow, that means the file is on the
              HDDs. So even when the problem is fixed (sometimes just by
              waiting), that specific file will keep being slow when redoing
              the test until it's promoted to the LVM cache (which takes a
              lot of reads, I think). And once on the SSD it sometimes keeps
              being fast, although a new test file will be slow. So I really
              recommend changing the test file every time when trying to see
              whether the speed has changed. <br>
              <span class="HOEnZb"><font color="#888888"> <br>
                  <pre class="m_-2193999298270607371moz-signature" cols="72">-- 
Met vriendelijke groet,

Richard Landsman
<a moz-do-not-send="true" class="m_-2193999298270607371moz-txt-link-freetext" href="http://rimote.nl" target="_blank">http://rimote.nl</a>

T: +31 (0)50 - 763 04 07
(ma-vr 9:00 tot 18:00)

24/7 bij storingen:
+31 (0)6 - 4388 7949
@RimoteSaS (Twitter Serviceberichten/security updates) </pre>
                </font></span></div>
            <br>
            _______________________________________________<br>
            CentOS-virt mailing list<br>
            <a moz-do-not-send="true"
              href="mailto:CentOS-virt@centos.org">CentOS-virt@centos.org</a><br>
            <a moz-do-not-send="true"
              href="https://lists.centos.org/mailman/listinfo/centos-virt"
              rel="noreferrer" target="_blank">https://lists.centos.org/mailman/listinfo/centos-virt</a><br>
            <br>
          </blockquote>
        </div>
        <br>
        <br clear="all">
        <div><br>
        </div>
        -- <br>
        <div class="gmail_signature" data-smartmail="gmail_signature">
          <div dir="ltr">
            <div>
              <div dir="ltr">
                <div dir="ltr">
                  <div dir="ltr">
                    <div dir="ltr">
                      <p
style="color:rgb(0,0,0);font-family:overpass,sans-serif;font-weight:bold;margin:0px;padding:0px;font-size:14px;text-transform:uppercase"><span>SANDRO</span> <span>BONAZZOLA</span></p>
                      <p
style="color:rgb(0,0,0);font-family:overpass,sans-serif;font-size:10px;margin:0px
                        0px 4px;text-transform:uppercase"><span>ASSOCIATE
                          MANAGER, SOFTWARE ENGINEERING, EMEA ENG
                          VIRTUALIZATION R&D</span></p>
                      <p
style="font-family:overpass,sans-serif;margin:0px;font-size:10px;color:rgb(153,153,153)"><a
                          moz-do-not-send="true"
                          href="https://www.redhat.com/"
                          style="color:rgb(0,136,206);margin:0px"
                          target="_blank">Red Hat <span>EMEA</span></a></p>
                      <table
style="color:rgb(0,0,0);font-family:overpass,sans-serif;font-size:medium"
                        border="0">
                        <tbody>
                          <tr>
                            <td width="100px"><a moz-do-not-send="true"
                                href="https://red.ht/sig"
                                target="_blank"><img
                                  moz-do-not-send="true"
src="https://www.redhat.com/profiles/rh/themes/redhatdotcom/img/logo-red-hat-black.png"
                                  width="90" height="auto"></a></td>
                            <td style="font-size:10px">
                              <div><a moz-do-not-send="true"
                                  href="https://redhat.com/trusted"
                                  style="color:rgb(204,0,0);font-weight:bold"
                                  target="_blank">TRIED. TESTED.
                                  TRUSTED.</a></div>
                            </td>
                          </tr>
                        </tbody>
                      </table>
                    </div>
                  </div>
                </div>
              </div>
            </div>
          </div>
        </div>
      </div>
      <br>
      <fieldset class="mimeAttachmentHeader"></fieldset>
      <br>
      <pre wrap="">_______________________________________________
CentOS-virt mailing list
<a class="moz-txt-link-abbreviated" href="mailto:CentOS-virt@centos.org">CentOS-virt@centos.org</a>
<a class="moz-txt-link-freetext" href="https://lists.centos.org/mailman/listinfo/centos-virt">https://lists.centos.org/mailman/listinfo/centos-virt</a>
</pre>
    </blockquote>
    <br>
  </body>
</html>