[CentOS] Huge write amplification with thin provisioned logical volumes

Mon Dec 5 10:28:25 UTC 2016
Radu Radutiu <rradutiu at gmail.com>

Hi,

I've noticed a huge write amplification problem with thinly provisioned
logical volumes, and I am wondering if anyone can explain why it happens and
whether and how it can be fixed. The behavior is the same on CentOS 6.8 and
CentOS 7.2.

I have an NVMe card (Intel DC P3600, 2 TB) on which I create a thinly
provisioned logical volume:

  pvcreate /dev/nvme0n1
  vgcreate vgg /dev/nvme0n1
  lvcreate -l100%FREE -T vgg/thinpool
  lvcreate -V40000M -T vgg/thinpool -n brick1
  mkfs.xfs /dev/vgg/brick1
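
For reference, this is roughly how I check what chunk size the thin pool got
and what I/O geometry the stack reports (the mount point below is just an
example, and the lvm2 field names are the ones I have on CentOS 7):

  # thin pool chunk size, zeroing mode and discard passdown
  lvs -o lv_name,chunk_size,zero,discards vgg

  # I/O sizes the thin device advertises to the filesystem
  lsblk -t /dev/vgg/brick1

  # XFS geometry that mkfs picked up (run against the mount point)
  xfs_info /mnt/brick1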

If I run a write test (dd if=/dev/zero of=./zero.img bs=4k count=100000
oflag=dsync) I see in iotop that the actual disk write is about 30 times the
amount of data that I am actually writing to disk:

Total DISK READ: 0.00 B/s | Total DISK WRITE: 1001.23 M/s
    TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
10:59:53 34453 be/4 root        0.00 B/s   30.34 M/s  0.00 % 12.10 % dd if=/dev/zero of=./zero.img bs=4k count=100000 oflag=dsync
Total DISK READ: 0.00 B/s | Total DISK WRITE: 991.92 M/s
    TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
10:59:54 34453 be/4 root        0.00 B/s   30.05 M/s  0.00 % 12.63 % dd if=/dev/zero of=./zero.img bs=4k count=100000 oflag=dsync
Total DISK READ: 0.00 B/s | Total DISK WRITE: 1024.52 M/s
    TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
10:59:55 34453 be/4 root        0.00 B/s   31.05 M/s  0.00 % 12.49 % dd if=/dev/zero of=./zero.img bs=4k count=100000 oflag=dsync
10:59:55  1057 be/3 root        0.00 B/s   15.39 K/s  0.00 %  0.01 % [jbd2/sda1-8]
Total DISK READ: 0.00 B/s | Total DISK WRITE: 967.60 M/s
    TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
10:59:56 34453 be/4 root        0.00 B/s   29.32 M/s  0.00 % 12.75 % dd if=/dev/zero of=./zero.img bs=4k count=100000 oflag=dsync
Total DISK READ: 0.00 B/s | Total DISK WRITE: 943.66 M/s
    TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
10:59:58 34453 be/4 root        0.00 B/s   28.60 M/s  0.00 % 11.79 % dd if=/dev/zero of=./zero.img bs=4k count=100000 oflag=dsync
10:59:58 34448 be/4 root        0.00 B/s    3.84 K/s  0.00 %  0.00 % python /usr/sbin/iotop -o -b -t
Total DISK READ: 0.00 B/s | Total DISK WRITE: 959.40 M/s
    TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
10:59:59 34453 be/4 root        0.00 B/s   29.07 M/s  0.00 % 11.81 % dd if=/dev/zero of=./zero.img bs=4k count=100000 oflag=dsync
Total DISK READ: 0.00 B/s | Total DISK WRITE: 948.38 M/s
    TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
11:00:00 34453 be/4 root        0.00 B/s   28.73 M/s  0.00 % 11.57 % dd if=/dev/zero of=./zero.img bs=4k count=100000 oflag=dsync

For a 30 MB/s write at the application level I get around 1000 MB/s of writes
at the device level, i.e. roughly 33x amplification.
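
As a cross-check on the iotop numbers I also watch the block devices directly,
with something along these lines (iostat is from the sysstat package):

  # wMB/s for the raw NVMe device and for the dm-* targets on top of it,
  # refreshed every second
  iostat -xm 1

This should show whether the ~1000 MB/s is really hitting nvme0n1 or is being
counted somewhere in the device-mapper stack.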

On CentOS 6, if I align the data using the values from
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Setting%20Up%20Volumes/
I get only a 7x amplification.
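
Roughly the shape of the aligned setup on CentOS 6; the exact --dataalignment
and --chunksize values come from the guide above and depend on the disk
layout, so the numbers here are only illustrative:

  pvcreate --dataalignment 256K /dev/nvme0n1
  vgcreate vgg /dev/nvme0n1
  # thin pool with an explicit chunk size and chunk zeroing disabled
  lvcreate -l100%FREE -T vgg/thinpool --chunksize 256K --zero n
  lvcreate -V40000M -T vgg/thinpool -n brick1
  mkfs.xfs /dev/vgg/brick1
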
On CentOS 7 I see the same 7x amplification using the default lvcreate
options. This is the CentOS 7 iotop output:

12:48:29 Total DISK READ :       0.00 B/s | Total DISK WRITE :      32.24 M/s
12:48:29 Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     226.63 M/s
    TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
12:48:29 15234 be/3 root        0.00 B/s    3.80 K/s  0.00 % 35.20 % [jbd2/dm-8-8]
12:48:29 15258 be/4 root        0.00 B/s   32.24 M/s  0.00 % 10.64 % dd if=/dev/zero of=./zero.img bs=4k count=100000 oflag=dsync
12:48:29 14870 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.05 % [kworker/u80:1]
12:48:29 15240 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.03 % [kworker/u80:2]
12:48:29 15255 be/4 root        0.00 B/s    3.80 K/s  0.00 %  0.00 % python /usr/sbin/iotop -o -b -t
12:48:30 Total DISK READ :       0.00 B/s | Total DISK WRITE :      31.97 M/s
12:48:30 Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     224.85 M/s
    TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
12:48:30 15234 be/3 root        0.00 B/s    0.00 B/s  0.00 % 35.14 % [jbd2/dm-8-8]
12:48:30 15258 be/4 root        0.00 B/s   31.97 M/s  0.00 % 10.61 % dd if=/dev/zero of=./zero.img bs=4k count=100000 oflag=dsync
12:48:30 14870 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.05 % [kworker/u80:1]
12:48:30 15240 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.03 % [kworker/u80:2]
12:48:31 Total DISK READ :       0.00 B/s | Total DISK WRITE :      32.50 M/s
12:48:31 Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     228.94 M/s
    TIME  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
12:48:31 15234 be/3 root        0.00 B/s    0.00 B/s  0.00 % 35.28 % [jbd2/dm-8-8]
12:48:31 15258 be/4 root        0.00 B/s   32.48 M/s  0.00 % 10.72 % dd if=/dev/zero of=./zero.img bs=4k count=100000 oflag=dsync

Still, 7x write amplification seems like too much. Has anyone seen this, or
does anyone have an explanation for it? I am rewriting the same file with dd
multiple times, so the filesystem and the thin LVM layer should be using
already provisioned space.
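
To double-check that assumption, this is roughly how I verify that the blocks
are already provisioned before rerunning dd (lvm2 report fields as on CentOS 7):

  # Data% of the thin volume and the pool should not grow across reruns
  # if the rewrite only touches already provisioned chunks
  lvs -o lv_name,data_percent,metadata_percent vgg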

Best regards,
Radu