List: debian-user
Subject: Re: LVM write performance
From: Dion Kant <msn () concero ! nl>
Date: 2011-08-30 20:17:26
Message-ID: 4E5D4556.1040506 () concero ! nl
On 08/20/2011 12:53 AM, Stan Hoeppner wrote:
> On 8/19/2011 4:38 PM, Dion Kant wrote:
>
>> I now think I understand the "strange" behaviour for block sizes not an
>> integral multiple of 4096 bytes. (Of course you guys already knew the
>> answer but just didn't want to make it easy for me to find the answer.)
>>
>> The newer disks today have a sector size of 4096 bytes. They may still
>> be reporting 512 bytes, but this is to keep some ancient OS-es working.
>>
>> When a block write is not an integral multiple of 4096 bytes, for example 512
>> bytes, 4095 or 8191 bytes, the driver must first read the sector, modify
>> it and finally write it back to the disk. This explains the bi and the
>> increased number of interrupts.
>>
>> I did some Google searches but did not find much. Can someone confirm
>> this hypothesis?
>
> The read-modify-write performance penalty of unaligned partitions on the
> "Advanced Format" drives (4KB native sectors) is a separate unrelated issue.
>
> As I demonstrated earlier in this thread, the performance drop seen when
> using dd with block sizes less than 4KB affects traditional 512B/sector
> drives as well. If one has a misaligned partition on an Advanced Format
> drive, one takes a double performance hit when dd bs is less than 4KB.
>
> Again, everything in (x86) Linux is optimized around the 'magic' 4KB
> size, including page size, filesystem block size, and LVM block size.
Ok, I have done some browsing through the kernel sources and I understand
the VFS a bit better now. When a read/write is issued on a block device
file, the block size is 4096 bytes, i.e. reads/writes to the disk are
done in blocks equal to the page cache page size, the magic 4KB.
Submitting a request with a block size which is not an integral multiple
of 4096 bytes results in a call to ll_rw_block(READ, 1, &bh), which
reads the 4096-byte blocks one by one into the page cache. This must be
done before the user data can be used to partially update the
corresponding buffer page in the cache. After being updated, the buffer
is flagged dirty and finally written to disk (8 sectors of 512 bytes).
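As a side note, whether a drive really uses 4096-byte physical sectors
behind a 512-byte logical interface (the read-modify-write hypothesis
above) can be checked via sysfs on reasonably recent kernels; sdc is
just an example device name here:
cat /sys/block/sdc/queue/logical_block_size
cat /sys/block/sdc/queue/physical_block_size
An Advanced Format disk typically reports 512 for the first and 4096
for the second.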
I found a nice debugging switch which helps to monitor the process:
echo 1 > /proc/sys/vm/block_dump
causes all bio requests to be logged as kernel output.
Example:
dd of=/dev/vg/d1 if=/dev/zero bs=4095 count=2 conv=sync
[ 239.977384] dd(6110): READ block 0 on dm-3
[ 240.026952] dd(6110): READ block 8 on dm-3
[ 240.027735] dd(6110): WRITE block 0 on dm-3
[ 240.027754] dd(6110): WRITE block 8 on dm-3
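These messages go to the kernel log, so they can also be inspected with
dmesg; the debugging switch should be turned off again afterwards:
dmesg | tail
echo 0 > /proc/sys/vm/block_dump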
The ll_rw_block(READ, 1, &bh) calls cause the reads which can be seen
when monitoring with vmstat. The tests given below (as you requested)
were carried out before I gained a better understanding of the VFS.
However, the questions I still have are:
1. Why are the partial block updates (through ll_rw_block(READ, 1, &bh))
so dramatically slow compared to other reads from the disk?
2. Furthermore, remember the much better performance I reported when
mounting a file system on the block device first, before accessing the
disk through the block device file. If I find some more spare time I
will do some more digging in the kernel. Maybe I will find that a
different set of f_ops is then used by the Virtual Filesystem Switch for
accessing the raw block device.
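A follow-up experiment that might shed some light on question 1 (just a
sketch, I have not measured this): bypassing the page cache with
O_DIRECT should make the preliminary reads disappear, because no buffer
page has to be filled before a partial update. With dd this would be
something like
dd of=/dev/vg/d1 if=/dev/zero bs=512 count=2097152 oflag=direct
Note that O_DIRECT on a block device requires the block size to be a
multiple of the logical sector size, so odd sizes like 4095 cannot be
tested this way.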
>
> BTW, did you run your test with each of the elevators, as I recommended?
> Do the following, testing dd after each change.
>
$ echo 128 > /sys/block/sdc/queue/read_ahead_kb
dom0-2:~ # echo deadline > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 54.0373 19.8704 1024
1024 54.2937 19.7765 1024
2048 52.1781 20.5784 1024
4096 13.751 78.0846 1024
8192 13.8519 77.5159 1024
dom0-2:~ # echo noop > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 53.9634 19.8976 1024
1024 52.0421 20.6322 1024
2048 54.0437 19.868 1024
4096 13.9612 76.9088 1024
8192 13.8183 77.7043 1024
dom0-2:~ # echo cfq > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop deadline [cfq]
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 56.0087 19.171 1024
1024 56.345 19.0565 1024
2048 56.0436 19.159 1024
4096 15.1232 70.9999 1024
8192 15.4236 69.6168 1024
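For repeated runs the three elevators can also be cycled with a small
loop (assuming, as above, that ./bw targets the same device when run
without arguments):
for s in deadline noop cfq; do
    echo $s > /sys/block/sdc/queue/scheduler
    cat /sys/block/sdc/queue/scheduler
    ./bw
done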
>
> Also, just for fun, and interesting results, increase your read_ahead_kb
> from the default 128 to 512.
>
> $ echo 512 > /sys/block/sdX/queue/read_ahead_kb
$ echo deadline > /sys/block/sdX/queue/scheduler
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 54.1023 19.8465 1024
1024 52.1824 20.5767 1024
2048 54.3797 19.7453 1024
4096 13.7252 78.2315 1024
8192 13.727 78.2211 1024
> $ echo noop > /sys/block/sdX/queue/scheduler
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 54.0853 19.8527 1024
1024 54.525 19.6927 1024
2048 50.6829 21.1855 1024
4096 14.1272 76.0051 1024
8192 13.914 77.1701 1024
> $ echo cfq > /sys/block/sdX/queue/scheduler
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 56.0274 19.1646 1024
1024 55.7614 19.256 1024
2048 56.5394 18.991 1024
4096 16.0562 66.8739 1024
8192 17.3842 61.7654 1024
Differences between deadline and noop are on the order of 2 to 3% in
favour of deadline. The run with the cfq elevator is remarkable: it
clearly performs worse, about 20% less (compared to the highest result)
for the 512 read_ahead_kb case. Another try with the same settings:
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 56.8122 18.8999 1024
1024 56.5486 18.9879 1024
2048 56.2555 19.0869 1024
4096 14.886 72.1311 1024
8192 15.461 69.4486 1024
So it looks like the previous result was at the low end of the
statistical variation.
>
> These changes are volatile so a reboot clears them in the event you're
> unable to change them back to the defaults for any reason. This is
> easily avoidable if you simply cat the files and write down the values
> before changing them. After testing, echo the default values back in.
>
I did some testing on a newer system with an AOC-USAS-S4i Adaptec
AACRAID controller on a Supermicro board. It uses the aacraid driver.
This controller supports RAID 0, 1 and 10, but by configuring it so that
it presents the disks to Linux as 4 single-disk RAID0 volumes (the
controller cannot do JBOD), we obtained much better performance with
Linux software RAID0, with striping in LVM, or with LVM on top of RAID0,
than with RAID0 managed by the controller. We now obtain 300 to 350 MB/s
sustained write performance, versus about 150 MB/s when using the
controller.
We use 4 ST32000644NS drives.
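For reference, a software RAID0 or LVM striping set-up of the kind
described above can be created roughly as follows (device names, chunk
size and stripe size are illustrative only, not our exact values):
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdc /dev/sdd /dev/sde /dev/sdf
or, striping directly with LVM:
pvcreate /dev/sdc /dev/sdd /dev/sde /dev/sdf
vgcreate vg /dev/sdc /dev/sdd /dev/sde /dev/sdf
lvcreate -i 4 -I 64 -L 1T -n d1 vg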
Repeating the tests on this system gives similar results, except that
the 2 TB drives have about 50% better write performance.
capture4:~ # cat /sys/block/sdc/queue/read_ahead_kb
128
capture4:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq
capture4:~ # ./bw /dev/sdc1
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
8192 8.5879 125.03 1024
4096 8.54407 125.671 1024
2048 65.0727 16.5007 1024
Note the performance drop by roughly a factor of 8 when halving bs from 4096 to 2048.
Reading the drive is about 8.8% faster than writing and runs at full speed for all block sizes:
capture4:~ # ./br /dev/sdc1
Reading 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 7.86782 136.473 1024
1024 7.85202 136.747 1024
2048 7.85979 136.612 1024
4096 7.86932 136.447 1024
8192 7.8509 136.767 1024
dd gives similar results:
capture4:~ # dd if=/dev/sdc1 of=/dev/null bs=512 count=2097152
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 7.85281 s, 137 MB/s
Dion