[Please Cc: the answers to me, I'm not subscribed.]

[As this concerns both filesystems and device-mapper, and I'm hesitant
to crosspost to kernel lists, I'm posting to lkml. Feel free to
continue the discussion on more appropriate lists.]

Hello lkml,

While moving my backups from one partition to another, I accidentally
generated a tar archive which seems to stress filesystems in an
interesting way when extracted. It's a rather narrow but real-world
benchmark. The contents are those of a backup hierarchy generated by
BackupPC [1]. The size of the tar (uncompressed) is about 492
gibibytes. The first 490 GiB extract quickly (>100 MiB/s); extracting
the last 2 GiB takes hours.

I've run some benchmarks on ext4 and btrfs, and very limited
benchmarks on xfs, jfs, nilfs2 and zfs-fuse, all mounted with noatime
but otherwise default options. After establishing ballpark figures for
how well each filesystem performs relative to the others, the main
focus was on ext4 behavior. I'm posting the results in the hope that
they will be of interest at least to filesystem and device-mapper
developers. The results are at
http://www.hut.fi/~sliedes/fsspeed/fsspeed.html and are also
summarized here.

I ran the benchmarks on a raw disk partition (/dev/sdd1), on an LVM
device-mapper linear mapping to the same partition, and on an LVM dm
linear mapping with dm-crypt on top of it. The disk is not
simultaneously used for anything else.

Basically the directory hierarchy consists of

* directory backuppc/cpool - a pool of compressed backed-up files
  - The individual files are of the form
    backuppc/cpool/f/1/d/f1d2d2f924e986ac86fdf7b36c94bcdf32beec15
  - There are some 456000 files in the pool, all hashed by their
    SHA256 sum
  - These take up roughly the first 490 gibibytes of the archive

* directory backuppc/pc with individual backups of a few machines
  - Essentially contains the root filesystems of quite normal desktop
    Linux machines.
  - All backed-up files are hard links to the pool.
  - Files with size 0 are not hardlinked but represented as themselves.
  - Contains some 10.4M hard links to files in the pool, interspersed
    with some 84300 empty files and very few regular files.

The benchmarks were all I/O-bound on the speed of the target
filesystem and partition. This was achieved by making a copy of the
tar that has all the file contents zeroed and using a fast compressor
(lzop), so that decompressing the .tar.lzop was still easily
target-disk-bound.
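For concreteness, here is a rough sketch of how such a zeroed-content
copy of a tar can be produced. This is purely illustrative and not
necessarily how the actual copy was made; the file names and helper
names are made up, and the resulting tar would still be compressed
with lzop afterwards:

  # Sketch: copy a tar archive so that all metadata (names, sizes,
  # modes, hard links, symlinks, ...) is preserved but the contents of
  # regular files are replaced with zero bytes of the same length.
  import tarfile

  class ZeroStream(object):
      """Minimal file-like object that yields `size` zero bytes."""
      def __init__(self, size):
          self.remaining = size
      def read(self, n=-1):
          if n < 0 or n > self.remaining:
              n = self.remaining
          self.remaining -= n
          return b"\0" * n

  def zero_copy(src_path, dst_path):
      with tarfile.open(src_path, "r|*") as src, \
           tarfile.open(dst_path, "w") as dst:
          for member in src:
              if member.isreg():
                  # Same header, zeroed data of the original length.
                  dst.addfile(member, ZeroStream(member.size))
              else:
                  # Directories, hard links, symlinks, devices:
                  # header only, no data.
                  dst.addfile(member)

  if __name__ == "__main__":
      zero_copy("backups.tar", "backups-zeroed.tar")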
All the benchmarks were run on a Core 2 Quad Q6600 2.40 GHz computer
with 8 GiB of RAM and vanilla kernel 2.6.33.2. The computer was not a
dedicated machine; however, the load was otherwise fairly light and
never touched the /dev/sdd disk. I tend to think of this as a benefit,
since this way the results represent performance as it varies under
light load, if the load has any effect at all. /dev/sdd is a 1500
gigabyte Seagate Barracuda 7200.11 disk and /dev/sdd1 is the only
partition on it.

I benchmarked two different things: the behavior of extracting the
tar, and after that (without unmounting or remounting in between)
rm -rf'ing the resulting directory structure.

I think the most surprising result of the benchmarks is that ext4
seems to be significantly faster under LVM than on the raw /dev/sdd1
partition. Without LVM the variance in run times is also much larger.
Here are the numbers for ext4 with and without LVM. For the raw
results and some nice graphs, see [2]. All times are in seconds.

Average is the average time taken, min the minimum and max the
maximum. stddev is the sample standard deviation and stddev % the
same expressed as a percentage of the average. If we assume that the
times are normally distributed (which they possibly aren't exactly),
the real mean lies between "avg min" and "avg max" with 95%
confidence, the distance from the average being the number in the
"avg ± 95%" column. Caveat: I'm not much of a statistician, so I
might have made some mistakes. See [2] for the calculations.

------------------------------------------------------------
ext4, /dev/sdd1 (no lvm), noatime, 9 samples:

                extract      rm
  average       20638        17518.11
  min           18567        14913
  max           25670        21196
  stddev        2137.86      2060.71
  stddev %      10.359%      11.763%
  avg ± 95%     1440.96      1388.96
  avg ± %       6.982%       7.929%
  avg min       19197.49     16129.15
  avg max (*)   22079.40     18907.07

------------------------------------------------------------
ext4, LVM, dm linear mapping on /dev/sdd1, 7 samples:

  # dmsetup table
  [...]
  rootvg-testdisk: 0 2720571392 linear 8:49 384

                extract      rm
  average       19475.7      13634.6
  min           18006        12036
  max           20247        14367
  stddev        738.341      763.391
  stddev %      3.791%       5.599%
  avg ± 95%     570.136      589.479
  avg ± %       2.927%       4.323%
  avg min       18905.6      13045.1
  avg max       20045.9      14224.0

------------------------------------------------------------
ext4, LVM and dm-crypt, dm linear mapping on /dev/sdd1, 5 samples:

  # dmsetup table
  rootvg-testdisk: 0 2720571392 linear 8:49 384
  rootvg-testdisk_crypt: 0 2720569336 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 253:3 2056

                extract      rm
  average       25541.4      15690.4
  min           22618        13766
  max           28550        18018
  stddev        2315.48      1687.58
  stddev %      9.066%       10.756%
  avg ± 95%     2159.19      1573.67
  avg ± %       8.454%       10.030%
  avg min       23382.2      14116.7
  avg max       27700.6      17264.1

------------------------------------------------------------

With dm-crypt, the extraction is CPU bound on this computer, so the
extraction benchmark is not that interesting there. What is
interesting, however, is the rm benchmark: it seems that ext4 under
LVM is consistently faster than ext4 on the raw partition (no LVM).
So far I haven't seen the same effect on btrfs, but my btrfs sample
size is quite small (see [2] for the measurements).

btrfs performs significantly better than ext4 in terms of run time.
However, in a sense it fails the test: a number of the hard links are
not actually extracted, since btrfs places quite heavy restrictions
on the number of hard links (kernel bugzilla #15762). The number of
hard links that failed to extract is low enough, though, that I do
not believe it affects the times. Here are the numbers for btrfs:

------------------------------------------------------------
btrfs, /dev/sdd1 (no lvm), noatime, 5 samples:

                extract      rm
  average       13514        8805.75
  min           12424        8024
  max           15202        9489
  stddev        1186.96      623.663
  stddev %      8.783%       7.082%
  avg ± 95%     1262.56      663.39
  avg ± %       9.343%       7.534%
  avg min       12251.4      8142.36
  avg max       14776.6      9469.14

------------------------------------------------------------

Another significant difference from ext4 in rm performance shows up
if we plot the number of files removed so far as a function of time
(see [2]): while ext4 offers fairly steady performance in terms of
files removed per second - the plot is mostly a straight line - btrfs
clearly accelerates towards the end. On ext4 a single unlink()
syscall may pause for at most tens of seconds, and even that is rare,
while on btrfs the process frequently stalls for minutes at a time,
blocked in unlink(). So while btrfs is faster in terms of total time,
it can create huge latencies for the calling process.
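One simple way to observe such per-call stalls is to time each
unlink()/rmdir() from the removing process itself. The sketch below is
only an illustration of that idea (the latency threshold and the
command-line path are made up); it is not the instrumentation actually
used for the plots in [2]:

  # Sketch: remove a directory tree bottom-up, timing every unlink()
  # and rmdir() and reporting calls that stall longer than a threshold.
  import os
  import sys
  import time

  THRESHOLD = 1.0   # seconds; report syscalls slower than this

  def timed(call, path):
      t0 = time.time()
      call(path)
      dt = time.time() - t0
      if dt > THRESHOLD:
          print("%8.1f s  %s" % (dt, path))
      return dt

  def remove_tree(root):
      files = 0
      in_unlink = 0.0
      for dirpath, dirnames, filenames in os.walk(root, topdown=False):
          for name in filenames:
              in_unlink += timed(os.unlink, os.path.join(dirpath, name))
              files += 1
          for name in dirnames:
              path = os.path.join(dirpath, name)
              # Symlinks to directories appear in dirnames; unlink those.
              if os.path.islink(path):
                  timed(os.unlink, path)
              else:
                  timed(os.rmdir, path)
      os.rmdir(root)
      print("%d files removed, %.1f s spent in unlink()"
            % (files, in_unlink))

  if __name__ == "__main__":
      remove_tree(sys.argv[1])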
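As a small aside on the "avg ± 95%" columns in the tables above: under
the normality assumption described earlier, such an interval for the
mean is commonly computed along the lines of the sketch below. This
uses the standard Student's t formula with made-up sample values and a
scipy dependency; it is not necessarily the exact calculation used in
[2] and will not reproduce the numbers above digit for digit:

  # Sketch: mean, sample standard deviation and a 95% confidence
  # interval for the mean of a small sample, assuming the times are
  # (roughly) normally distributed.  Sample values are made up.
  import math
  from statistics import mean, stdev   # stdev = sample standard deviation
  from scipy.stats import t            # Student's t quantiles

  samples = [19500.0, 20400.0, 18800.0, 21700.0, 20100.0]  # fake data

  n = len(samples)
  avg = mean(samples)
  s = stdev(samples)                                # "stddev"
  half = t.ppf(0.975, n - 1) * s / math.sqrt(n)     # "avg +/- 95%"

  print("average    %10.2f" % avg)
  print("stddev     %10.2f  (%.3f%%)" % (s, 100.0 * s / avg))
  print("avg +- 95%% %10.2f  (%.3f%%)" % (half, 100.0 * half / avg))
  print("avg min    %10.2f" % (avg - half))
  print("avg max    %10.2f" % (avg + half))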
One thing that I have figured out affects the rm time on ext4 is the
order in which rm happens to remove the directories, i.e. the pool
first or the backuppc/pc directory first. Removing the pool first
seems to make the rest of the operation faster. This is visible in
the plots of the number of removed files as a function of time as a
steep, short curve either at the beginning of the operation (pool
removed first) or at the end (pool removed last). [2]

On XFS, extracting the archive took more than 16 hours (58548
seconds; only one sample), so XFS performs rather poorly in this, I
assume, quite pathological case. On JFS, extraction took more than 19
hours (69365 seconds) and rm took 52702 seconds. On zfs-fuse I did
not wait for the extraction to finish, but a rough guesstimate is
that it would have taken slightly over a week.

The one filesystem that really beat every other in this particular
benchmark was nilfs2. On it, the metadata-heavy hard link extraction
phase was only slightly slower than the data-heavy pool extraction
phase. While on ext4 the data-heavy part typically took maybe 1.3
hours and the rest took 3-4 hours, on nilfs2 the data-heavy part took
2 hours and the metadata-heavy, typically pathological part only 15
minutes. Also, rm took only 40 minutes (2419 seconds; again only one
sample). ***Like btrfs, nilfs2 too failed to support enough hard
links to contain the directory structure***. Again, most files were
correctly extracted, so I don't believe that had a major impact on
the times.

See the graphs, data and further analysis at [2].

	Sami

[1] http://backuppc.sourceforge.net/
[2] http://www.hut.fi/~sliedes/fsspeed/fsspeed.html