[Please Cc: the answers to me, I'm not subscribed.]

[As this concerns both filesystems and device-mapper, and I'm hesitant
to crosspost to kernel lists, I'm posting to lkml. Feel free to
continue the discussion on more appropriate lists.]

Hello lkml,

While moving my backups from one partition to another, I accidentally
generated a tar archive which seems to stress filesystems in an
interesting way when extracted. It's a rather narrow but real-world
benchmark. The contents are those of a backup hierarchy generated by
BackupPC [1]. The size of the tar (uncompressed) is about 492
gibibytes. The first 490 GiB extract quickly (>100 MiB/s); extracting
the last 2 GiB takes hours.

I've run some benchmarks on ext4 and btrfs, and very limited
benchmarks on xfs, jfs, nilfs2 and zfs-fuse, all mounted with noatime
but otherwise default options. After establishing ballpark figures for
how well each filesystem performs relative to the others, the main
focus was on ext4 behavior. I'm posting the results in the hope that
they will be of interest at least to filesystem and device-mapper
developers. The results are at
http://www.hut.fi/~sliedes/fsspeed/fsspeed.html and are also
summarized here.

I ran the benchmarks on a raw disk partition (/dev/sdd1), on an LVM
device-mapper linear mapping to the same partition, and on an LVM dm
linear mapping with dm-crypt on top of it. The disk is not
simultaneously used for anything else.

Basically the directory hierarchy consists of

* directory backuppc/cpool - a pool of compressed backed-up files
  - The individual files are of the form
    backuppc/cpool/f/1/d/f1d2d2f924e986ac86fdf7b36c94bcdf32beec15
  - There are some 456000 files in the pool, all hashed by their
    SHA256 sum
  - These take up roughly the first 490 gibibytes of the archive

* directory backuppc/pc with individual backups of a few machines
  - Essentially contains the root filesystems of quite normal desktop
    Linux machines.
  - All backed-up files are hard links to the pool.
  - Files with size 0 are not hardlinked but represented as themselves.
  - Contains some 10.4M hard links to files in the pool, interspersed
    with some 84300 empty files and very few regular files.

The benchmarks were all I/O-bound on the speed of the target
filesystem and partition. This was achieved by making a copy of the
tar that has all the file contents zeroed and using a fast compressor
(lzop), so that decompressing the .tar.lzop was still easily
target-disk-bound.
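For concreteness, here is a rough sketch of how such a zeroed-content
copy of a tar can be produced. This is purely illustrative and not
necessarily how the actual copy was made; the file names and helper
names are made up, and the resulting tar would still be compressed
with lzop afterwards:

  # Sketch: copy a tar archive so that all metadata (names, sizes,
  # modes, hard links, symlinks, ...) is preserved but the contents of
  # regular files are replaced with zero bytes of the same length.
  import tarfile

  class ZeroStream(object):
      """Minimal file-like object that yields `size` zero bytes."""
      def __init__(self, size):
          self.remaining = size
      def read(self, n=-1):
          if n < 0 or n > self.remaining:
              n = self.remaining
          self.remaining -= n
          return b"\0" * n

  def zero_copy(src_path, dst_path):
      with tarfile.open(src_path, "r|*") as src, \
           tarfile.open(dst_path, "w") as dst:
          for member in src:
              if member.isreg():
                  # Same header, zeroed data of the original length.
                  dst.addfile(member, ZeroStream(member.size))
              else:
                  # Directories, hard links, symlinks, devices:
                  # header only, no data.
                  dst.addfile(member)

  if __name__ == "__main__":
      zero_copy("backups.tar", "backups-zeroed.tar")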
All the benchmarks were run on a Core 2 Quad Q6600 2.40 GHz computer
with 8 GiB of RAM and vanilla kernel 2.6.33.2. The computer was not a
dedicated machine; however, the load was otherwise fairly light and
never touched the /dev/sdd disk. I tend to think of this as a benefit,
since this way the results represent performance as it varies under
light load, if the load has any effect at all. /dev/sdd is a 1500
gigabyte Seagate Barracuda 7200.11 disk and /dev/sdd1 is the only
partition on it.

I benchmarked two different things: the behavior of extracting the
tar, and after that (without unmounting or remounting in between)
rm -rf'ing the resulting directory structure.

I think the most surprising result of the benchmarks is that ext4
seems to be significantly faster under LVM than on the raw /dev/sdd1
partition. Without LVM the variance in run times is also much larger.
Here are the numbers for ext4 with and without LVM. For the raw
results and some nice graphs, see [2]. All times are in seconds.

Average is the average time taken, min the minimum and max the
maximum. stddev is the sample standard deviation and stddev % the
same expressed as a percentage of the average. If we assume that the
times are normally distributed (which they possibly aren't exactly),
the real mean lies between "avg min" and "avg max" with 95%
confidence, the distance from the average being the number in the
"avg ± 95%" column. Caveat: I'm not much of a statistician, so I
might have made some mistakes. See [2] for the calculations.

------------------------------------------------------------
ext4, /dev/sdd1 (no lvm), noatime, 9 samples:

                extract      rm
  average       20638        17518.11
  min           18567        14913
  max           25670        21196
  stddev        2137.86      2060.71
  stddev %      10.359%      11.763%
  avg ± 95%     1440.96      1388.96
  avg ± %       6.982%       7.929%
  avg min       19197.49     16129.15
  avg max (*)   22079.40     18907.07

------------------------------------------------------------
ext4, LVM, dm linear mapping on /dev/sdd1, 7 samples:

  # dmsetup table
  [...]
  rootvg-testdisk: 0 2720571392 linear 8:49 384

                extract      rm
  average       19475.7      13634.6
  min           18006        12036
  max           20247        14367
  stddev        738.341      763.391
  stddev %      3.791%       5.599%
  avg ± 95%     570.136      589.479
  avg ± %       2.927%       4.323%
  avg min       18905.6      13045.1
  avg max       20045.9      14224.0

------------------------------------------------------------
ext4, LVM and dm-crypt, dm linear mapping on /dev/sdd1, 5 samples:

  # dmsetup table
  rootvg-testdisk: 0 2720571392 linear 8:49 384
  rootvg-testdisk_crypt: 0 2720569336 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 253:3 2056

                extract      rm
  average       25541.4      15690.4
  min           22618        13766
  max           28550        18018
  stddev        2315.48      1687.58
  stddev %      9.066%       10.756%
  avg ± 95%     2159.19      1573.67
  avg ± %       8.454%       10.030%
  avg min       23382.2      14116.7
  avg max       27700.6      17264.1

------------------------------------------------------------

With dm-crypt, the extraction is CPU bound on this computer, so the
extraction benchmark is not that interesting there. What is
interesting, however, is the rm benchmark: it seems that ext4 under
LVM is consistently faster than ext4 on the raw partition (no LVM).
So far I haven't seen the same effect on btrfs, but my btrfs sample
size is quite small (see [2] for the measurements).

btrfs performs significantly better than ext4 in terms of run time.
However, in a sense it fails the test: a number of the hard links are
not actually extracted, since btrfs places quite heavy restrictions
on the number of hard links (kernel bugzilla #15762). The number of
hard links that failed to extract is low enough, though, that I do
not believe it affects the times. Here are the numbers for btrfs:

------------------------------------------------------------
btrfs, /dev/sdd1 (no lvm), noatime, 5 samples:

                extract      rm
  average       13514        8805.75
  min           12424        8024
  max           15202        9489
  stddev        1186.96      623.663
  stddev %      8.783%       7.082%
  avg ± 95%     1262.56      663.39
  avg ± %       9.343%       7.534%
  avg min       12251.4      8142.36
  avg max       14776.6      9469.14

------------------------------------------------------------

Another significant difference from ext4 in rm performance shows up
if we plot the number of files removed so far as a function of time
(see [2]): while ext4 offers fairly steady performance in terms of
files removed per second - the plot is mostly a straight line - btrfs
clearly accelerates towards the end. On ext4 a single unlink()
syscall may pause for at most tens of seconds, and even that is rare,
while on btrfs the process frequently stalls for minutes at a time,
blocked in unlink(). So while btrfs is faster in terms of total time,
it can create huge latencies for the calling process.
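One simple way to observe such per-call stalls is to time each
unlink()/rmdir() from the removing process itself. The sketch below is
only an illustration of that idea (the latency threshold and the
command-line path are made up); it is not the instrumentation actually
used for the plots in [2]:

  # Sketch: remove a directory tree bottom-up, timing every unlink()
  # and rmdir() and reporting calls that stall longer than a threshold.
  import os
  import sys
  import time

  THRESHOLD = 1.0   # seconds; report syscalls slower than this

  def timed(call, path):
      t0 = time.time()
      call(path)
      dt = time.time() - t0
      if dt > THRESHOLD:
          print("%8.1f s  %s" % (dt, path))
      return dt

  def remove_tree(root):
      files = 0
      in_unlink = 0.0
      for dirpath, dirnames, filenames in os.walk(root, topdown=False):
          for name in filenames:
              in_unlink += timed(os.unlink, os.path.join(dirpath, name))
              files += 1
          for name in dirnames:
              path = os.path.join(dirpath, name)
              # Symlinks to directories appear in dirnames; unlink those.
              if os.path.islink(path):
                  timed(os.unlink, path)
              else:
                  timed(os.rmdir, path)
      os.rmdir(root)
      print("%d files removed, %.1f s spent in unlink()"
            % (files, in_unlink))

  if __name__ == "__main__":
      remove_tree(sys.argv[1])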
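As a small aside on the "avg ± 95%" columns in the tables above: under
the normality assumption described earlier, such an interval for the
mean is commonly computed along the lines of the sketch below. This
uses the standard Student's t formula with made-up sample values and a
scipy dependency; it is not necessarily the exact calculation used in
[2] and will not reproduce the numbers above digit for digit:

  # Sketch: mean, sample standard deviation and a 95% confidence
  # interval for the mean of a small sample, assuming the times are
  # (roughly) normally distributed.  Sample values are made up.
  import math
  from statistics import mean, stdev   # stdev = sample standard deviation
  from scipy.stats import t            # Student's t quantiles

  samples = [19500.0, 20400.0, 18800.0, 21700.0, 20100.0]  # fake data

  n = len(samples)
  avg = mean(samples)
  s = stdev(samples)                                # "stddev"
  half = t.ppf(0.975, n - 1) * s / math.sqrt(n)     # "avg +/- 95%"

  print("average    %10.2f" % avg)
  print("stddev     %10.2f  (%.3f%%)" % (s, 100.0 * s / avg))
  print("avg +- 95%% %10.2f  (%.3f%%)" % (half, 100.0 * half / avg))
  print("avg min    %10.2f" % (avg - half))
  print("avg max    %10.2f" % (avg + half))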
One thing that I have figured out affects the rm time on ext4 is the
order in which rm happens to remove the directories, i.e. the pool
first or the backuppc/pc directory first. Removing the pool first
seems to make the rest of the operation faster. This is visible in
the plots of the number of removed files as a function of time as a
steep, short curve either at the beginning of the operation (pool
removed first) or at the end (pool removed last). [2]

On XFS, extracting the archive took more than 16 hours (58548
seconds; only one sample), so XFS performs rather poorly in this, I
assume, quite pathological case. On JFS, extraction took more than 19
hours (69365 seconds) and rm took 52702 seconds. On zfs-fuse I did
not wait for the extraction to finish, but a rough guesstimate is
that it would have taken slightly over a week.

The one filesystem that really beat every other in this particular
benchmark was nilfs2. On it, the metadata-heavy hard link extraction
phase was only slightly slower than the data-heavy pool extraction
phase. While on ext4 the data-heavy part typically took maybe 1.3
hours and the rest took 3-4 hours, on nilfs2 the data-heavy part took
2 hours and the metadata-heavy, typically pathological part only 15
minutes. Also, rm took only 40 minutes (2419 seconds; again only one
sample). ***Like btrfs, nilfs2 too failed to support enough hard
links to contain the directory structure***. Again, most files were
correctly extracted, so I don't believe that had a major impact on
the times.

See the graphs, data and further analysis at [2].

	Sami

[1] http://backuppc.sourceforge.net/
[2] http://www.hut.fi/~sliedes/fsspeed/fsspeed.html