From linux-kernel Mon Aug 23 15:41:14 2004
From: Jens Axboe
Date: Mon, 23 Aug 2004 15:41:14 +0000
To: linux-kernel
Subject: Re: Kernel 2.6.8.1: swap storm of death - CFQ scheduler=culprit
Message-Id: <20040823154113.GZ2301@suse.de>
X-MARC-Message: https://marc.info/?l=linux-kernel&m=109327647405170

On Mon, Aug 23 2004, Marcelo Tosatti wrote:
> On Sun, Aug 22, 2004 at 09:18:51PM +0200, Karl Vogel wrote:
> > When using elevator=as I'm unable to trigger the swap of death, so it
> > seems that the CFQ scheduler is to blame here.
> >
> > With the AS scheduler, the system recovers in +-10 seconds; vmstat output
> > during that time:
> >
> > procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> >  r  b   swpd   free  buff cache   si     so   bi     bo   in   cs us sy  id wa
> >  1  0      0 295632 40372 49400   87    278  324    303 1424  784  7  2  78 13
> >  0  0      0 295632 40372 49400    0      0    0      0 1210  648  3  1  96  0
> >  0  0      0 295632 40372 49400    0      0    0      0 1209  652  4  0  96  0
> >  2  0      0 112784 40372 49400    0      0    0      0 1204  630 23 34  43  0
> >  1  9 156236    788   264  8128   28 156220 3012 156228 3748 3655 11 31   0 59
> >  0 15 176656   2196   280  8664    0  20420  556  20436 1108  374  2  5   0 93
> >  0 17 205320    724   232  7960   28  28664  396  28664 1118  503  7 12   0 81
> >  2 12 217892   1812   252  8556  248  12584  864  12584 1495  318  2  7   0 91
> >  4 14 253268   2500   268  8728  188  35392  432  35392 1844  399  3  7   0 90
> >  0 13 255692   1188   288  9152  960   2424 1408   2424 1173 2215 10  5   0 85
> >  0  7 266140   2288   312  9276  604  10468  752  10468 1248  644  5  5   0 90
> >  0  7 190516 340636   348  9860 1400      0 2016      0 1294  817  4  8   0 88
> >  1  8 190516 339460   384 10844  552      0 1556      4 1241  642  3  1   0 96
> >  1  3 190516 337084   404 11968 1432      0 2576      4 1292  788  3  1   0 96
> >  0  6 190516 333892   420 13612 1844      0 3500      0 1343  850  5  2   0 93
> >  0  1 190516 333700   424 13848  480      0  720      0 1250  654  3  2   0 95
> >  0  1 190516 334468   424 13848  188      0  188      0 1224  589  3  2   0 95
> >
> > With CFQ, processes got stuck in 'D' state and never left it. See the
> > URLs in my initial post for diagnostics.
>
> I can confirm this on a 512MB box with 512MB swap (2.6.8-rc4). Using CFQ
> the machine swaps out 400 megs, with AS it swaps out 30M.
>
> That leads to allocation failures, etc.
>
> CFQ allocates a huge number of bio/biovecs:
>
> cat /proc/slabinfo | grep bio
> biovec-(256)         256    256   3072    2    2 : tunables   24   12    0 : slabdata    128    128      0
> biovec-128           256    260   1536    5    2 : tunables   24   12    0 : slabdata     52     52      0
> biovec-64            265    265    768    5    1 : tunables   54   27    0 : slabdata     53     53      0
> biovec-16            260    260    192   20    1 : tunables  120   60    0 : slabdata     13     13      0
> biovec-4             272    305     64   61    1 : tunables  120   60    0 : slabdata      5      5      0
> biovec-1          121088 122040     16  226    1 : tunables  120   60    0 : slabdata    540    540      0
> bio               121131 121573     64   61    1 : tunables  120   60    0 : slabdata   1992   1993      0
>
> biovec-(256)         256    256   3072    2    2 : tunables   24   12    0 : slabdata    128    128      0
> biovec-128           256    260   1536    5    2 : tunables   24   12    0 : slabdata     52     52      0
> biovec-64            265    265    768    5    1 : tunables   54   27    0 : slabdata     53     53      0
> biovec-16            258    260    192   20    1 : tunables  120   60    0 : slabdata     13     13      0
> biovec-4             257    305     64   61    1 : tunables  120   60    0 : slabdata      5      5      0
> biovec-1           66390  68026     16  226    1 : tunables  120   60    0 : slabdata    301    301      0
> bio                66389  67222     64   61    1 : tunables  120   60    0 : slabdata   1102   1102      0
>
> (These are freed later on, but they are the cause of the thrashing during
> the swap IO.)
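
(Aside: on 2.6.8 the scheduler choice discussed above is made at boot via the
elevator= kernel command-line parameter, e.g. elevator=as or elevator=cfq. A
rough sketch for watching the bio/biovec counters quoted above while the swap
load runs is the loop below; the one-second interval and the awk field
positions, which assume the 2.6 /proc/slabinfo column order shown above, are
illustrative rather than tooling from this thread:)

    #!/bin/sh
    # Sample active/total object counts for the bio and biovec slab caches
    # once a second; fields $2 and $3 are active_objs and num_objs in the
    # 2.6 slabinfo format.
    while true; do
            date '+%T'
            grep -E '^(bio|biovec)' /proc/slabinfo | \
                    awk '{ printf "%-14s active=%-8s total=%s\n", $1, $2, $3 }'
            sleep 1
    done

Running it next to "vmstat 1" lines the slab growth up with the si/so columns.
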
> While AS does:
>
> [marcelo@yage marcelo]$ cat /proc/slabinfo | grep bio
> biovec-(256)         256    256   3072    2    2 : tunables   24   12    0 : slabdata    128    128      0
> biovec-128           256    260   1536    5    2 : tunables   24   12    0 : slabdata     52     52      0
> biovec-64            260    260    768    5    1 : tunables   54   27    0 : slabdata     52     52      0
> biovec-16            280    280    192   20    1 : tunables  120   60    0 : slabdata     14     14      0
> biovec-4             264    305     64   61    1 : tunables  120   60    0 : slabdata      5      5      0
> biovec-1            4478   5424     16  226    1 : tunables  120   60    0 : slabdata     24     24      0
> bio                 4525   5002     64   61    1 : tunables  120   60    0 : slabdata     81     82      0
>
> The odd thing is that the 400MB swapped out is not reclaimed after exp (the
> 512MB callocator) exits. With AS, almost all swapped-out memory is
> reclaimed on exit.
>
>  r  b   swpd   free  buff cache   si     so   bi     bo   in   cs us sy  id wa
>  0  0 492828  13308   320  3716    0      0    0      0 1002    5  0  0 100  0
>
> Jens, is this huge amount of bio/biovec allocations expected with CFQ?
> It's really, really bad.

Nope, it's not by design :-)

A test case would be nice; then I'll fix it as soon as possible. But please
retest with 2.6.8.1, Marcelo; 2.6.8-rc4 is missing an important fix to
ll_rw_blk that can easily cause this. The first report is for 2.6.8.1,
though, so I'm more puzzled about that.

--
Jens Axboe
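
(Aside: as a stand-in for the "exp" allocator mentioned above, since a test
case is being asked for, something along these lines would dirty more
anonymous memory than a 512MB box has, hold it briefly, and exit, so swap-out
and reclaim-on-exit can be compared between an elevator=cfq boot and an
elevator=as boot. The 600MB size, the hold times and the perl one-liner are
illustrative choices, not the original test program:)

    #!/bin/sh
    # Log vmstat while ~600MB of anonymous memory is dirtied on a 512MB
    # machine, forcing the kernel to swap; keep logging for a while after
    # the hog exits so reclaim-on-exit behaviour is visible as well.
    vmstat 1 >> vmstat.log &
    VMSTAT_PID=$!
    perl -e '$hog = "x" x (600 * 1024 * 1024); sleep 10'   # touch and hold ~600MB
    sleep 15
    kill $VMSTAT_PID

Comparing vmstat.log (and the slab samples from the loop earlier) across the
two boots should show whether the bio/biovec growth tracks the swap-out.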