
List:       gentoo-desktop
Subject:    [gentoo-desktop] Re: @system and parallel merge speedup
From:       Duncan <1i5t5.duncan () cox ! net>
Date:       2012-10-22 6:26:05
Message-ID: pan.2012.10.22.06.26.06 () cox ! net

Alex Efros posted on Sun, 21 Oct 2012 16:24:32 +0300 as excerpted:

> Hi!
> 
> On Sun, Oct 21, 2012 at 08:02:47AM +0000, Duncan wrote:
>> Bottom line, an empty @system set really does make a noticeable
>> difference in parallel merge handling, speeding up especially
>> --emptytree @world rebuilds but also any general update that has a
>> significant number of otherwise @system packages and deps,
>> dramatically.  I'm happy. =:^)
> 
> I think the "@system first" and "@system not merged in parallel" rules
> are safe to break when you're just doing "--emptytree @world" on an
> already updated OS, because it only rebuilds existing packages, and every
> package will see the same set of other packages (including the same
> versions) while compiling. But when upgrading multiple packages
> (including some from the original @system and some from @world) this
> probably may result in bugs.

In theory, you're right.  In practice, I've not seen it yet, tho being 
cautious I'd say it needs at least six months of testing (I've only been 
testing it about a month, maybe six weeks) before I can say for sure.  
It /was/ something I was a bit concerned about, however.

That was in fact one of the reasons I decided to try it on the netbook's 
chroot as well, which hadn't been upgraded in a year and a half.  I 
figured if it could work reasonably well there, the chances of an 
undiscovered real problem were much lower.
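For anyone wanting to try the same thing, the mechanism I used is portage's 
per-machine profile override.  A minimal sketch (the package list here is 
abbreviated and purely illustrative; check your own profile's packages file 
for the actual @system entries):

```
# /etc/portage/profile/packages
# Entries here stack on top of the profile's packages file.  A leading
# "-*" removes a package from the system set, so portage treats it as an
# ordinary dependency and can schedule its merge in parallel.
-*sys-apps/grep
-*sys-apps/sed
-*sys-devel/patch
# ...one line per @system entry you want demoted to @world...
```

See portage(5) for the stacking rules, and obviously keep backups: an 
emptied @system also removes the extra unmerge safety net discussed below.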

However, it /is/ worth noting that as a matter of course, I already often 
choose to do some system-critical upgrades (portage, gcc, glibc, openrc, 
udev) on their own, before doing the general upgrades, in part so I can 
deal with their config file changes and note any problems right away, 
with a relatively small changeset to deal with, as opposed to having a 
whole slew of updates including critical system package updates happen 
all at once, thus making it far more difficult to trace which update 
actually broke things.

That's where the years of gentoo experience I originally mentioned come 
in.  This isn't going to be as easy for a gentoo newbie, for at least two 
reasons.  First, they're less likely to know what packages really /are/ 
system critical, and thus are more likely to unmerge them without the 
extra unmerge warning a package in the system set gets.  (I mentioned 
that one in the first post.)  Second, spotting critical updates in the 
initial --pretend run -- knowing which packages it's a good idea to 
upgrade first, by themselves, dealing with config file updates, etc, for 
just that critical package (and any dependency updates it might pull in), 
before going on to the general @world upgrade -- probably makes a good bit 
of difference in practice, and gentoo newbies are rather less likely to be 
able to make that distinction.  (I didn't specifically mention that one 
until now.)
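To make that --pretend triage concrete, here's a trivial sketch of the 
kind of filtering involved (the sample output lines and the "critical" 
pattern are made-up illustrations, not a canonical list):

```shell
# Hypothetical example: filter saved `emerge --pretend --update --deep
# --newuse @world` output for packages worth merging first, by themselves.
pretend_output='[ebuild     U ] sys-devel/gcc-4.6.3
[ebuild     U ] app-misc/screen-4.0.3
[ebuild     U ] sys-libs/glibc-2.15-r3
[ebuild     U ] sys-apps/portage-2.1.11.31'

# Adjust to whatever you consider system-critical.
critical='sys-devel/gcc|sys-libs/glibc|sys-apps/portage|sys-apps/openrc'

# Print only the critical lines.
echo "$pretend_output" | grep -E "$critical"
```

Anything that matches gets an `emerge --oneshot` of its own, plus 
config-file handling, before the general @world run.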

> As for the "--emptytree @world" speedup, can you provide benchmark
> numbers? I mean, only a few packages are forced to use a single CPU core
> while compiling.
> So, merging packages in parallel may save some time, mostly on the
> unpack/prepare/configure/install/merge phases. All of them except
> configure actually do a lot of I/O, which most likely loses more speed
> than it gains when done in parallel (especially keeping in mind kernel
> bug 12309). So, at a glance, the time you may win on configure you'll
> mostly lose on I/O, and most of the time all your CPU cores will be
> loaded anyway while compiling, so doing configure in parallel with
> compiling is unlikely to save much time. This is why I think that without
> actual benchmarking we can't be sure how much faster it became (if it
> became faster at all, which is questionable).

Good points, and no, I can't easily provide benchmarks, both because of 
the recent hardware upgrade here, and because portage itself has been 
gradually improving its parallel merging abilities -- a recent update 
changed the scheduling algorithm so it starts additional merges much 
sooner than it did previously.  (See gentoo bug 438650, fixed in portage 
2.1.11.29 and 2.2.0_alpha140, both released on Oct 17.  That I know about 
that bug at all hints at another thing I do routinely as an experienced 
gentooer: I always read portage's changelog and check out any referenced 
bugs that look interesting, before I upgrade portage.  To the extent 
practical without actually reading the individual git commits, I want to 
know about package manager changes that might affect me BEFORE I do that 
upgrade!)

But, I believe as core-counts rise, you're underestimating the effects of 
portage's parallel merging abilities.  In particular, a lot of packages 
normally in @system (or deps thereof) are relatively small packages such 
as grep, patch, sed... where the single-threaded configure step takes a 
MUCH larger share of the total package merge time than it does with 
larger packages.  Similarly, the unpack and prepare phases, plus the 
package phase for folks using FEATURES=buildpkg, tend to be 
single-threaded.[1]

Thus, instead of serializing several dozen small mostly single-threaded 
package merges for packages like grep/sed/patch/util-linux/etc, depending 
on the --jobs and --load-average numbers you feed to portage, several of 
these end up getting done in parallel, with the portage multi-job output 
bumping a line every few seconds because it's doing them in parallel, 
instead of every minute or so, because it's doing one at a time.
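For reference, the knobs controlling this live in make.conf.  This 
fragment just mirrors the numbers I quote further down, sized for my 
6-core/16-gig box, so treat the values as examples to tune, not 
recommendations:

```shell
# /etc/portage/make.conf (fragment)
# --jobs: maximum number of package merges portage runs in parallel.
# --load-average: don't start another merge above this load average.
EMERGE_DEFAULT_OPTS="--jobs=12 --load-average=12"
# Per-package make parallelism, similarly capped by load.
MAKEOPTS="-j20 -l15"
```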

Meanwhile, it should be obvious, but it's worth stating anyway: the 
effect gets *MUCH* bigger as the number of cores increases.  For a dual-
core, bah, not worth the trouble, as it could cause more problems than it 
solves, especially if people are trying to work on other things while 
portage is doing its thing in the background.  I suspect the break-over 
point is either triple-core or quad-core.  One of the reasons portage is 
getting better lately is that someone with a 32-core machine, and a 
corresponding amount of memory (64 or 128 gig IIRC), has taken an interest.

It's worth noting, as I mentioned, that I now have a 6-core, recently 
upgraded from a dual-dual-core (4 cores), with a corresponding memory 
upgrade, to 16 gigs.

One of the first things I noticed doing emerges was how much more 
difficult it was to keep the 6-core actually peaked out to 100% CPU than 
it had been on the 4-core.  While I suspect there would have been a 
difference on the quad-core (as I said I believe the break-over's 
probably 3-4 cores), it wasn't a big deal there.  Staring at that 6-core 
running at 100% on 1-2 cores CPU-freq-maxed at 3.6 GHz, while the other 
4-5 cores remained near idle at <20% utilization at CPU-freq-minimum 1.4 
GHz... was VERY frustrating.  So began my drive to empty @system and get 
portage properly scheduling parallel merges for former @system packages 
and their deps as well!

For the quad-core plus hyperthreading (thus 8 threads, I take it?) you 
mention below (4.6 GHz OC, nice! I see stock is 3.4 GHz), the boost from 
killing @system's forced serialization should definitely make a difference 
(unless the hyperthreading doesn't do much for that workload, making it 
effectively no better than a non-hyperthreaded quad-core).  It made a 
rather big difference for my 6-core, and I guarantee that if you had the 
32-core that one of the devs working on improving portage's 
parallelization has, you'd be hot on the trail to improve it as well!

> As for me, I found a very effective way to speed up emerge: upgrading
> from a Core2Duo E6600 to an i7-2600K overclocked to 4.6GHz. This sped up
> compilation on my system about 6x (the kernel now compiles in just 1
> minute). And to speed up most other (non-compilation) portage operations
> I use a 4GB tmpfs mount on /var/tmp/portage/.

I remember reading about the 1-minute kernel compiles on i7s.  Very 
impressive.

FWIW, there are a lot of variables to fill in before we can be sure 
kernel build time comparisons are apples to apples (I had several more 
paragraphs written on that, but decided it was a digression too far for 
this post, so deleted 'em).  But AFAIK, when I read about it (on phoronix 
I believe), he was doing an all-yes config, so building rather more than 
a typical customized-config gentooer, but was using a rather fast SSD, 
which probably improved his times quite a bit compared to "spinning rust".

But I don't know if his timings included the actual compress (and if so 
with what CONFIG_KERNEL_XXX compression option) and I don't believe they 
included the actual install, only the build.

That said, a 1-minute all-yes-config kernel build time is impressive 
indeed, the envy of many, including me.  (OTOH, my fx6100 was on sale for 
$100, $109 post-tax.  That's lower than pricewatch's $118 lowest quote 
(shipped, no tax), and only about 40% of the $273 low quote for an 
i7-2600k.)

My build, compress (CONFIG_KERNEL_XZ) and install runs ~2 minutes 
(1:58-2:07, 10+ runs, warm-cache), so yes, even if your build time 
doesn't include compress and install (which it might), 1 minute is still 
VERY impressive.  Tho as I said, my CPU cost ~40% of the going price on 
yours, so...
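If anyone wants to reproduce that kind of number, the timing loop itself 
is trivial.  Here's a sketch with a placeholder standing in for the real 
build command (substitute your actual `make -jN` run in /usr/src/linux, 
and throw away the first, cold-cache, run):

```shell
# Hypothetical timing harness.  BUILD is a stand-in; on my system it
# would be the kernel build itself, e.g. `make -j20`.
BUILD="sleep 0"

for run in 1 2 3; do
    start=$(date +%s)
    $BUILD
    end=$(date +%s)
    echo "run $run: $((end - start))s"   # wall-clock seconds per run
done
```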

Meanwhile...

I too use and DEFINITELY recommend a tmpfs $PORTAGE_TMPDIR.  I'm running 
16 gig RAM here, and didn't want to run out of room with parallel builds, 
so set a nice roomy 12G tmpfs size.
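The corresponding fstab entry is one line.  The 12G size is what I use 
with 16 gigs of RAM, so scale it to your own memory; the mode shown is an 
assumption matching the usual portage-owned directory, adjust as needed:

```
# /etc/fstab (fragment)
tmpfs   /var/tmp/portage   tmpfs   size=12G,mode=775   0 0
```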

A $PORTAGE_TMPDIR on tmpfs also reduces the I/O.  At least here, the only 
time I've had problems, both on the old hardware and on the new, is when 
I go into swap.  (And on the old hardware I had swap striped across four 
disks via equal priority= settings, plus 4-way md/raid0, so the kernel 
could schedule swap-out vs read-in much better, and I didn't see a 
problem until I hit nearly half a gig of swap loading at once; the new 
hardware is only single-disk ATM, and I see issues starting at 80 meg or 
so of swap loading at once.)  
But with 16 gig RAM on the new system, the only time I see it go into 
swap is when I run a kernel build with uncapped -j, thus hitting 500+ 
jobs and close enough to 16 gigs that whether I hit swap or not depends 
on what else I've been doing with the system.

Basically, I/O is thus not a problem at all with portage, here, up to the
--jobs=12 --load-average=12 along with MAKEOPTS="-j20 -l15" I normally 
run, anyway.  On the old system with only six gigs of RAM, if I tried 
hard enough I could get portage to hit swap there, but I limited --jobs 
and MAKEOPTS until that wasn't an issue, and had no additional problems.

Tho I should mention I also run PORTAGE_NICENESS=19 (and my kernel-build/
install script similarly renices itself to 19 before starting the kernel 
build), which puts it in batch-scheduling mode (idle-only scheduling, but 
longer timeslices).
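In make.conf terms that's a single line (the batch-scheduling observation 
above is my own; the documented effect of the variable is just the 
niceness):

```shell
# /etc/portage/make.conf (fragment)
# Portage renices itself (and thus all its build jobs) to this value.
PORTAGE_NICENESS=19
```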

If it matters, filesystem is reiserfs, iosched is cfq, drive is sata2/ahci 
(amd 990fx/sb950 chipset) 2.5" seagate "spinning rust".

But I definitely agree with $PORTAGE_TMPDIR on tmpfs.  It makes a HUGE 
difference!

---
[1] Compression parallelism:  There are parallel-threaded alternatives to 
bzip2, for instance, but they have certain down-sides, like decompression 
only being parallel when the tarball was compressed with the same 
parallel tool, and certain compression-buffer nul-fill handling 
differences that keep them from being functionally perfect drop-in 
replacements.  See the recent discussion on the topic on the gentoo-dev 
list, for instance.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


