'Re: [webkit-dev] Iterating SunSpider'

List:       webkit-dev
Subject:    Re: [webkit-dev] Iterating SunSpider
From:       Peter Kasting <pkasting () google ! com>
Date:       2009-07-08 0:42:22
Message-ID: d62cf1d10907071742w761a07bcu8fefc4f885d92b0d () mail ! gmail ! com

On Tue, Jul 7, 2009 at 5:08 PM, Maciej Stachowiak <mjs@apple.com> wrote:

> On Jul 7, 2009, at 4:19 PM, Peter Kasting wrote:
>
>> For example, the framework could compute both sums _and_ geomeans, if
>> people thought both were valuable.
>>
>
> That's a plausible thing to do, but I think there's a downside: if you make
> a change that moves the two scores in opposite directions, the benchmark
> doesn't help you decide if it's good or not. Avoiding paralysis in the face
> of tradeoffs is part of the reason we look primarily at the total score, not
> the individual subtest scores. The whole point of a meta-benchmark like this
> is to force ourselves to simplemindedly look at only one number.

Yes, I originally had more text like "deciding how to use these scores would
be the hard part", and this is precisely why.

I suppose that if different vendors wanted to use different criteria to
determine what to do in the face of a tradeoff, the benchmark could simply
be a data source, rather than a strong guide.  But this would make it
difficult to use the benchmark to compare engines, which is currently a key
use of SunSpider (and is a key failing, IMO, of frameworks like Dromaeo that
don't run identical code on every engine [IIRC]).
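To make the sum-versus-geomean tradeoff above concrete, here is a minimal sketch. The two subtest timings are made up for illustration and are not SunSpider data; `total` and `geomean` are hypothetical helper names, not part of any real harness. It shows one change that improves the sum while making the geomean worse.

```python
import math

def total(times_ms):
    """Sum of subtest times; dominated by the slowest subtests."""
    return sum(times_ms)

def geomean(times_ms):
    """Geometric mean of subtest times; each subtest's ratio counts equally."""
    return math.exp(sum(math.log(t) for t in times_ms) / len(times_ms))

# Hypothetical timings (ms) for two subtests, before and after a change.
before = [10.0, 100.0]
after = [20.0, 85.0]   # fast test got 2x slower, slow test got 15% faster

sum_delta = total(after) - total(before)      # negative means "faster"
geo_delta = geomean(after) - geomean(before)  # positive means "slower"

print(f"sum:     {total(before):.1f} -> {total(after):.1f}")
print(f"geomean: {geomean(before):.2f} -> {geomean(after):.2f}")
```

With these numbers the sum says the change is a win and the geomean says it is a loss, which is exactly the situation where reporting both gives no guidance.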

> I think there's one way in which sampling the Web is not quite right. To
> some extent, what matters is not average density of an operation but peak
> density. An operation that's used a *lot* by a few sites and hardly used by
> most sites, may deserve a weighting above its average proportion of Web use.

If I understand you right, the effect you're noting is that speeding up
every web page by 1 ms might be a larger net win but a smaller perceived win
than speeding up, say, Gmail alone by 100 ms.

I think this is true.  One way to capture this would be to say that at least
part of the benchmark should concentrate on operations that are used in the
inner loops of any of n popular websites, without regard to their overall
frequency on the web.  (Although perhaps the two correlate well and there
aren't a lot of "rare but peaky" operations?  I don't know.)
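As a rough illustration of the average-versus-peak distinction, the sketch below ranks operations two ways. The operation names and per-site usage shares are invented for the example, and `average_density` and `peak_density` are hypothetical names; no real Web sample is implied.

```python
# Hypothetical share of script time each operation takes on four sites.
usage = {
    "op_common": [0.05, 0.05, 0.05, 0.05],  # modest use everywhere
    "op_peaky":  [0.00, 0.00, 0.00, 0.12],  # heavy use on a single site
}

def average_density(shares):
    """Mean share across all sampled sites."""
    return sum(shares) / len(shares)

def peak_density(shares):
    """Largest share on any single sampled site."""
    return max(shares)

def rank(metric):
    """Operation names ordered from most to least important under `metric`."""
    return sorted(usage, key=lambda op: metric(usage[op]), reverse=True)

by_average = rank(average_density)
by_peak = rank(peak_density)

print("by average:", by_average)
print("by peak:   ", by_peak)
```

Here the two rankings disagree, so a benchmark weighted purely by average use would underweight the operation that dominates one site's inner loop. Whether real operations behave like this is the open question raised above.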

> - GC load

I second this.  As people use more tabs and larger, more complex apps, the
performance of an engine under heavier GC load becomes more relevant.

> It would be good to know what other things should be tested that are not
> sufficiently covered.

I think DOM bindings are hard to test and would benefit from benchmarking.
No public benchmarks seem to test these well today.

> * - For example, Mozilla's TraceMonkey effort showed relatively little
> improvement on the V8 benchmark, even though it showed significant
> improvement on SunSpider and other benchmarks. I think TraceMonkey speedups
> are real and significant, so this would tend to undermine my confidence in
> the V8 benchmark's coverage.

I agree that the V8 benchmark's coverage is inadequate and that the example
you mention illuminates that, because TraceMonkey definitely performs better
than SpiderMonkey in my own usage.  I wonder if there may have been an
opposite effect in a few cases where benchmarks with very simple tight loops
improved _more_ under TM than "real-world code" did, but I think the answer
to that is simply that benchmarks should be testing both kinds of code.

PK

_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
