List:       djbdns
Subject:    Re: While we're at it
From:       Dean Anderson <dean@av8.com>
Date:       2010-04-08 22:05:13
Message-ID: Pine.LNX.4.44.1004081757540.6365-100000@citation2.av8.net

On Thu, 8 Apr 2010, James Sutherland wrote:

> 2010/4/7 Dean Anderson <dean@av8.com>:
> > On Mon, 5 Apr 2010, Colm MacCárthaigh wrote:
> (snip)
> > But a complete word is brought into L2 cache to read one byte. When
> > aligned, the next 3 tests access L2 cache at high speed and are
> > pipelined. A conditional jump to do one test at a time would slow it
> > down because the pipeline is flushed when the branch is taken.
> 
> L1 cache, not L2 - and the pipeline should only be flushed if the branch
> is mispredicted. Any half-decent branch predictor should know that the
> conditional branch back to check the next byte (JNZ in x86-speak) is
> taken most of the time, so the misprediction will only occur when
> breaking out of the loop to return the result.
> 
> On the original SPARC and MIPS static branch predictors, the first
> three lines of DJB's unrolled routine will be "predicted" correctly for
> the common case of a non-zero byte; only the loop back to the
> beginning will be mis-predicted. The later refinement, assuming
> that only backward branches are taken, will be correct for all except
> the final byte.

Very nice explanation. Thanks!
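
For anyone reading along without the source handy, the unrolled routine
under discussion is djbdns's str_len(); from memory it is shaped roughly
like the sketch below (my own paraphrase, not a verbatim copy, and the
function name is mine):

/* Sketch of a 4-way unrolled, byte-at-a-time length scan in the style
 * of djbdns's str_len() (paraphrased from memory, not verbatim).
 * Four independent zero tests per iteration mean only one backward
 * branch every four bytes; the exit branch is the one that gets
 * mispredicted when the terminator is finally found. */
unsigned int unrolled_strlen(const char *s)
{
  const char *t = s;

  for (;;) {
    if (!*t) return (unsigned int)(t - s); ++t;
    if (!*t) return (unsigned int)(t - s); ++t;
    if (!*t) return (unsigned int)(t - s); ++t;
    if (!*t) return (unsigned int)(t - s); ++t;
  }
}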

> > Sure. But one wants code that is fast in the most common case, and works
> > in the less common case.
> 
> When parsing HTTP requests or DNS packets, I would expect nearly 75% of
> the initial data to be non-word-aligned (87.5% on machines with 64 bit words).
> With tinydns, the static data.cdb content already contains the length of each
> object, so there shouldn't be any strlen-type calls there either.

DNS packets never require strlen when decoding, only when encoding, and
even then it depends on the RR representation in the server. HTTP is
known to be a cluster!@#$%.
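
To be concrete about the decoding side: names in a DNS packet are stored
as length-prefixed labels, so a parser just reads a length octet and
skips that many bytes; there is no terminator to scan for. A minimal
sketch (my own illustration; skip_name is a made-up helper, and
compression pointers are only skipped here, not followed):

#include <stddef.h>

/* Walks one wire-format name starting at pkt[pos]: each label is a
 * length octet followed by that many bytes, ended by a zero octet.
 * Returns the number of octets the name occupies, or 0 on a truncated
 * packet. Illustrative only, not a drop-in parser. */
static size_t skip_name(const unsigned char *pkt, size_t len, size_t pos)
{
  size_t start = pos;

  while (pos < len) {
    unsigned char labellen = pkt[pos++];
    if (labellen == 0)
      return pos - start;                  /* root label ends the name */
    if ((labellen & 0xc0) == 0xc0)         /* compression pointer: 2 octets */
      return (pos < len) ? pos + 1 - start : 0;
    if (pos + labellen > len)
      return 0;                            /* runs past end of packet */
    pos += labellen;                       /* explicit length, no strlen */
  }
  return 0;
}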

> With the simple byte-by-byte comparison, it doesn't really matter. I
> would not be at all surprised to find the extra overhead in the
> "clever" word-by-word version made it slower than the unrolled and
> branch-predicted byte approach.
>
> Of course, experimenting would be good - I'll see if I can find time to
> round up some actual figures, at least on x86 (-64) in the next few
> days. I would expect DJB did some profiling back when he first wrote
> that code and that's how he ended up with the 4 line unroll; I
> seem to recall he tended to check timing on SPARC and MIPS as well.

Thanks!
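
For comparison, the "clever" word-by-word version usually means the
familiar zero-byte-in-a-word bit trick. A rough 32-bit sketch of what
I'd expect that to look like (my own illustration; alignment prologue
included, strict-aliasing and portability caveats ignored):

#include <stddef.h>
#include <stdint.h>

/* Word-at-a-time length scan: align with byte tests, then test whole
 * 32-bit words with the (x - 0x01010101) & ~x & 0x80808080 zero-byte
 * detector. Assumes reading the aligned word containing the terminator
 * is acceptable; illustrative, not production code. */
static size_t word_strlen(const char *s)
{
  const char *p = s;

  /* Byte-by-byte until p is 4-byte aligned. */
  while ((uintptr_t)p & 3) {
    if (!*p) return (size_t)(p - s);
    ++p;
  }

  /* Word-by-word: the detector is nonzero iff the word has a zero byte. */
  const uint32_t *w = (const uint32_t *)p;
  for (;;) {
    uint32_t x = *w;
    if ((x - 0x01010101u) & ~x & 0x80808080u) {
      const char *q = (const char *)w;     /* locate the zero byte */
      while (*q) ++q;
      return (size_t)(q - s);
    }
    ++w;
  }
}

Whether the extra setup and the tail search pay for themselves is exactly
the sort of thing the profiling you mention would settle.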

		--Dean

-- 
Av8 Internet   Prepared to pay a premium for better service?
www.av8.net         faster, more reliable, better service
617 256 5494

