
List:       zlib-devel
Subject:    [Zlib-devel] [7/8][RFC V3 Patch] Blackfin implementation
From:       kaffeemonster@googlemail.com (Jan Seiffert)
Date:       2011-05-09 17:42:05
Message-ID: BANLkTi=vzLoL3tVjAqodmWq6twjs+7B9fA@mail.gmail.com

2011/5/9 Mike Frysinger <vapier at gentoo.org>:
> here's some feedback i got internally:
> 

Ahh, very nice!

> I spent a little more time on it today, but the code looks pretty well
> optimized. Actually this is an impressive piece of work.
> 

*blush*
Thanks

> As long as the input data is cached, this code is going to be run pretty
> quickly for longer vector lengths; the inner loop is pretty tight.
> 
[snip - code]
> 
> 
> If the 'buf' (I0) input data is not cached, the speed is going to be
> bound by the speed of the external memory interface. The inner loop is
> 13 instructions with two 32-bit fetches from I0. Depending on the
> SCLK:CCLK ratio, and the memory type, the core will be waiting around
> for completion of memory fetches.

Yes, that was the problem I also identified, but couldn't find a solution for.
I can not move the data fetches early enough (too few registers), so
instead I went and interleaved the vord_* fetches as aggressively as
possible.
To eliminate the vord_* fields, to get more registers for early
fetches, I would need to calculate some other multiplication constant,
but that adds more instructions to the inner loop, which makes it slower
overall, even if the data fetches then happen early.

> In that case all this optimization work will not be effective.
> 

The serial code also has to wait for the memory fetches ;)
But see further down.

> If zlib has very recently read/written the input data,

On compression, adler32 often touches the data first (unless the data is
program-generated, but even then it is often memcpy'd into the buffer by
the kernel); on decompression, adler32 touches the data last.

> and write-back
> cache is turned on, and the data block size is less than (say) 16kB then
> the speed up could be realized. I guess taking benchmarks of zlib is
> what you are doing?
> 

Not the complete zlib code base (the most easily accessible program is
minigzip, which creates gzip archives, which use crc32 if I'm not
mistaken). So I had to create test code anyway and concentrated on
pure adler32 testing (so I can see little differences better, esp. on
other archs). It does most of the tests you suggest later.

I am doing them with caches turned on and with a block size of 160kB,
which is larger than the cache, and still see a speedup factor of 1.79.
Or are the caches prefetching in the background in some way?
If I limit the size to 16kB I see a speedup factor of 2.25.

> vord_e and vord_o (I2, I3) should be ideally placed so they are in
> separate cache banks or else this is going to stall a lot.
> 

Yes. Then the speedup factor at 160kB goes to 1.84.
But I left it out of the generic code because:
a) I had little problems with different compiler versions/binary formats
(not all of them (esp. proprietary ones) support placing the constants)
b) Space in the cache banks is a scarce resource, so I left that for
people who know they have some space left there.

[snip]
> I am not sure of the need to set the circular buffer registers for I0?
> Unless it is for added security against bugs/buffer overflows...

To be honest, I was guided by the misconception that addressing over
In would not work when those registers are not properly set up. Or,
more correctly, I'm still a little fuzzy on it. For example, the ABI
says that on function entry/exit the length registers have to be
cleared, but AFAIK not the base registers. But as soon as the base
registers are set, doesn't that trigger circular addressing? Even with
len=0?
At least with that circular addressing the pointer never wandered into
lala land during development.


> 
>    e:  12 32           P2 = R2;      /* len parameter */
>
>   22:  01 34           I0 = R1;      /* buf parameter */
>   24:  01 36           B0 = R1;
>   26:  4a 32           P1 = P2;
>
>   2c:  21 6c           P1 += 0x4;              /* (  4) */    // Why add four here? \
>  B/c of disaligned fetch???

Yes, to avoid wrapping even with the one fetch past the end, which is
needed in the unaligned case. I wasn't quite sure which bytes would be
fetched when the wrap happens.

>   34:  61 36           L0 = P1;
> 
> Testing
> ----------
> 
> The important one is testing. The author should compare results with
> the reference code in a test shell.
> 

Done
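For the record, the oracle I check against is essentially the classic byte loop (a minimal sketch; the modulo is taken every byte for clarity, while real zlib defers it over NMAX-sized chunks):

```c
#include <stddef.h>
#include <stdint.h>

#define BASE 65521u /* largest prime smaller than 65536 */

/* Byte-at-a-time adler32, used as the known-good reference when
 * comparing against a vectorized implementation. */
static uint32_t adler32_ref(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;

    while (len--) {
        s1 = (s1 + *buf++) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}
```

Anything the vector code produces has to match this for every input, alignment and length thrown at it.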

> Here is the API function.
> 
> local noinline uLong adler32_vec(adler, buf, len)
> ? ? uLong adler;
> ? ? const Bytef *buf;
> ? ? uInt len;
> {
> 

inline asm ftw, not my problem, the compiler does that.

> Some tests should include
> 
> a/ changing alignment of buf, to test various combinations.

The different lines in the speed check are for ptr + offset, len - offset.
So the tested alignments are:
0   (often matches the malloc alignment, often 64-bit aligned)
+1 (one off, odd)
+2 (which then happens to be 16-bit aligned)
+4 (which then happens to be 32-bit aligned)
+8 (which then happens to be 128-bit aligned)
+7 with len-16 (which is again oddly aligned, gives a worst case for the
start condition on other impl. (only 1 byte of productive work), and
the -16 gives an end length that is not evenly divisible)


> b/ len: various 0..VNMAX, >> VNMAX

The code isn't safe for 0 and other small values, deliberately; that
should be caught by the calling code, which dispatches to different
versions for different lengths. This is the _vec sub-function, only
activated for lengths > MIN_WORK, which is 16 for Blackfin.
The normal test length is 160kB, which is >> VNMAX (at least for me).
Just for kicks I now tested with 8MB of 0xff; looks good, speedup
factor still 1.79.
Lengths from 16 (the original zlib MIN_WORK) upward in powers of two
are tested to find the perfect MIN_WORK, which was again 16 for Blackfin.
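The dispatch I'm describing, in rough C (a sketch; MIN_WORK matches the Blackfin value above, but the function names are illustrative and a plain byte loop stands in for the vector version):

```c
#include <stddef.h>
#include <stdint.h>

#define BASE 65521u
#define MIN_WORK 16 /* below this, vector setup costs more than it saves */

/* plain byte loop, safe for any length including 0 */
static uint32_t adler32_common(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff, s2 = (adler >> 16) & 0xffff;
    while (len--) {
        s1 = (s1 + *buf++) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}

/* the vectorized routine would go here; it may assume len > MIN_WORK */
static uint32_t adler32_vec(uint32_t adler, const uint8_t *buf, size_t len)
{
    return adler32_common(adler, buf, len); /* stand-in */
}

/* entry point: small buffers never reach the vector code */
uint32_t adler32_dispatch(uint32_t adler, const uint8_t *buf, size_t len)
{
    if (len <= MIN_WORK)
        return adler32_common(adler, buf, len);
    return adler32_vec(adler, buf, len);
}
```

So len = 0, 1, ... up to MIN_WORK are handled by code that has no alignment or minimum-length assumptions at all.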

> c/ fill buffer with random data, all 0's, all 0xffffffff etc

Random and 0xff are tested; all 0 was not, I should do that, thanks for
the hint! And it looks good.

> d/ 'adler' input vary with various input values, 0, 0xffffffff, random
> 
Different input values are also tested with the different alignments,
but not 0xffffffff. I should do that.
And that also looks good.

> All I can think of for now.
> 

Thanks for all your input!

Greetings
Jan

