[prev in list] [next in list] [prev in thread] [next in thread]
List: haskell-cafe
Subject: Re: [Haskell-cafe] Splittable random numbers
From: Ryan Newton <newton () mit ! edu>
Date: 2011-01-31 6:25:34
Message-ID: AANLkTik+kgtFMQu7HdEYWHzyEt0fhp70-BG9SQgopUQ1 () mail ! gmail ! com
[Download RAW message or body]
[Attachment #2 (multipart/alternative)]
Hi Cafe,
I've included Gladman's efficient, portable C implementation of AES
generating random numbers now, and the hardware-accelerated version is
working too. I'm currently seeing higher throughput for even the software
based version than the builtin stdGen:
First, timing with System.Random interface:
13,051,964 random ints generated [System.Random stdGen] ~ 252
cycles/int
15,635 random ints generated [PureHaskell/reference] ~ 210,763
cycles/int
31,159 random ints generated [PureHaskell] ~ 105,757
cycles/int
2,180,488 random ints generated [Gladman inefficient] ~ 1,511
cycles/int
15,015,095 random ints generated [Gladman] ~ 219
cycles/int
That seems like a good argument for cryptographic RNGs to me!
I'm having a lot of trouble getting cabal to build/install it successfully.
You can see what I've got there now. I'd be interested to know if anyone
else can build it successfully. It should work -- but only by building the
assembly code into a .so and assuming the build directory is /opt/intel-aes
;-).
I don't have real numbers for the hardware version yet because the Westmere
machine I'm logged into is redhat 5.4 and is giving me "GLIBC_2.7 not found"
errors. You can run it for correctness purposes using an emulation tool
called sde (software development
emulator)<http://software.intel.com/en-us/articles/intel-software-development-emulator/>that's
based on dynamic binary translation.
-Ryan
P.S. Checkout command:
git clone git://github.com/rrnewton/intel-aes.git
On Sat, Jan 29, 2011 at 8:52 AM, Ryan Newton <rrnewton@gmail.com> wrote:
> perhaps performance? Is this approach less robust with a faster,
>> non-cryptographic RNG?
>>
>
> Yes, I don't understand that either. Is there a reason that using a weaker
> PRNG in this case is WORSE than using it in the non-splitting case? Is that
> why there is more of an impetus to use the cryptographic approach in this
> case?
>
> Anyway, taking for granted that the Burton approach is a useful thing to
> have implemented, I started developing a package for this stuff -- AES based
> RNG including both a haskell implementation and wrapping an AESNI-based C
> one . I haven't posted it to Hackage yet, but you can find the git
> repository here:
>
> https://github.com/rrnewton/intel-aes
>
> If you build with cabal and run the benchmark-intel-aes-rng executable, it
> will give you a breakdown like this:
>
> How many random numbers can we generate in a second on one thread?
> Cost of rdtsc (ffi call): 83
> Approx getCPUTime calls per second: 205,640
> Approx clock frequency: 3,306,891,339
> First, timing with System.Random interface:
> 193,178,901 random ints generated [constant zero gen]
> 14,530,358 random ints generated [System.Random stdGen]
> 16,346 random ints generated [BurtonGenSlow/reference]
> 32,965 random ints generated [BurtonGenSlow]
> Comparison to C's rand():
> 118,766,285 random ints generated [rand/store in C loop]
> 114,668,028 random ints generated [rand / Haskell loop]
> 114,675,116 random ints generated [rand/store Haskell]
>
> At the moment this is Haskell-only, I haven't included the wrapped Intel
> assembly code library yet. As you can see, the pure-Haskell AES based RNG
> (BurtonGenSlow) is pretty slow.
>
> Would anyone else be interested in running those RNG testing tools
> (diehard, big crush) on this to make sure it is working correctly?
>
> Also I'd be happy if anyone with a performance-oriented-eye would like to
> take a look at what's going on. Both for the sake of the serial performance
> (above) and because the parallel performance is currently *mysterious*(see below).
>
> I figure one of the main reasons for splittable RNG is deterministic
> parallel computations. Thus it's important that all threads be able to run
> the RNG efficiently. Right now, if you look at SimpleRNGBench.hs I'm just
> running the same RNG on numCapabilities threads. Yet with that simple test
> I'm running into problems, summarized thus:
>
> * I get substantial (3X) variance in program performance on consecutive
> runs.
> * I see a minor hit in performance adding -threaded, but going from -N1
> to -N4 (even with a single-thread of work) yields a big hit in performance
> and increase in variance.
> * -N4 with four actual threads of work is actually pretty good for the
> pure haskell version. All four threads on my nehalem 3.33ghz can maintain
> 93% of their throughput in the serial case. BUT the variance problem
> persists.
> * I run a busy-wait loop that measures cpu frequency... and this seems to
> get messed up in threaded mode (even with -qm -qa). I don't know why.
> * I cannot killThread a haskell thread (forkIO or forkOS) that is
> currently running a divergent FFI call (safe or unsafe). (See "time_c".)
>
> You can find the details in the DEVLOG here:
>
> https://github.com/rrnewton/intel-aes/blob/master/CHANGELOG
>
> Let me know if you have any ideas. I'm going to leave the Haskell version
> how it is and focus on wrapping the Intel asm (which has a permissive
> license).
>
> Cheers,
> -Ryan
>
> P.S. Regarding this benchmarking -- would it be appropriate to use
> Criterion for this? Or is it sufficient to measure aggregate throughput as
> I've been doing?
>
>
[Attachment #5 (text/html)]
Hi Cafe,<br><br>I've included Gladman's efficient, portable C implementation \
of AES generating random numbers now, and the hardware-accelerated version is working \
too. I'm currently seeing higher throughput for even the software based version \
than the builtin stdGen:<br>
<br><span style="font-family: courier new,monospace;"> First, timing with \
System.Random interface:</span><span style="font-family: courier \
new,monospace;"></span><br style="font-family: courier new,monospace;"><span \
style="font-family: courier new,monospace;"> 13,051,964 random ints generated \
[System.Random stdGen] ~ 252 cycles/int</span><br style="font-family: courier \
new,monospace;">
<span style="font-family: courier new,monospace;"> 15,635 random ints \
generated [PureHaskell/reference] ~ 210,763 cycles/int</span><br \
style="font-family: courier new,monospace;"><span style="font-family: courier \
new,monospace;"> 31,159 random ints generated [PureHaskell] ~ \
105,757 cycles/int</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 2,180,488 random ints \
generated [Gladman inefficient] ~ 1,511 cycles/int</span><br \
style="font-family: courier new,monospace;"><span style="font-family: courier \
new,monospace;"> 15,015,095 random ints generated [Gladman] ~ \
219 cycles/int</span><br style="font-family: courier new,monospace;">
<br>That seems like a good argument for cryptographic RNGs to me!<br><br>I'm \
having a lot of trouble getting cabal to build/install it successfully. You can see \
what I've got there now. I'd be interested to know if anyone else can build \
it successfully. It should work -- but only by building the assembly code into a .so \
and assuming the build directory is /opt/intel-aes ;-).<br>
<br>I don't have real numbers for the hardware version yet because the Westmere \
machine I'm logged into is redhat 5.4 and is giving me "GLIBC_2.7 not \
found" errors. You can run it for correctness purposes using an emulation tool \
called sde <a href="http://software.intel.com/en-us/articles/intel-software-development-emulator/">(software \
development emulator)</a> that's based on dynamic binary translation.<br>
<br>-Ryan<br><br>P.S. Checkout command:<br><span style="font-family: courier \
new,monospace;">git clone git://<a \
href="http://github.com/rrnewton/intel-aes.git">github.com/rrnewton/intel-aes.git</a></span><br><br><br><br>
<br>
<div class="gmail_quote">On Sat, Jan 29, 2011 at 8:52 AM, Ryan Newton <span \
dir="ltr"><<a href="mailto:rrnewton@gmail.com">rrnewton@gmail.com</a>></span> \
wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; \
border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="gmail_quote"><div class="im"><blockquote class="gmail_quote" \
style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); \
padding-left: 1ex;"> perhaps performance? Is this approach less robust with a \
faster,<br> non-cryptographic RNG?<br></blockquote></div><div><br>Yes, I don't \
understand that either. Is there a reason that using a weaker PRNG in this case is \
WORSE than using it in the non-splitting case? Is that why there is more of an \
impetus to use the cryptographic approach in this case?<br>
<br>Anyway, taking for granted that the Burton approach is a useful thing to have \
implemented, I started developing a package for this stuff -- AES based RNG including \
both a haskell implementation and wrapping an AESNI-based C one . I haven't \
posted it to Hackage yet, but you can find the git repository here:<br>
<br> <a href="https://github.com/rrnewton/intel-aes" \
target="_blank">https://github.com/rrnewton/intel-aes</a><br><br>If you build with \
cabal and run the benchmark-intel-aes-rng executable, it will give you a breakdown \
like this:<br>
<br>
<span style="font-family: courier new,monospace;"> How many random numbers can we \
generate in a second on one thread?</span><br> <span style="font-family: courier \
new,monospace;"> Cost of rdtsc (ffi call): 83</span><br style="font-family: \
courier new,monospace;">
<span style="font-family: courier new,monospace;"> Approx getCPUTime calls per \
second: 205,640</span><br style="font-family: courier new,monospace;"><span \
style="font-family: courier new,monospace;"> Approx clock frequency: \
3,306,891,339</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> First, timing with \
System.Random interface:</span><br style="font-family: courier new,monospace;"><span \
style="font-family: courier new,monospace;"> 193,178,901 random ints generated \
[constant zero gen] </span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 14,530,358 random ints \
generated [System.Random stdGen] </span><br style="font-family: courier \
new,monospace;"><span style="font-family: courier new,monospace;"> 16,346 \
random ints generated [BurtonGenSlow/reference] </span><br style="font-family: \
courier new,monospace;">
<span style="font-family: courier new,monospace;"> 32,965 random ints \
generated [BurtonGenSlow] </span><br style="font-family: courier \
new,monospace;"><span style="font-family: courier new,monospace;"> Comparison to \
C's rand():</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 118,766,285 random ints \
generated [rand/store in C loop]</span><br style="font-family: courier \
new,monospace;"><span style="font-family: courier new,monospace;"> 114,668,028 \
random ints generated [rand / Haskell loop]</span><br style="font-family: courier \
new,monospace;">
<span style="font-family: courier new,monospace;"> 114,675,116 random ints \
generated [rand/store Haskell] </span><br><br>At the moment this is Haskell-only, I \
haven't included the wrapped Intel assembly code library yet. As you can see, \
the pure-Haskell AES based RNG (BurtonGenSlow) is pretty slow.<br>
<br>Would anyone else be interested in running those RNG testing tools (diehard, big \
crush) on this to make sure it is working correctly?<br><br>Also I'd be happy if \
anyone with a performance-oriented-eye would like to take a look at what's going \
on. Both for the sake of the serial performance (above) and because the parallel \
performance is currently <b>mysterious</b> (see below).<br>
<br>I figure one of the main reasons for splittable RNG is deterministic parallel \
computations. Thus it's important that all threads be able to run the RNG \
efficiently. Right now, if you look at SimpleRNGBench.hs I'm just running the \
same RNG on numCapabilities threads. Yet with that simple test I'm running into \
problems, summarized thus:<br>
<br> * I get substantial (3X) variance in program performance on consecutive \
runs.<br> * I see a minor hit in performance adding -threaded, but going from -N1 to \
-N4 (even with a single-thread of work) yields a big hit in performance and increase \
in variance.<br>
* -N4 with four actual threads of work is actually pretty good for the pure haskell \
version. All four threads on my nehalem 3.33ghz can maintain 93% of their throughput \
in the serial case. BUT the variance problem persists.<br>
* I run a busy-wait loop that measures cpu frequency... and this seems to get \
messed up in threaded mode (even with -qm -qa). I don't know why.<br> * I \
cannot killThread a haskell thread (forkIO or forkOS) that is currently running a \
divergent FFI call (safe or unsafe). (See "time_c".)<br>
<br>You can find the details in the DEVLOG here:<br><br> <a \
href="https://github.com/rrnewton/intel-aes/blob/master/CHANGELOG" \
target="_blank">https://github.com/rrnewton/intel-aes/blob/master/CHANGELOG</a><br><br>Let \
me know if you have any ideas. I'm going to leave the Haskell version how it is \
and focus on wrapping the Intel asm (which has a permissive license).<br>
<br>Cheers,<br><font color="#888888"> -Ryan<br>
</font><br>P.S. Regarding this benchmarking -- would it be appropriate to use \
Criterion for this? Or is it sufficient to measure aggregate throughput as I've \
been doing?<br><br></div></div>
</blockquote></div><br>
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic