'Re: [Kstars-devel] RFC: KStars GSOC: data pipelining and OpenCL.'


List:       kstars-devel
Subject:    Re: [Kstars-devel] RFC: KStars GSOC: data pipelining and OpenCL.
From:       Henry de Valence <hdevalence () gmail ! com>
Date:       2013-04-19 20:05:45
Message-ID: CAOnR5UoPqyzcWR0bEw=YuvUue1JhOVxD=dxh6WduFFXP11+1XA () mail ! gmail ! com

Hi all, sorry for my absence; it's near the end of term and I've been quite
busy.

One thing I'd like to point out is that OpenCL isn't really about graphics
processors; it's a way to structure embarrassingly parallel problems like
the ones in KStars. You can run OpenCL on a CPU, a GPU, an APU, an FPGA,
some weird DSP thing, whatever.

Even for people who use no GPU at all, the OpenCL code lets you run across
multiple cores with no extra effort. Moving to OpenCL means moving away
from the inefficient OO data-processing approach we use now, towards a more
functional, parallelizable approach, so the data representation has to
change to match, and we should obviously change it to work with
quaternions.

I don't see the point of rewriting the KStars processing code completely
just so that we get to where everyone else is. We should rewrite it
properly, so that it works better now on CPU hardware, and beats everyone
else for the common case where the computer has a CL-enabled GPU. I think
that in the case where we have the most possible parallelism (displaying
lots and lots of stars) and we have a GPU, we should aim for 100x speedup,
not 10x.

My rough plan is to change the internal structure of the SkyPoint class to
use quaternions internally, while keeping the existing API as wrappers (of
course, this initially slows everything down, since now you have to do trig
to access, not just calculate with, the coordinates). Then, move most of
the calculation functions for the SkyPoint out of the SkyPoint class and
rewrite them to operate on buffers of quaternion vectors. Finally, move
through all of the sky components and rewrite them to use the new
calculation functions instead of the old, slow ones, processing all of the
objects for the particular component in a single pass, rather than doing
one calculation per object.
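
To make the buffer idea concrete, here is a minimal sketch in C++. All of
the names (Vec3, Quat, fromRaDec, raOf, decOf, fromAxisAngle, rotate,
rotateAll) are hypothetical and are not the existing SkyPoint API. It
assumes angles in radians and a unit quaternion, and it only illustrates
the shape of the approach: pay for trig once when loading, transform a
whole buffer in one pass with plain multiplies and adds, and pay for trig
again only in the wrapper-style accessors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; not the real KStars classes.
struct Vec3 { double x, y, z; };     // unit vector pointing at a sky position
struct Quat { double w, x, y, z; };  // unit quaternion describing a rotation

// Build a unit vector from two angles (radians). Trig is paid once, on load.
Vec3 fromRaDec(double ra, double dec) {
    return { std::cos(dec) * std::cos(ra),
             std::cos(dec) * std::sin(ra),
             std::sin(dec) };
}

// Wrapper-style accessors: getting angles back out now costs trig.
double raOf(const Vec3& v)  { return std::atan2(v.y, v.x); }
double decOf(const Vec3& v) { return std::asin(v.z); }

// Unit quaternion for a rotation by `angle` about the unit vector `axis`.
Quat fromAxisAngle(const Vec3& axis, double angle) {
    const double s = std::sin(angle / 2.0);
    return { std::cos(angle / 2.0), axis.x * s, axis.y * s, axis.z * s };
}

// Rotate v by the unit quaternion q without any trig:
//   t = 2 (u x v),  v' = v + w t + u x t,  where u = (q.x, q.y, q.z).
Vec3 rotate(const Quat& q, const Vec3& v) {
    const Vec3 t = { 2.0 * (q.y * v.z - q.z * v.y),
                     2.0 * (q.z * v.x - q.x * v.z),
                     2.0 * (q.x * v.y - q.y * v.x) };
    return { v.x + q.w * t.x + (q.y * t.z - q.z * t.y),
             v.y + q.w * t.y + (q.z * t.x - q.x * t.z),
             v.z + q.w * t.z + (q.x * t.y - q.y * t.x) };
}

// Single pass over a whole component's buffer. Each element is independent
// of the others, which is what makes the loop easy to split across cores.
void rotateAll(const Quat& q, const std::vector<Vec3>& in,
               std::vector<Vec3>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = rotate(q, in[i]);
    }
}
```

A sky component would fill one input buffer with fromRaDec when its
catalog is loaded, then call rotateAll once per update with the quaternion
for the current coordinate transformation, instead of converting each
object separately.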

Ideally you would remove all references to ra/dec/eq/hor coordinates for
anything, but I think that changing the top 95% of the calls (by time)
would work well enough, especially since we will get a speed boost from
using multiple cores.

Cheers, Henry

On Sat, Apr 13, 2013 at 6:40 PM, Aleksey Khudyakov <alexey.skladnoy@gmail.com> wrote:

> On 14 April 2013 02:04, Akarsh Simha <akarshsimha@gmail.com> wrote:
> >> AFAIR conversion of coordinates is not the worst bottleneck. Last time
> >> I checked (1 or 2 years ago) it was drawing of constellation lines and
> >> borders and the coordinate grid, very much to my surprise. Any proposal
> >> to improve performance must be backed up with profiling/benchmarks.
> >> Otherwise it's too easy to fall into the trap of optimizing the wrong
> >> thing.
> >
> > Even with the USNO NOMAD catalog? That is a bit hard to believe,
> > although it might be the case. With the USNO NOMAD catalog, KStars
> > crawls when zoomed in on Sagittarius.
> >
> Without. That's a valid point. Also, how frequently do we need to update
> horizontal coordinates? For every star in memory on each time step? If
> so, it's a huge time sink too.
>
> Another advantage of the quaternion approach is immutability. We do not
> need to modify the coordinates of a star except possibly to account for
> proper motion. The code should become simpler too.
>
>
> >> Furthermore we can get a ~10x performance boost (uneducated guess) by
> >> changing the representation of a sky point. Currently it's represented
> >> by two angles, and conversions between different coordinate systems are
> >> quite costly: 5 or 6 calls to trigonometry functions.
> >>
> >> A much more convenient scheme is to store points as 3D vectors with
> >> unit norm and some flag to distinguish between coordinate systems. In
> >> this case transformations between different coordinate systems can be
> >> done using quaternions and are cheap (15 multiplications). So there is
> >> no reason to cache horizontal coordinates; they can be recalculated on
> >> the fly if desired. Most of the projections also become cheaper since
> >> they don't involve trigonometry in this representation.
> >>
> >>
> >> This has been discussed on the mailing list before. You can search
> >> using the "quaternion" keyword.
> >
> > Yeah, quaternions are certainly a good idea. Not sure Henry can fit it
> > into his time-line?
> >
> In my opinion it absolutely must be fitted in there. If there isn't enough
> time, drop the OpenCL part. The reasons are simple:
>
>  1. We are going to change the representation of stars/deep-sky
>     objects/whatever anyway. Then we should change it to the most
>     efficient one.
>
>  2. It's possible to render the sky on a CPU with a LOT of stars. Other
>     people did just that. So we should try to get good CPU performance
>     first in order to avoid penalizing people who can't use a GPU for
>     whatever reason.
>
>  3. It's not clear that processing on the GPU is a clear win. Sure, even
>     low-end GPUs are an order of magnitude faster. But only if the
>     workload maps nicely onto the GPU's execution scheme, if we don't
>     saturate the bus, and if no other unforeseen problem surfaces.
>
>  4. We could probably hope for a 10-20x speedup in the ideal case. If we
>     can get a similar speedup by using the right algorithms, we should do
>     that. If this isn't enough, then we need to get the big hammer (the
>     GPU in this case).
> _______________________________________________
> Kstars-devel mailing list
> Kstars-devel@kde.org
> https://mail.kde.org/mailman/listinfo/kstars-devel
>
