List:       kde-kimageshop
Subject:    Some notes about the directions of optimization of image merging in Krita
From:       Dmitry Kazakov <dimula73 () gmail ! com>
Date:       2013-04-12 16:58:29
Message-ID: CAEkBSfUTK76UzJiHdKg0BwrfkYHsSQBrZW4n7EeWb6=gsqcm-g () mail ! gmail ! com

Hi, all!

Recently I've been experimenting with optimizing how Krita handles
huge multilayer images (including some more aggressive
multithreading), so I'd like to share some ideas that came out of it.

There are three general approaches we can adopt here:

1) Multithreading at the level of the KisUpdateScheduler (we do
already have it). The general idea is that huge update regions are
split into smaller rectangles and each rect is merged separately in
its own thread.

2) Multithreading at the level of KisPainter. Each bitBlt (or
bitBltFixed) operation can split its work region into smaller rects
and process each rect in a separate thread. I believe Sven did some
experiments on this some time ago, but I don't know how they turned
out.

3) Avoid bitBlt of empty tiles (tiles filled entirely with the
default, transparent pixel).
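
To make approaches 1) and 2) concrete, here is a minimal sketch of
the shared idea: cut a large region into smaller rects and let a few
worker threads process them. It uses only the C++ standard library.
Rect, splitRect and mergeInParallel are hypothetical names made up
for illustration; this is not the actual KisUpdateScheduler or
KisPainter code.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for QRect; not a Krita type.
struct Rect { int x, y, w, h; };

// Split `area` into patches of at most patch x patch pixels.
std::vector<Rect> splitRect(const Rect &area, int patch)
{
    assert(patch > 0);
    std::vector<Rect> out;
    for (int y = area.y; y < area.y + area.h; y += patch) {
        for (int x = area.x; x < area.x + area.w; x += patch) {
            out.push_back({x, y,
                           std::min(patch, area.x + area.w - x),
                           std::min(patch, area.y + area.h - y)});
        }
    }
    return out;
}

// Call mergeOne(rect) for every patch. The workers pull patch
// indices from a shared atomic counter, so no mutex is needed to
// hand out the work. mergeOne must be safe to call concurrently.
template <typename MergeFn>
void mergeInParallel(const std::vector<Rect> &patches,
                     unsigned threadCount, MergeFn mergeOne)
{
    threadCount = std::max(1u, threadCount);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < threadCount; ++i) {
        workers.emplace_back([&] {
            for (std::size_t j = next.fetch_add(1); j < patches.size();
                 j = next.fetch_add(1)) {
                mergeOne(patches[j]);
            }
        });
    }
    for (auto &t : workers) t.join();
}
```

The patches do not overlap, so each worker writes to its own part of
the destination; any shared state touched inside mergeOne still needs
its own synchronization.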


Results:

1,2) [common things] While doing the measurements I found a very
interesting thing. It looks like the implementation of QMutex and,
therefore, QReadWriteLock in Qt <= 4.7 is really flawed. The mutex
there does not scale at all. It works well *only* with a thread
count <= 2(!). Raising the number of threads above 2 degrades
performance to a level worse than what we get with a single thread.

Qt >= 4.8 does not have this flaw. Its implementation scales with the
number of threads, and the speed increases somewhat as threads are
added.

Because of that, I tried to implement my own read-write lock that
relies on QMutex much less. The tests showed that my implementation
partly solves the problem in Qt 4.7, but the new mutexes in Qt 4.8
surpass it by about 15-20% [0]. So I think I should drop that idea
and simply limit the number of threads used when Krita is running on
Qt <= 4.7.
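
The fallback could look roughly like the sketch below.
mergeThreadLimit is a hypothetical helper, not existing Krita code,
and the cap of 2 is simply taken from the measurement described
above. The version numbers are plain integers here; in real code they
would have to be obtained at runtime, for example by parsing the
string returned by qVersion().

```cpp
#include <algorithm>

// Hypothetical helper: decide how many merge threads to use.
// qtMajor/qtMinor describe the Qt version found at runtime.
int mergeThreadLimit(int qtMajor, int qtMinor, int hardwareThreads)
{
    const int available = std::max(1, hardwareThreads);
    const bool scalableMutex =
        qtMajor > 4 || (qtMajor == 4 && qtMinor >= 8);
    // With the old QMutex, more than 2 threads was measured to be
    // slower than a single thread, so cap the count at 2.
    return scalableMutex ? available : std::min(available, 2);
}
```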

1,2) [differences] In general, both kinds of multithreading are
useful in Krita, for different use cases. Threading at level 1)
(scheduler) is useful when we have filters or masks in the stack. In
this case the code of the filters and filter masks also benefits
from the threads. It can also help with ordinary layers, but the
update area has to be quite large (512+ px wide) so that the
scheduler can split it into smaller rectangles. That is exactly the
case when we do a full refresh of the image (e.g. when changing the
visibility of a layer).

Threading at level 2) (that is, KisPainter) will not give any
benefit for filters, but it can help when painting on ordinary
layers, when the update rects are smaller than 512 pixels.

In my benchmarks I measured the speed of a full refresh of a huge
image (about 4000x6000 px) containing about 20 layers. It turned out
that the refresh speed grows (non-linearly) with the total *sum* of
the threads present, that is, both kinds of threading affect the
speed, although the scheduler is a bit more efficient in this [full
refresh] test case. And the good thing is that not much overhead is
created: the refresh with 6+6 threads is only 5% slower than the
refresh with 6 scheduler threads (of course this applies to Qt 4.8
only).

As a result, I think adding threads at the level of KisPainter is a
good idea, because the scheduler cannot cover all the use cases.

But still, the threading code leaves something to be desired. Using
4-8 threads gives a speed boost of only about 2 times (although the
portion of parallel code in this test case is almost 100%). I cannot
profile what is happening, because I have Qt 4.8 only on a virtual
machine and I cannot run VTune there.
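
As a rough sanity check, assuming Amdahl's law is a fair model here
(that is an assumption, not something measured), the observed 2x
speedup S can be converted into an effective parallel fraction p for
N threads:

```latex
\begin{align*}
S(N) &= \frac{1}{(1 - p) + \frac{p}{N}}
  \quad\Longrightarrow\quad
  p = \frac{1 - \frac{1}{S}}{1 - \frac{1}{N}} \\
S = 2,\ N = 4 &: \quad p = \frac{0.5}{0.75} = \frac{2}{3} \approx 0.67 \\
S = 2,\ N = 8 &: \quad p = \frac{0.5}{0.875} = \frac{4}{7} \approx 0.57
\end{align*}
```

Under this model only about 57-67% of the work effectively runs in
parallel, which would point to some remaining serialization (for
example contention on a lock) rather than to code that is almost
100% parallel.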

3) Ok, now about avoiding empty tiles. Everything is simple here. I
tested this approach and it gives about a 2 times speed boost almost
for free! I just need to implement it properly: extend the interface
of the data manager a bit and add some general iteration classes to
KisPainter.
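
A minimal sketch of the idea, with a toy sparse tile store instead of
the real data manager: tiles that were never written are simply
absent and read as the default (transparent) value, so the blit only
visits tiles that exist in the source. SparseLayer, kTileSize and
blitSkippingEmpty are hypothetical names, the layer has a single
8-bit channel for brevity, and max() stands in for a real composite
op; the only property relied on is that a transparent source leaves
the destination unchanged.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

constexpr int kTileSize = 64;  // assumed tile edge, in pixels
using Tile = std::array<std::uint8_t, kTileSize * kTileSize>;

// Hypothetical sparse layer: a missing tile means "all default
// pixels" (0, i.e. fully transparent). Coordinates are non-negative.
struct SparseLayer {
    std::map<std::pair<int, int>, Tile> tiles;

    void setPixel(int x, int y, std::uint8_t v) {
        assert(x >= 0 && y >= 0);
        // operator[] creates a zero-filled tile on first access.
        Tile &t = tiles[{x / kTileSize, y / kTileSize}];
        t[(y % kTileSize) * kTileSize + (x % kTileSize)] = v;
    }
    std::uint8_t getPixel(int x, int y) const {
        auto it = tiles.find({x / kTileSize, y / kTileSize});
        if (it == tiles.end()) return 0;  // default pixel
        return it->second[(y % kTileSize) * kTileSize + (x % kTileSize)];
    }
};

// Composite src onto dst, touching only tiles present in src.
// Returns the number of tiles actually processed.
int blitSkippingEmpty(const SparseLayer &src, SparseLayer &dst)
{
    int processed = 0;
    for (const auto &entry : src.tiles) {
        Tile &out = dst.tiles[entry.first];
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] = std::max(out[i], entry.second[i]);  // toy op
        ++processed;
    }
    return processed;
}
```

For a mostly empty layer in a 4000x6000 image the loop touches only
the few tiles that hold data instead of every tile in the rect.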


[0] - http://wstaw.org/m/2013/04/12/plasma-desktopSV2476.jpg



---
Dmitry Kazakov


_______________________________________________
Krita mailing list
kimageshop@kde.org
https://mail.kde.org/mailman/listinfo/kimageshop

