
List:       kfm-devel
Subject:    Re: UDSEntry compression ideas
From:       Mark <markg85 () gmail ! com>
Date:       2013-09-20 19:31:03
Message-ID: CAPd6JnEiHLiMbywy6k3TWwVvT7e66qKXtvoYP7uuD4c+NJJguw () mail ! gmail ! com

On Fri, Sep 20, 2013 at 10:29 AM, Frank Reininghaus <frank78ac@googlemail.com> wrote:

> Hi Mark,
>
> 2013/9/20 Mark:
> [...]
> >> > The easiest "compression" that I could think of was to force the
> >> > UDSEntries to implicitly share all their QStrings for one field if
> >> > they are all equal:
> >> >
> >> > http://paste.kde.org/p52a24b49/
> >> >
> >> > Reduces the memory consumption in Details View for 100,000 items from
> >> > 160.5 MB to 147 MB here. The patch is a bit hackish though, and it
> >> > might have a small performance penalty. Maybe it's worth investigating
> >> > that further though.
> >>
> >> OK, so I did some performance measurements. Short version: my earlier
> >> patch does make listing directories slightly slower, but I found
> >> another way to hack UDSEntry that apparently does not make things
> >> slower, and that reduces the memory usage by about 450 bytes per
> >> listed file, without any loss of functionality.
> >>
> >> Long version:
> >>
> >> (a) First of all, I took the current master branches of kdelibs and
> >> kde-baseapps (all built in release mode) and added some profiling
> >> output:
> >>
> >> http://paste.kde.org/p9b9b7b68/
> >> http://paste.kde.org/p992af44c/
> >>
> >> Then I opened a folder with 100,000 files in Dolphin in Icons View.
> >>
> >> According to KSysGuard, the memory usage of the dolphin process is 200.6
> >> MB.
> >>
> >> For the performance measurements, I discarded the first run and went
> >> out of the folder and back twice to get reproducible results (the time
> >> spent per item in the first run fluctuated strongly).
> >>
> >> The time that SlaveInterface::dispatch(int _cmd, const QByteArray
> >> &rawdata) needs per item was between 2.6 and 2.8 microseconds.
> >>
> >>
> >> (b) My first patch that only makes use of implicit sharing of the
> >> QStrings:
> >>
> >> http://paste.kde.org/p52a24b49/
> >>
> >> Memory usage drops to 187.9 MB, but the time spent per item increases
> >> to 2.9 to 3.1 microseconds (not surprising, I think).
> >>
> >>
> >> (c) My new patch, that makes more intrusive changes to the way the
> >> data is stored inside UDSEntry (explanations below):
> >>
> >> http://paste.kde.org/pc85e6b34/
> >>
> >> Memory usage is now 156.8 MB, and the time per item 2.6 to 2.9
> >> microseconds.
> >>
> >>
> >> This means that we can save more than 450 bytes for each file, which
> >> is IMHO a lot.
> >>
> >>
> >>
> >> How does it work?
> >>
> >> The basic observation is that UDSEntry currently stores all data in a
> >> QHash<uint, Field>, and that a QHash is not really space-efficient
> >> (see, e.g., http://doc.qt.digia.com/qq/qq19-containers.html to get an
> >> idea how much memory is needed to store the data inside a QHash).
> >> However, it has the nice property that the "Field" that an "uint" is
> >> mapped to can be accessed in O(1) time.
> >>
> >> My idea to improve this and keep the O(1) access is the following: We
> >> store all fields in UDSEntryPrivate in a simple
> >>
> >> QVector<Field> fields
> >>
> >> in the order in which the fields are read from the stream. We remember
> >> which UDS field comes at which position and store this in a
> >>
> >> QHash<uint, int> udsIndexes
> >>
> >> (uint is the type of the "UDS field", and a QVector uses "int" for
> >> indexing, but there are still a few signed/unsigned warnings with my
> >> patch which need investigation).
> >>
> >> Then we can access a "Field" that corresponds to a uint "uds" with
> >>
> >> d->fields.at(d->udsIndexes.value(uds)).
> >>
> >> Now the obvious question is: "How on earth can this be more efficient
> >> than the current solution? There is still a QHash in each
> >> UDSEntryPrivate after all, and the code gets more complicated!"
> >>
> >> The trick is that the mapping from the "uint uds" to the index of the
> >> corresponding "Field" is usually exactly the same for every UDSEntry
> >> if we read many of them in a row, and we can make all UDSEntryPrivates
> >> share this mapping (QHash is implicitly shared). This means that every
> >> UDSEntryPrivate only has a private QVector<Field>, which only needs a
> >> rather small overhead compared to the memory that the Fields
> >> themselves need.
> >>
> >> The patch definitely needs further testing and polishing. I probably
> >> won't get much work done until next week though.
> >>
> >> Best regards,
> >> Frank
> >
> >
> > Hi Frank,
> >
> > That is really stunning! I sadly don't have a proper testcase to test
> > it on frameworks. So I made a very quick, rough test app that simply
> > stores all entries in memory.
> >
> > Doing that I get these very impressive results:
> > PRE:
> > 500,000 files: 236.208 MB
> >
> > POST (Frank):
> > 500,000 files: 118.148 MB
> >
> > You can remove another ~14 MB in both cases, which the app just has
> > with other stuff going on.
> >
> > However, looking at your new patch I can't help but notice that the
> > hash is now only used as a lookup table. Perhaps you can just get rid
> > of it now and simply loop through the vector to find the right "Field"
> > object? You would probably have to add a uint field to the Field
> > struct in that case, otherwise you wouldn't know the field key. I
> > guess it would take a bit more CPU power, since you would have to loop
> > through the whole QVector to find a value. On the other hand, you only
> > have to loop over 8 items. Perhaps that is even faster than a hash
> > lookup: even if that is O(1), it still has to do the hash calculation
> > and some other stuff, as opposed to just iterating over a vector and
> > comparing whether the uint key (an int!) matches.
>
> Yes, the QHash is just a lookup table, but removing it won't help
> anything in terms of memory usage. Note that it's most likely
> implicitly shared with loads of other UDSEntryPrivate structs, so the
> amortized memory cost per UDSEntry is just a little more than the size
> of a pointer (8 bytes on a 64-bit system).
>
> OTOH, if you store a uint with each field, that requires
>
> "number of fields" * 4 bytes (size of unit) + overhead
>
> If you store the uints in a separate container, you have a certain
> constant overhead. If you store them inside the "Field" struct, the
> overhead will probably be 4 bytes per field on a 64-bit system because
> currently the "Field" contains a QString (effectively a pointer, i.e.,
> 8 bytes) and a long long (8 bytes). If you add a 4 byte uint to that
> struct, the compiler will add 4 extra bytes to the struct to preserve
> the 8 byte alignment of the pointers [1].
>
> Moreover, I'm not sure if it's safe to assume that each UDSEntry will
> only contain 8 items. There might be kioslaves which add a lot more
> now or in the future, so I'm not sure if sacrificing the O(1) access
> is a good idea.
>
> > Btw. This stuff just shows again how freaking smart you are
>
> Thanks :-) But there are also lots of programming-related areas where
> I'm a lot less smart than other people. And I should also say that
> this blog post by Milian made me aware of the implicit sharing idea, I
> don't think I would have had such ideas without reading it:
>
> http://milianw.de/blog/katekdevelop-sprint-vienna-2012-take-1
>
> The only small flaw in the KMail use case is that the "implicit
> sharing pool" can in principle grow indefinitely if it's being fed
> with different strings all the time. But fortunately, this cannot
> happen with the implicit sharing implementation in my UDSEntry patch. I
> wouldn't be surprised if there were many more such opportunities in
> Qt-based applications and libraries.
>
> Cheers,
> Frank
>
> [1] By the way, a cool way to check such things quickly is
> http://gcc.godbolt.org/. Just paste the code below into the "Code
> editor", and you can immediately see the size of the different structs
> in the assembly output:
>
> struct test1
> {
>   void *v;
> };
>
> struct test2
> {
>   void *v;
>   int i;
> };
>
> struct test3
> {
>   void *v;
>   int i;
>   int j;  // Depending on the platform, 'j' will not need any
>           // additional space compared to 'test2'.
> };
>
> int size1()
> {
>   return sizeof(test1);
> }
>
> int size2()
> {
>   return sizeof(test2);
> }
>
> int size3()
> {
>   return sizeof(test3);
> }
>

That stuff is really cool!
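
Just to check that I understand the layout you describe, here is a minimal
sketch of how I picture it (the names are invented for illustration, this
is not your actual patch):

#include <QHash>
#include <QString>
#include <QVector>

struct Field {
    QString strValue;   // used by string-type UDS fields
    long long numValue; // used by numeric UDS fields
};

struct UDSEntryPrivate {
    // One Field per UDS field, in the order they are read from the stream.
    QVector<Field> fields;

    // Maps a UDS field id to its index in 'fields'. Consecutive entries
    // from the same listing usually contain the same fields in the same
    // order, so this hash can be implicitly shared by many
    // UDSEntryPrivate instances and costs each of them only a pointer.
    QHash<uint, int> udsIndexes;

    QString stringValue(uint uds) const
    {
        const int index = udsIndexes.value(uds, -1);
        return index >= 0 ? fields.at(index).strValue : QString();
    }
};

Is that roughly what the patch does?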

Now I see a few more possible optimizations. I wonder what your opinion is
on those (rough, untested sketches of what I mean are at the end of this
mail).

1. Instead of storing the string as a QString, you can also store it as a
QByteArray and call squeeze() after setting it. The places that return a
QString would simply construct one with QString(...).
2. Use Google's "sparse_hash_map"; it's extremely memory efficient!
https://code.google.com/p/sparsehash/ Reading how it is implemented is
also very interesting:
https://sparsehash.googlecode.com/svn/trunk/doc/implementation.html
That was another "WOW!" moment for me. The downside: it's slower.
3. Right now the Field struct stores a QString and a long long. Surely
there must be a way to use just one type. I guess QVariant has too much
overhead to be used, but something like that.

Just my remaining ideas, which I can't even implement myself :) (OK, 1 I
can, 2 I probably can, but 3... nope. That requires templating knowledge
that I simply don't have, if it's possible at all.)
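
For idea 1, I mean something like this rough, untested sketch (the QString
is only created when a caller actually asks for it):

#include <QByteArray>
#include <QString>

struct Field {
    QByteArray strValue; // stored as UTF-8 and squeezed to its exact size
    long long numValue;
};

void setStringValue(Field &field, const QString &value)
{
    field.strValue = value.toUtf8();
    field.strValue.squeeze(); // drop any spare capacity
}

QString stringValue(const Field &field)
{
    // Construct the QString on demand.
    return QString::fromUtf8(field.strValue);
}

Whether that actually saves anything probably depends on how often the
strings are converted back and forth.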
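
Idea 2 would look roughly like the following (assuming the sparsehash
headers are installed; I have not measured this):

#include <sparsehash/sparse_hash_map>

#include <QString>

struct Field {
    QString strValue;
    long long numValue;
};

int main()
{
    // According to its documentation, sparse_hash_map has a very small
    // per-entry memory overhead, at the cost of slower lookups/inserts.
    google::sparse_hash_map<uint, Field> fields;

    Field name;
    name.strValue = QString::fromLatin1("hello.txt");
    name.numValue = 0;
    fields[1] = name; // '1' standing in for a UDS field id

    return fields.find(1) != fields.end() ? 0 : 1;
}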
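
And for idea 3, QVariant shows the kind of thing I mean, even if it is
probably too heavy per field:

#include <QString>
#include <QVariant>

int main()
{
    // A single value type instead of a separate QString and long long.
    QVariant name = QString::fromLatin1("hello.txt");
    QVariant size = QVariant(qlonglong(4096));

    const QString n = name.toString();      // string-type UDS field
    const qlonglong s = size.toLongLong();  // numeric UDS field

    return (n == QString::fromLatin1("hello.txt") && s == 4096) ? 0 : 1;
}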
