List:       kfm-devel
Subject:    Re: UDSEntry compression ideas
From:       Mark <markg85 () gmail ! com>
Date:       2013-09-20 19:31:03
Message-ID: CAPd6JnEiHLiMbywy6k3TWwVvT7e66qKXtvoYP7uuD4c+NJJguw () mail ! gmail ! com

On Fri, Sep 20, 2013 at 10:29 AM, Frank Reininghaus <
frank78ac@googlemail.com> wrote:

> Hi Mark,
>
> 2013/9/20 Mark:
> [...]
> >> > The easiest "compression" that I could think of was to force the
> >> > UDSEntries to implicitly share all their QStrings for one field if
> >> > they are all equal:
> >> >
> >> > http://paste.kde.org/p52a24b49/
> >> >
> >> > Reduces the memory consumption in Details View for 100,000 items from
> >> > 160.5 MB to 147 MB here. The patch is a bit hackish though, and it
> >> > might have a small performance penalty. Maybe it's worth investigating
> >> > that further though.
> >>
> >> OK, so I did some performance measurements. Short version: my earlier
> >> patch does make listing directories slightly slower, but I found
> >> another way to hack UDSEntry that apparently does not make things
> >> slower, and that reduces the memory usage by about 450 bytes per
> >> listed file, without any loss of functionality.
> >>
> >> Long version:
> >>
> >> (a) First of all, I took the current master branches of kdelibs and
> >> kde-baseapps (all built in release mode) and added some profiling
> >> output:
> >>
> >> http://paste.kde.org/p9b9b7b68/
> >> http://paste.kde.org/p992af44c/
> >>
> >> Then I opened a folder with 100,000 files in Dolphin in Icons View.
> >>
> >> According to KSysGuard, the memory usage of the dolphin process is 200.6
> >> MB.
> >>
> >> For the performance measurements, I discarded the first run and went
> >> out of the folder and back twice to get reproducible results (the time
> >> spent per item in the first run fluctuated strongly).
> >>
> >> The time that SlaveInterface::dispatch(int _cmd, const QByteArray
> >> &rawdata) needs per item was between 2.6 and 2.8 microseconds.
> >>
> >>
> >> (b) My first patch that only makes use of implicit sharing of the
> >> QStrings:
> >>
> >> http://paste.kde.org/p52a24b49/
> >>
> >> Memory usage drops to 187.9 MB, but the time spent per item increases
> >> to 2.9 to 3.1 microseconds (not surprising, I think).
> >>
> >>
> >> (c) My new patch, that makes more intrusive changes to the way the
> >> data is stored inside UDSEntry (explanations below):
> >>
> >> http://paste.kde.org/pc85e6b34/
> >>
> >> Memory usage is now 156.8 MB, and the time per item 2.6 to 2.9
> >> microseconds.
> >>
> >>
> >> This means that we can save more than 450 bytes for each file, which
> >> is IMHO a lot.
> >>
> >>
> >>
> >> How does it work?
> >>
> >> The basic observation is that UDSEntry currently stores all data in a
> >> QHash<uint, Field>, and that a QHash is not really space-efficient
> >> (see, e.g., http://doc.qt.digia.com/qq/qq19-containers.html to get an
> >> idea how much memory is needed to store the data inside a QHash).
> >> However, it has the nice property that the "Field" that an "uint" is
> >> mapped to can be accessed in O(1) time.
> >>
> >> My idea to improve this and keep the O(1) access is the following: We
> >> store all fields in UDSEntryPrivate in a simple
> >>
> >> QVector<Field> fields
> >>
> >> in the order in which the fields are read from the stream. We remember
> >> which UDS field comes at which position and store this in a
> >>
> >> QHash<uint, int> udsIndexes
> >>
> >> (uint is the type of the "UDS field", and a QVector uses "int" for
> >> indexing, but there are still a few signed/unsigned warnings with my
> >> patch which need investigation).
> >>
> >> Then we can access a "Field" that corresponds to a uint "uds" with
> >>
> >> d->fields.at(d->udsIndexes.value(uds)).
> >>
> >> Now the obvious question is: "How on earth can this be more efficient
> >> than the current solution? There is still a QHash in each
> >> UDSEntryPrivate after all, and the code gets more complicated!"
> >>
> >> The trick is that the mapping from the "uint uds" to the index of the
> >> corresponding "Field" is usually exactly the same for every UDSEntry
> >> if we read many of them in a row, and we can make all UDSEntryPrivates
> >> share this mapping (QHash is implicitly shared). This means that every
> >> UDSEntryPrivate only has a private QVector<Field>, which only needs a
> >> rather small overhead compared to the memory that the Fields
> >> themselves need.
> >>
> >> The patch definitely needs further testing and polishing. I probably
> >> won't get much work done until next week though.
> >>
> >> Best regards,
> >> Frank
> >
> >
> > Hi Frank,
> >
> > That is really stunning! I sadly don't have a proper test case to test
> > it on frameworks, so I made a very quick and rough test app that simply
> > stores all entries in memory.
> >
> > Doing that, I get these very impressive results:
> > PRE:
> > 500.000 files: 236,208MB
> >
> > POST (Frank):
> > 500.000 files: 118,148MB
> >
> > You can subtract another ~14 MB in both cases, which the app uses for
> > other stuff going on.
> >
> > However, looking at your new patch, I can't help but notice that the
> > hash is now only used as a lookup table. Perhaps you can just get rid of
> > it and simply loop through the vector to find the right "Field" object?
> > You would probably have to add a uint field to the Field struct in that
> > case, because otherwise you wouldn't know the field key. I guess it
> > would take a bit more CPU time, since you would have to loop through the
> > whole QVector to find a value. On the other hand, you only have to loop
> > over 8 items. Perhaps that is even faster than a hash lookup: even if
> > that's O(1), it still has to do the hash calculation and some other
> > stuff, as opposed to just iterating over a vector and comparing whether
> > the uint key (an int!) matches.
>
> Yes, the QHash is just a lookup table, but removing it won't help
> anything in terms of memory usage. Note that it's most likely
> implicitly shared with loads of other UDSEntryPrivate structs, so the
> amortized memory cost per UDSEntry is just a little more than the size
> of a pointer (8 bytes on a 64-bit system).
>
> OTOH, if you store a uint with each field, that requires
>
> "number of fields" * 4 bytes (size of uint) + overhead
>
> If you store the uints in a separate container, you have a certain
> constant overhead. If you store them inside the "Field" struct, the
> overhead will probably be 4 bytes per field on a 64-bit system because
> currently the "Field" contains a QString (effectively a pointer, i.e.,
> 8 bytes) and a long long (8 bytes). If you add a 4 byte uint to that
> struct, the compiler will add 4 extra bytes to the struct to preserve
> the 8 byte alignment of the pointers [1].
>
> Moreover, I'm not sure if it's safe to assume that each UDSEntry will
> only contain 8 items. There might be kioslaves which add a lot more
> now or in the future, so I'm not sure if sacrificing the O(1) access
> is a good idea.
>
> > Btw. This stuff just shows again how freaking smart you are
>
> Thanks :-) But there are also lots of programming-related areas where
> I'm a lot less smart than other people. And I should also say that
> this blog post by Milian made me aware of the implicit sharing idea; I
> don't think I would have had such ideas without reading it:
>
> http://milianw.de/blog/katekdevelop-sprint-vienna-2012-take-1
>
> The only small flaw in the KMail use case is that the "implicit
> sharing pool" can in principle grow indefinitely if it's being fed
> with different strings all the time. But fortunately, this cannot
> happen with the implicit sharing implementation in my UDSEntry patch. I
> wouldn't be surprised if there were many more such opportunities in
> Qt-based applications and libraries.
>
> Cheers,
> Frank
>
> [1] By the way, a cool way to check such things quickly is
> http://gcc.godbolt.org/. Just paste the code below into the "Code
> editor", and you can immediately see the size of the different structs
> in the assembly output:
>
> struct test1
> {
>   void *v;
> };
>
> struct test2
> {
>   void *v;
>   int i;
> };
>
> struct test3
> {
>   void *v;
>   int i;
>   int j;  // Depending on the platform, 'j' will not need any
>           // additional space compared to 'test2'.
> };
>
> int size1()
> {
>   return sizeof(test1);
> }
>
> int size2()
> {
>   return sizeof(test2);
> }
>
> int size3()
> {
>   return sizeof(test3);
> }
>

That stuff is really cool!

Now I see a few more possible optimizations. I wonder what your opinion is
on those.

1. Instead of storing the string as a QString, you could store it as a
QByteArray and call squeeze() after setting the string. The places that
return a QString would simply construct one with QString(...).
2. Use Google's "sparse_hash_map"; it's extremely memory-efficient!
https://code.google.com/p/sparsehash/ Reading how it is implemented is
also very interesting:
https://sparsehash.googlecode.com/svn/trunk/doc/implementation.html
That was another "WOW!" moment for me. The downside: it's slower.
3. Right now the Field struct stores a QString and a long long. Surely
there must be a way to use just one type. I guess QVariant has too much
overhead to be used here, but something along those lines.

These are just my remaining ideas; I can't implement them all myself :)
(OK, 1 I can, 2 I probably can, but 3... nope. That requires templating
knowledge that I simply don't have, if it's possible at all.)
