[prev in list] [next in list] [prev in thread] [next in thread] 

List:       kde-devel
Subject:    Re: Review Request 120438: improve baloo file indexing overhead by ~2.5x
From:       "Aaron J. Seigo" <aseigo () kde ! org>
Date:       2015-02-27 14:08:02
Message-ID: 20150227140802.12883.85283 () probe ! kde ! org
[Download RAW message or body]

--===============6708298779760466460==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/120438/
-----------------------------------------------------------

(Updated Feb. 27, 2015, 2:08 p.m.)


Status
------

This change has been discarded.


Review request for Baloo.


Repository: baloo


Description
-------

Current Baloo's file indexer takes batches of files by its command line and processes \
them async. There are several iterations over and creation of various collection \
classes. This patch set does away with all of this.

Instead, the worker process now is directed over stdin and responds over stdout (old \
school UNIX!). This is done using a very simple protocol which is fully documented in \
docs/ and is encapsulated in a convenience class so as to make future maintenance \
relatively painless in the advent that the protocol is adjusted or an entirely new \
system is put inplace. IOW, the next me should have an easier time improving on what \
is there. This allows the worker process to be long(er)-lived and moves more of the \
relevant bookkeeping to the same place (the worker process). The resulting code is \
about the same size but cleaner and much more performant.

The tracking of file progress is also moved to the (less expensive, easier to update \
incrementally) sqlite db where file data is already kept. It also versions the sqlite \
schema so that runtime upgrades can be made (as required here) both now and in the \
future without user intervention.

Finally, a tiny (93 LOC, including header) executable that wraps the protocol \
encapsulation class is provided for 100% backwards compatibility of *all* command \
line options that currently exist.

This makes it a 100% drop-in replacement.

Performance of the affected unit tests is ~2.5x better on my laptop: 7.6 v 3.0 \
seconds.

Known weaknesses:

* still isn't hardened against multiple processes indexing simultaneously. the added \
                sqlite columns are a step in this direction, howeer
* the DBus signal that was being emitted was removed and has not been replaced with \
anything yet. That is because the affected class is in baloo-widgets (which happens \
to be the only user of the relevant *public* class in the baloo file library) and I \
have not yet worked on that, pending this patch set being approved in principle. I am \
actually of the suspicion that keeping that data synchronized at all times is not \
needed .. but if it is, there are much better ways than DBus. Note that the DBus \
overhead was applied to *every single file indexed* even if Dolphin (the only user of \
this feature) wasn't running or had the metadata frame open.

The base idea behind this patch is to make file indexing as blazing fast as possible, \
which means using as few resources, CPU & memory, as possible. That ought to be the \
#1 priority for Baloo when it comes to indexing files, both to keep it away from \
user's notice on desktop class hardware as well as extend its relevance to smaller \
devices. The code in this patch is at least as clear (imho more so, actually) than \
what it replaces and was written for maintainability.


Diffs
-----

  src/file/extractor/app.cpp PRE-CREATION 
  src/file/extractor/baloo_file_extractor.h a8cff51ae763c36fb2934c80c04cbf0f710a01c7 
  src/file/extractor/baloo_file_extractor.cpp \
b2e7f97f2f0537807c314e2aca008d67711a69ce   src/file/extractor/extractorworker.cpp \
e710908dc8b02b7edc9f40e44b9d5d5883986846   src/file/extractor/main.cpp \
606a1eb9d42b7079fee497a3fdffbe2327512c13   src/file/fileindexingjob.h PRE-CREATION 
  src/file/fileindexingjob.cpp PRE-CREATION 
  src/file/fileindexingqueue.h 5b1c612feb84046acf36b3dec3bf890da2cbf650 
  src/file/fileindexingqueue.cpp c6479cd25308bee9bb6fe6e6791391e6ca83bddf 
  src/file/indexingqueue.cpp c8ae4d9d1436ebe222a4758bd5ce551e091ec714 
  src/file/lib/CMakeLists.txt fb158a5f00f07ab48de963bf72924c83fc1c6673 
  src/file/lib/extractorclient.h 1e2ae09c7bd0068ca185b814c5c4996c0b563856 
  src/file/lib/extractorclient.cpp 221b90abe1899c27c03cf77e279fb71afce80b4d 
  src/file/lib/filemapping.cpp 5769fe23576c341917e8f37f7dd2743dd446a6e1 
  src/file/main.cpp 41855215c7b550c403d5ef523d29f9b73db58cb0 
  src/file/metadatamover.cpp 5117687c1247275235c3830b9d2234bfff353748 
  src/file/util.h a4cc2423fd46cda7146b02c572cc3f4ccd069e3e 
  src/file/util.cpp 2f7ac6de0dddc7d71c2b36bb4fe88677bfe52336 
  src/file/autotest/fileindexingjob/fileindexingjobtest.h PRE-CREATION 
  src/file/autotest/fileindexingjob/fileindexingjobtest.cpp PRE-CREATION 
  src/file/autotest/scheduler/schedulertest.cpp \
b91a481138382b609068b4737c654ac5182c37ec   src/file/basicindexingjob.h \
3d2fb339e38be862332ddca45b4926f8000fa0c1   src/file/basicindexingjob.cpp \
8d70fa2b1ca737eaa8d2e717a0eb2dbf430fac84   src/file/basicindexingqueue.cpp \
8484c9b6868890cae6c2e19b29ea92d500aac03e   src/file/commitqueue.cpp \
b26f73cfdbc390cceef01aa6abd2b5af6f9689b2   src/file/database.h \
6548de7dbce003112dad8c58c39c624dd806a261   src/file/database.cpp \
5c9ccdc442d230bdc5dc9edbe5bc595537d08a8f   src/file/extractor/CMakeLists.txt \
7cf3478f547f7fbcdf8908a3471b50fab09b5220   src/file/extractor/app.h \
                8059a16e8ec2e2a84d1a3b7eb1dd8cfc2a3136de 
  .gitignore 1377554ebea6f98a2c748183bc5a96852af12ac2 
  docs/file_extractor.md 3d830e2313bd35994f9fd1d6b52df422d0556fd8 
  src/file/CMakeLists.txt 1a9e6bf8f2c70b096fbb2c6c9eb78feec9d06622 
  src/file/autotest/CMakeLists.txt ccb8575fa8c07f2b136d259ed63d43a09ad75a12 
  src/file/autotest/basicindexingjobtest.h PRE-CREATION 
  src/file/autotest/basicindexingjobtest.cpp PRE-CREATION 
  src/file/autotest/fileindexingjob/CMakeLists.txt PRE-CREATION 
  src/file/autotest/fileindexingjob/extractor.cpp PRE-CREATION 

Diff: https://git.reviewboard.kde.org/r/120438/diff/


Testing
-------

Indexed files, ran testsuite with 0 failures.


Thanks,

Aaron J. Seigo


--===============6708298779760466460==
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: 7bit




<html>
 <body>
  <div style="font-family: Verdana, Arial, Helvetica, Sans-Serif;">
   <table bgcolor="#f9f3c9" width="100%" cellpadding="12" style="border: 1px #c9c399 \
solid; border-radius: 6px; -moz-border-radius: 6px; -webkit-border-radius: 6px;">  \
<tr>  <td>
      This is an automatically generated e-mail. To reply, visit:
      <a href="https://git.reviewboard.kde.org/r/120438/">https://git.reviewboard.kde.org/r/120438/</a>
  </td>
    </tr>
   </table>
   <br />



<table bgcolor="#e0e0e0" width="100%" cellpadding="12" style="border: 1px gray solid; \
border-radius: 6px; -moz-border-radius: 6px; -webkit-border-radius: 6px;">  <tr>
  <td>
   <h1 style="margin: 0; padding: 0; font-size: 10pt;">This change has been \
discarded.</h1>  </td>
 </tr>
</table>
<br />


<table bgcolor="#fefadf" width="100%" cellspacing="0" cellpadding="12" style="border: \
1px #888a85 solid; border-radius: 6px; -moz-border-radius: 6px; \
-webkit-border-radius: 6px;">  <tr>
  <td>

<div>Review request for Baloo.</div>
<div>By Aaron J. Seigo.</div>


<p style="color: grey;"><i>Updated Feb. 27, 2015, 2:08 p.m.</i></p>









<div style="margin-top: 1.5em;">
 <b style="color: #575012; font-size: 10pt;">Repository: </b>
baloo
</div>


<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Description </h1>
 <table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" \
style="border: 1px solid #b8b5a0">  <tr>
  <td>
   <pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: \
-moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: \
break-word;"><p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Current Baloo's file indexer takes batches of files by \
its command line and processes them async. There are several iterations over and \
creation of various collection classes. This patch set does away with all of \
this.</p> <p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Instead, the worker process now is directed over stdin \
and responds over stdout (old school UNIX!). This is done using a very simple \
protocol which is fully documented in docs/ and is encapsulated in a convenience \
class so as to make future maintenance relatively painless in the advent that the \
protocol is adjusted or an entirely new system is put inplace. IOW, the next me \
should have an easier time improving on what is there. This allows the worker process \
to be long(er)-lived and moves more of the relevant bookkeeping to the same place \
(the worker process). The resulting code is about the same size but cleaner and much \
more performant.</p> <p style="padding: 0;text-rendering: inherit;margin: \
0;line-height: inherit;white-space: inherit;">The tracking of file progress is also \
moved to the (less expensive, easier to update incrementally) sqlite db where file \
data is already kept. It also versions the sqlite schema so that runtime upgrades can \
be made (as required here) both now and in the future without user intervention.</p> \
<p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Finally, a tiny (93 LOC, including header) executable \
that wraps the protocol encapsulation class is provided for 100% backwards \
compatibility of <em style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">all</em> command line options that currently exist.</p> \
<p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">This makes it a 100% drop-in replacement.</p> <p \
style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
inherit;">Performance of the affected unit tests is ~2.5x better on my laptop: 7.6 v \
3.0 seconds.</p> <p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Known weaknesses:</p> <ul style="padding: \
0;text-rendering: inherit;margin: 0 0 0 1em;line-height: inherit;white-space: \
normal;"> <li style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">still isn't hardened against multiple processes \
indexing simultaneously. the added sqlite columns are a step in this direction, \
howeer</li> <li style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">the DBus signal that was being emitted was removed and \
has not been replaced with anything yet. That is because the affected class is in \
baloo-widgets (which happens to be the only user of the relevant <em style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
normal;">public</em> class in the baloo file library) and I have not yet worked on \
that, pending this patch set being approved in principle. I am actually of the \
suspicion that keeping that data synchronized at all times is not needed .. but if it \
is, there are much better ways than DBus. Note that the DBus overhead was applied to \
<em style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">every single file indexed</em> even if Dolphin (the \
only user of this feature) wasn't running or had the metadata frame open.</li> </ul>
<p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">The base idea behind this patch is to make file \
indexing as blazing fast as possible, which means using as few resources, CPU &amp; \
memory, as possible. That ought to be the #1 priority for Baloo when it comes to \
indexing files, both to keep it away from user's notice on desktop class hardware as \
well as extend its relevance to smaller devices. The code in this patch is at least \
as clear (imho more so, actually) than what it replaces and was written for \
maintainability.</p></pre>  </td>
 </tr>
</table>


<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Testing </h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: \
1px solid #b8b5a0">  <tr>
  <td>
   <pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: \
-moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: \
break-word;"><p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Indexed files, ran testsuite with 0 \
failures.</p></pre>  </td>
 </tr>
</table>


<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Diffs</b> </h1>
<ul style="margin-left: 3em; padding-left: 0;">

 <li>src/file/extractor/app.cpp <span style="color: grey">(PRE-CREATION)</span></li>

 <li>src/file/extractor/baloo_file_extractor.h <span style="color: \
grey">(a8cff51ae763c36fb2934c80c04cbf0f710a01c7)</span></li>

 <li>src/file/extractor/baloo_file_extractor.cpp <span style="color: \
grey">(b2e7f97f2f0537807c314e2aca008d67711a69ce)</span></li>

 <li>src/file/extractor/extractorworker.cpp <span style="color: \
grey">(e710908dc8b02b7edc9f40e44b9d5d5883986846)</span></li>

 <li>src/file/extractor/main.cpp <span style="color: \
grey">(606a1eb9d42b7079fee497a3fdffbe2327512c13)</span></li>

 <li>src/file/fileindexingjob.h <span style="color: grey">(PRE-CREATION)</span></li>

 <li>src/file/fileindexingjob.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/fileindexingqueue.h <span style="color: \
grey">(5b1c612feb84046acf36b3dec3bf890da2cbf650)</span></li>

 <li>src/file/fileindexingqueue.cpp <span style="color: \
grey">(c6479cd25308bee9bb6fe6e6791391e6ca83bddf)</span></li>

 <li>src/file/indexingqueue.cpp <span style="color: \
grey">(c8ae4d9d1436ebe222a4758bd5ce551e091ec714)</span></li>

 <li>src/file/lib/CMakeLists.txt <span style="color: \
grey">(fb158a5f00f07ab48de963bf72924c83fc1c6673)</span></li>

 <li>src/file/lib/extractorclient.h <span style="color: \
grey">(1e2ae09c7bd0068ca185b814c5c4996c0b563856)</span></li>

 <li>src/file/lib/extractorclient.cpp <span style="color: \
grey">(221b90abe1899c27c03cf77e279fb71afce80b4d)</span></li>

 <li>src/file/lib/filemapping.cpp <span style="color: \
grey">(5769fe23576c341917e8f37f7dd2743dd446a6e1)</span></li>

 <li>src/file/main.cpp <span style="color: \
grey">(41855215c7b550c403d5ef523d29f9b73db58cb0)</span></li>

 <li>src/file/metadatamover.cpp <span style="color: \
grey">(5117687c1247275235c3830b9d2234bfff353748)</span></li>

 <li>src/file/util.h <span style="color: \
grey">(a4cc2423fd46cda7146b02c572cc3f4ccd069e3e)</span></li>

 <li>src/file/util.cpp <span style="color: \
grey">(2f7ac6de0dddc7d71c2b36bb4fe88677bfe52336)</span></li>

 <li>src/file/autotest/fileindexingjob/fileindexingjobtest.h <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/fileindexingjob/fileindexingjobtest.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/scheduler/schedulertest.cpp <span style="color: \
grey">(b91a481138382b609068b4737c654ac5182c37ec)</span></li>

 <li>src/file/basicindexingjob.h <span style="color: \
grey">(3d2fb339e38be862332ddca45b4926f8000fa0c1)</span></li>

 <li>src/file/basicindexingjob.cpp <span style="color: \
grey">(8d70fa2b1ca737eaa8d2e717a0eb2dbf430fac84)</span></li>

 <li>src/file/basicindexingqueue.cpp <span style="color: \
grey">(8484c9b6868890cae6c2e19b29ea92d500aac03e)</span></li>

 <li>src/file/commitqueue.cpp <span style="color: \
grey">(b26f73cfdbc390cceef01aa6abd2b5af6f9689b2)</span></li>

 <li>src/file/database.h <span style="color: \
grey">(6548de7dbce003112dad8c58c39c624dd806a261)</span></li>

 <li>src/file/database.cpp <span style="color: \
grey">(5c9ccdc442d230bdc5dc9edbe5bc595537d08a8f)</span></li>

 <li>src/file/extractor/CMakeLists.txt <span style="color: \
grey">(7cf3478f547f7fbcdf8908a3471b50fab09b5220)</span></li>

 <li>src/file/extractor/app.h <span style="color: \
grey">(8059a16e8ec2e2a84d1a3b7eb1dd8cfc2a3136de)</span></li>

 <li>.gitignore <span style="color: \
grey">(1377554ebea6f98a2c748183bc5a96852af12ac2)</span></li>

 <li>docs/file_extractor.md <span style="color: \
grey">(3d830e2313bd35994f9fd1d6b52df422d0556fd8)</span></li>

 <li>src/file/CMakeLists.txt <span style="color: \
grey">(1a9e6bf8f2c70b096fbb2c6c9eb78feec9d06622)</span></li>

 <li>src/file/autotest/CMakeLists.txt <span style="color: \
grey">(ccb8575fa8c07f2b136d259ed63d43a09ad75a12)</span></li>

 <li>src/file/autotest/basicindexingjobtest.h <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/basicindexingjobtest.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/fileindexingjob/CMakeLists.txt <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/fileindexingjob/extractor.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

</ul>

<p><a href="https://git.reviewboard.kde.org/r/120438/diff/" style="margin-left: \
3em;">View Diff</a></p>






  </td>
 </tr>
</table>



  </div>
 </body>
</html>


--===============6708298779760466460==--


[Attachment #3 (text/plain)]


>> Visit http://mail.kde.org/mailman/listinfo/kde-devel#unsub to unsubscribe <<


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic