[prev in list] [next in list] [prev in thread] [next in thread] 

List:       kde-devel
Subject:    Re: Review Request 120438: improve baloo file indexing overhead by ~2.5x
From:       "Aaron J. Seigo" <aseigo () kde ! org>
Date:       2015-02-27 14:08:02
Message-ID: 20150227140802.12883.85283 () probe ! kde ! org
[Download RAW message or body]

--===============6708298779760466460==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/120438/
-----------------------------------------------------------

(Updated Feb. 27, 2015, 2:08 p.m.)


Status
------

This change has been discarded.


Review request for Baloo.


Repository: baloo


Description
-------

Current Baloo's file indexer takes batches of files by its command line and processes them \
async. There are several iterations over and creation of various collection classes. This patch \
set does away with all of this.

Instead, the worker process now is directed over stdin and responds over stdout (old school \
UNIX!). This is done using a very simple protocol which is fully documented in docs/ and is \
encapsulated in a convenience class so as to make future maintenance relatively painless in the \
advent that the protocol is adjusted or an entirely new system is put inplace. IOW, the next me \
should have an easier time improving on what is there. This allows the worker process to be \
long(er)-lived and moves more of the relevant bookkeeping to the same place (the worker \
process). The resulting code is about the same size but cleaner and much more performant.

The tracking of file progress is also moved to the (less expensive, easier to update \
incrementally) sqlite db where file data is already kept. It also versions the sqlite schema so \
that runtime upgrades can be made (as required here) both now and in the future without user \
intervention.

Finally, a tiny (93 LOC, including header) executable that wraps the protocol encapsulation \
class is provided for 100% backwards compatibility of *all* command line options that currently \
exist.

This makes it a 100% drop-in replacement.

Performance of the affected unit tests is ~2.5x better on my laptop: 7.6 v 3.0 seconds.

Known weaknesses:

* still isn't hardened against multiple processes indexing simultaneously. the added sqlite \
                columns are a step in this direction, howeer
* the DBus signal that was being emitted was removed and has not been replaced with anything \
yet. That is because the affected class is in baloo-widgets (which happens to be the only user \
of the relevant *public* class in the baloo file library) and I have not yet worked on that, \
pending this patch set being approved in principle. I am actually of the suspicion that keeping \
that data synchronized at all times is not needed .. but if it is, there are much better ways \
than DBus. Note that the DBus overhead was applied to *every single file indexed* even if \
Dolphin (the only user of this feature) wasn't running or had the metadata frame open.

The base idea behind this patch is to make file indexing as blazing fast as possible, which \
means using as few resources, CPU & memory, as possible. That ought to be the #1 priority for \
Baloo when it comes to indexing files, both to keep it away from user's notice on desktop class \
hardware as well as extend its relevance to smaller devices. The code in this patch is at least \
as clear (imho more so, actually) than what it replaces and was written for maintainability.


Diffs
-----

  src/file/extractor/app.cpp PRE-CREATION 
  src/file/extractor/baloo_file_extractor.h a8cff51ae763c36fb2934c80c04cbf0f710a01c7 
  src/file/extractor/baloo_file_extractor.cpp b2e7f97f2f0537807c314e2aca008d67711a69ce 
  src/file/extractor/extractorworker.cpp e710908dc8b02b7edc9f40e44b9d5d5883986846 
  src/file/extractor/main.cpp 606a1eb9d42b7079fee497a3fdffbe2327512c13 
  src/file/fileindexingjob.h PRE-CREATION 
  src/file/fileindexingjob.cpp PRE-CREATION 
  src/file/fileindexingqueue.h 5b1c612feb84046acf36b3dec3bf890da2cbf650 
  src/file/fileindexingqueue.cpp c6479cd25308bee9bb6fe6e6791391e6ca83bddf 
  src/file/indexingqueue.cpp c8ae4d9d1436ebe222a4758bd5ce551e091ec714 
  src/file/lib/CMakeLists.txt fb158a5f00f07ab48de963bf72924c83fc1c6673 
  src/file/lib/extractorclient.h 1e2ae09c7bd0068ca185b814c5c4996c0b563856 
  src/file/lib/extractorclient.cpp 221b90abe1899c27c03cf77e279fb71afce80b4d 
  src/file/lib/filemapping.cpp 5769fe23576c341917e8f37f7dd2743dd446a6e1 
  src/file/main.cpp 41855215c7b550c403d5ef523d29f9b73db58cb0 
  src/file/metadatamover.cpp 5117687c1247275235c3830b9d2234bfff353748 
  src/file/util.h a4cc2423fd46cda7146b02c572cc3f4ccd069e3e 
  src/file/util.cpp 2f7ac6de0dddc7d71c2b36bb4fe88677bfe52336 
  src/file/autotest/fileindexingjob/fileindexingjobtest.h PRE-CREATION 
  src/file/autotest/fileindexingjob/fileindexingjobtest.cpp PRE-CREATION 
  src/file/autotest/scheduler/schedulertest.cpp b91a481138382b609068b4737c654ac5182c37ec 
  src/file/basicindexingjob.h 3d2fb339e38be862332ddca45b4926f8000fa0c1 
  src/file/basicindexingjob.cpp 8d70fa2b1ca737eaa8d2e717a0eb2dbf430fac84 
  src/file/basicindexingqueue.cpp 8484c9b6868890cae6c2e19b29ea92d500aac03e 
  src/file/commitqueue.cpp b26f73cfdbc390cceef01aa6abd2b5af6f9689b2 
  src/file/database.h 6548de7dbce003112dad8c58c39c624dd806a261 
  src/file/database.cpp 5c9ccdc442d230bdc5dc9edbe5bc595537d08a8f 
  src/file/extractor/CMakeLists.txt 7cf3478f547f7fbcdf8908a3471b50fab09b5220 
  src/file/extractor/app.h 8059a16e8ec2e2a84d1a3b7eb1dd8cfc2a3136de 
  .gitignore 1377554ebea6f98a2c748183bc5a96852af12ac2 
  docs/file_extractor.md 3d830e2313bd35994f9fd1d6b52df422d0556fd8 
  src/file/CMakeLists.txt 1a9e6bf8f2c70b096fbb2c6c9eb78feec9d06622 
  src/file/autotest/CMakeLists.txt ccb8575fa8c07f2b136d259ed63d43a09ad75a12 
  src/file/autotest/basicindexingjobtest.h PRE-CREATION 
  src/file/autotest/basicindexingjobtest.cpp PRE-CREATION 
  src/file/autotest/fileindexingjob/CMakeLists.txt PRE-CREATION 
  src/file/autotest/fileindexingjob/extractor.cpp PRE-CREATION 

Diff: https://git.reviewboard.kde.org/r/120438/diff/


Testing
-------

Indexed files, ran testsuite with 0 failures.


Thanks,

Aaron J. Seigo


--===============6708298779760466460==
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: 7bit




<html>
 <body>
  <div style="font-family: Verdana, Arial, Helvetica, Sans-Serif;">
   <table bgcolor="#f9f3c9" width="100%" cellpadding="12" style="border: 1px #c9c399 solid; \
border-radius: 6px; -moz-border-radius: 6px; -webkit-border-radius: 6px;">  <tr>
     <td>
      This is an automatically generated e-mail. To reply, visit:
      <a href="https://git.reviewboard.kde.org/r/120438/">https://git.reviewboard.kde.org/r/120438/</a>
  </td>
    </tr>
   </table>
   <br />



<table bgcolor="#e0e0e0" width="100%" cellpadding="12" style="border: 1px gray solid; \
border-radius: 6px; -moz-border-radius: 6px; -webkit-border-radius: 6px;">  <tr>
  <td>
   <h1 style="margin: 0; padding: 0; font-size: 10pt;">This change has been discarded.</h1>
  </td>
 </tr>
</table>
<br />


<table bgcolor="#fefadf" width="100%" cellspacing="0" cellpadding="12" style="border: 1px \
#888a85 solid; border-radius: 6px; -moz-border-radius: 6px; -webkit-border-radius: 6px;">  <tr>
  <td>

<div>Review request for Baloo.</div>
<div>By Aaron J. Seigo.</div>


<p style="color: grey;"><i>Updated Feb. 27, 2015, 2:08 p.m.</i></p>









<div style="margin-top: 1.5em;">
 <b style="color: #575012; font-size: 10pt;">Repository: </b>
baloo
</div>


<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Description </h1>
 <table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: 1px \
solid #b8b5a0">  <tr>
  <td>
   <pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: -moz-pre-wrap; \
white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;"><p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">Current Baloo's \
file indexer takes batches of files by its command line and processes them async. There are \
several iterations over and creation of various collection classes. This patch set does away \
with all of this.</p> <p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Instead, the worker process now is directed over stdin and \
responds over stdout (old school UNIX!). This is done using a very simple protocol which is \
fully documented in docs/ and is encapsulated in a convenience class so as to make future \
maintenance relatively painless in the advent that the protocol is adjusted or an entirely new \
system is put inplace. IOW, the next me should have an easier time improving on what is there. \
This allows the worker process to be long(er)-lived and moves more of the relevant bookkeeping \
to the same place (the worker process). The resulting code is about the same size but cleaner \
and much more performant.</p> <p style="padding: 0;text-rendering: inherit;margin: \
0;line-height: inherit;white-space: inherit;">The tracking of file progress is also moved to \
the (less expensive, easier to update incrementally) sqlite db where file data is already kept. \
It also versions the sqlite schema so that runtime upgrades can be made (as required here) both \
now and in the future without user intervention.</p> <p style="padding: 0;text-rendering: \
inherit;margin: 0;line-height: inherit;white-space: inherit;">Finally, a tiny (93 LOC, \
including header) executable that wraps the protocol encapsulation class is provided for 100% \
backwards compatibility of <em style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">all</em> command line options that currently exist.</p> <p \
style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
inherit;">This makes it a 100% drop-in replacement.</p> <p style="padding: 0;text-rendering: \
inherit;margin: 0;line-height: inherit;white-space: inherit;">Performance of the affected unit \
tests is ~2.5x better on my laptop: 7.6 v 3.0 seconds.</p> <p style="padding: 0;text-rendering: \
inherit;margin: 0;line-height: inherit;white-space: inherit;">Known weaknesses:</p> <ul \
style="padding: 0;text-rendering: inherit;margin: 0 0 0 1em;line-height: inherit;white-space: \
normal;"> <li style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">still isn't hardened against multiple processes indexing \
simultaneously. the added sqlite columns are a step in this direction, howeer</li> <li \
style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
normal;">the DBus signal that was being emitted was removed and has not been replaced with \
anything yet. That is because the affected class is in baloo-widgets (which happens to be the \
only user of the relevant <em style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">public</em> class in the baloo file library) and I have not yet \
worked on that, pending this patch set being approved in principle. I am actually of the \
suspicion that keeping that data synchronized at all times is not needed .. but if it is, there \
are much better ways than DBus. Note that the DBus overhead was applied to <em style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: normal;">every single \
file indexed</em> even if Dolphin (the only user of this feature) wasn't running or had the \
metadata frame open.</li> </ul>
<p style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
inherit;">The base idea behind this patch is to make file indexing as blazing fast as possible, \
which means using as few resources, CPU &amp; memory, as possible. That ought to be the #1 \
priority for Baloo when it comes to indexing files, both to keep it away from user's notice on \
desktop class hardware as well as extend its relevance to smaller devices. The code in this \
patch is at least as clear (imho more so, actually) than what it replaces and was written for \
maintainability.</p></pre>  </td>
 </tr>
</table>


<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Testing </h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: 1px solid \
#b8b5a0">  <tr>
  <td>
   <pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: -moz-pre-wrap; \
white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;"><p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">Indexed files, \
ran testsuite with 0 failures.</p></pre>  </td>
 </tr>
</table>


<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Diffs</b> </h1>
<ul style="margin-left: 3em; padding-left: 0;">

 <li>src/file/extractor/app.cpp <span style="color: grey">(PRE-CREATION)</span></li>

 <li>src/file/extractor/baloo_file_extractor.h <span style="color: \
grey">(a8cff51ae763c36fb2934c80c04cbf0f710a01c7)</span></li>

 <li>src/file/extractor/baloo_file_extractor.cpp <span style="color: \
grey">(b2e7f97f2f0537807c314e2aca008d67711a69ce)</span></li>

 <li>src/file/extractor/extractorworker.cpp <span style="color: \
grey">(e710908dc8b02b7edc9f40e44b9d5d5883986846)</span></li>

 <li>src/file/extractor/main.cpp <span style="color: \
grey">(606a1eb9d42b7079fee497a3fdffbe2327512c13)</span></li>

 <li>src/file/fileindexingjob.h <span style="color: grey">(PRE-CREATION)</span></li>

 <li>src/file/fileindexingjob.cpp <span style="color: grey">(PRE-CREATION)</span></li>

 <li>src/file/fileindexingqueue.h <span style="color: \
grey">(5b1c612feb84046acf36b3dec3bf890da2cbf650)</span></li>

 <li>src/file/fileindexingqueue.cpp <span style="color: \
grey">(c6479cd25308bee9bb6fe6e6791391e6ca83bddf)</span></li>

 <li>src/file/indexingqueue.cpp <span style="color: \
grey">(c8ae4d9d1436ebe222a4758bd5ce551e091ec714)</span></li>

 <li>src/file/lib/CMakeLists.txt <span style="color: \
grey">(fb158a5f00f07ab48de963bf72924c83fc1c6673)</span></li>

 <li>src/file/lib/extractorclient.h <span style="color: \
grey">(1e2ae09c7bd0068ca185b814c5c4996c0b563856)</span></li>

 <li>src/file/lib/extractorclient.cpp <span style="color: \
grey">(221b90abe1899c27c03cf77e279fb71afce80b4d)</span></li>

 <li>src/file/lib/filemapping.cpp <span style="color: \
grey">(5769fe23576c341917e8f37f7dd2743dd446a6e1)</span></li>

 <li>src/file/main.cpp <span style="color: \
grey">(41855215c7b550c403d5ef523d29f9b73db58cb0)</span></li>

 <li>src/file/metadatamover.cpp <span style="color: \
grey">(5117687c1247275235c3830b9d2234bfff353748)</span></li>

 <li>src/file/util.h <span style="color: \
grey">(a4cc2423fd46cda7146b02c572cc3f4ccd069e3e)</span></li>

 <li>src/file/util.cpp <span style="color: \
grey">(2f7ac6de0dddc7d71c2b36bb4fe88677bfe52336)</span></li>

 <li>src/file/autotest/fileindexingjob/fileindexingjobtest.h <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/fileindexingjob/fileindexingjobtest.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/scheduler/schedulertest.cpp <span style="color: \
grey">(b91a481138382b609068b4737c654ac5182c37ec)</span></li>

 <li>src/file/basicindexingjob.h <span style="color: \
grey">(3d2fb339e38be862332ddca45b4926f8000fa0c1)</span></li>

 <li>src/file/basicindexingjob.cpp <span style="color: \
grey">(8d70fa2b1ca737eaa8d2e717a0eb2dbf430fac84)</span></li>

 <li>src/file/basicindexingqueue.cpp <span style="color: \
grey">(8484c9b6868890cae6c2e19b29ea92d500aac03e)</span></li>

 <li>src/file/commitqueue.cpp <span style="color: \
grey">(b26f73cfdbc390cceef01aa6abd2b5af6f9689b2)</span></li>

 <li>src/file/database.h <span style="color: \
grey">(6548de7dbce003112dad8c58c39c624dd806a261)</span></li>

 <li>src/file/database.cpp <span style="color: \
grey">(5c9ccdc442d230bdc5dc9edbe5bc595537d08a8f)</span></li>

 <li>src/file/extractor/CMakeLists.txt <span style="color: \
grey">(7cf3478f547f7fbcdf8908a3471b50fab09b5220)</span></li>

 <li>src/file/extractor/app.h <span style="color: \
grey">(8059a16e8ec2e2a84d1a3b7eb1dd8cfc2a3136de)</span></li>

 <li>.gitignore <span style="color: \
grey">(1377554ebea6f98a2c748183bc5a96852af12ac2)</span></li>

 <li>docs/file_extractor.md <span style="color: \
grey">(3d830e2313bd35994f9fd1d6b52df422d0556fd8)</span></li>

 <li>src/file/CMakeLists.txt <span style="color: \
grey">(1a9e6bf8f2c70b096fbb2c6c9eb78feec9d06622)</span></li>

 <li>src/file/autotest/CMakeLists.txt <span style="color: \
grey">(ccb8575fa8c07f2b136d259ed63d43a09ad75a12)</span></li>

 <li>src/file/autotest/basicindexingjobtest.h <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/basicindexingjobtest.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/fileindexingjob/CMakeLists.txt <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/fileindexingjob/extractor.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

</ul>

<p><a href="https://git.reviewboard.kde.org/r/120438/diff/" style="margin-left: 3em;">View \
Diff</a></p>






  </td>
 </tr>
</table>



  </div>
 </body>
</html>


--===============6708298779760466460==--


[Attachment #3 (text/plain)]


>> Visit http://mail.kde.org/mailman/listinfo/kde-devel#unsub to unsubscribe <<


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic