[prev in list] [next in list] [prev in thread] [next in thread] 

List:       kde-devel
Subject:    Re: Review Request 120438: improve baloo file indexing overhead by ~2.5x
From:       "Aaron J. Seigo" <aseigo () kde ! org>
Date:       2014-10-13 20:27:46
Message-ID: 20141013202746.21223.96980 () probe ! kde ! org
[Download RAW message or body]

--===============2813361265626602854==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit



> On Sept. 30, 2014, 4:25 p.m., Vishesh Handa wrote:
> > Hey Aaron. Thanks for putting this up for review. Could you please split this \
> > patch into multiple reviews? I see a mix of this patch, xapian->sqlite changes, \
> > new API, coding styles changes, and some improvements in the FileMapping class. 
> > The indexing progress seems to be now tracked in sqlite instead of Xapian. You \
> > mentioned it is more performant. Do you know why that is? Also, I'd love to know \
> > how these benchmarks were measured. I don't see any tests for benchmarks. Perhaps \
> > you forgot to add them to this patch? 
> > And finally, from what I understand this removed the need for the whole binary \
> > search when a file fails to index. Could you please add a test in order to make \
> > sure the extractor crashing is handled properly? Some tests for the file indexing \
> > queue would also be needed. 
> > How do you handle when a file takes too long to index? The FileIndexingJob had a \
> > timer for 5 minutes.

Sorry I haven't responded sooner but I was away for a week and have been in a job \
transition since then, and that's kept me occupied the last while. I'll hopefully \
have some time again to hack on personal / fun things opening up by week's end.

"Could you please split this patch into multiple reviews? I see a mix of this patch, \
xapian->sqlite changes, new API, coding styles changes, and some improvements in the \
FileMapping class."

The code style fixes are limited two small places. The only one that is entirely \
separate from functional changes is in indexingqueue.cpp. If you really want those \
bits reverted, I can do so, but I'm not going to submit a separate review for them on \
their own. Realistically, it's 6 curly braces (by my count) ... 

The rest of the changes belong together. The xapian->sqlite changes are necessary for \
the cross-process queue syncronization, the new API is the cross-process \
communication and the other changes all support those changes. Example: the sqlite db \
schema versioning is needed to perform the upgrade to the schema; without that ther \
would be code there to check if there are specific columns or not, and that would \
then get replaced by a more sane versioning (such as what is in this patch). There's \
really not much to break up further.

"The indexing progress seems to be now tracked in sqlite instead of Xapian."

Correct. The booleans stored in the xapian db as Z1 and Z2 were pretty obscure (what \
does 'Z1' mean?) and the sqlite db already has a listing of the files being indexed \
so is a natural place to track progress asyncronously between processes without \
requiring the xapian db to commit continuously.

"You mentioned it is more performant. Do you know why that is?"

Yes (see above)

"Also, I'd love to know how these benchmarks were measured."

The schedulertest. Handy as both an integration test and a small benchmark. Extending \
it to create more than 20 files per run is pretty simple.

"Could you please add a test in order to make sure the extractor crashing is handled \
properly?"

Sure. (Figuring out how to simulate it crashing, since it really shouldn't ever do \
that, will be fun and interesting ... but I'm sure it can be done)

"Some tests for the file indexing queue would also be needed."

What needs testing beyond what schedulertest already tests?

"How do you handle when a file takes too long to index?"

It doesn't currently; I'll add that.


> On Sept. 30, 2014, 4:25 p.m., Vishesh Handa wrote:
> > .gitignore, line 1
> > <https://git.reviewboard.kde.org/r/120438/diff/1/?file=315778#file315778line1>
> > 
> > Is this required?

no; doesn't have to be there. happy to leave it out.


> On Sept. 30, 2014, 4:25 p.m., Vishesh Handa wrote:
> > src/file/lib/extractorclient.h, line 30
> > <https://git.reviewboard.kde.org/r/120438/diff/1/?file=315808#file315808line30>
> > 
> > This add new public API.

Which makes sense if it is used by the Baloo widgets repository, unless they can \
share code privately. Otherwise, yes, it's fine to make this private if nothing else \
is going to use it.


> On Sept. 30, 2014, 4:25 p.m., Vishesh Handa wrote:
> > src/file/main.cpp, line 89
> > <https://git.reviewboard.kde.org/r/120438/diff/1/?file=315811#file315811line89>
> > 
> > No more transactions?

I dropped transactions so that the db can be used as a syncronization point for \
in-progress-indexing (which files are done, which are merely scheduled, etc.) Looking \
at the code, I could find nothing that gets added to the sqlite db which requires \
rollback as things are now committed one at a time. Using the sqlite db as a sync \
point means that processes can't lock the db for long periods of time which is \
exactly what was happening with the use of transactions.

I was concerned about performance, but it seems to not be an issue.


- Aaron J.


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/120438/#review67687
-----------------------------------------------------------


On Sept. 30, 2014, 12:47 p.m., Aaron J. Seigo wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/120438/
> -----------------------------------------------------------
> 
> (Updated Sept. 30, 2014, 12:47 p.m.)
> 
> 
> Review request for Baloo.
> 
> 
> Repository: baloo
> 
> 
> Description
> -------
> 
> Current Baloo's file indexer takes batches of files by its command line and \
> processes them async. There are several iterations over and creation of various \
> collection classes. This patch set does away with all of this. 
> Instead, the worker process now is directed over stdin and responds over stdout \
> (old school UNIX!). This is done using a very simple protocol which is fully \
> documented in docs/ and is encapsulated in a convenience class so as to make future \
> maintenance relatively painless in the advent that the protocol is adjusted or an \
> entirely new system is put inplace. IOW, the next me should have an easier time \
> improving on what is there. This allows the worker process to be long(er)-lived and \
> moves more of the relevant bookkeeping to the same place (the worker process). The \
> resulting code is about the same size but cleaner and much more performant. 
> The tracking of file progress is also moved to the (less expensive, easier to \
> update incrementally) sqlite db where file data is already kept. It also versions \
> the sqlite schema so that runtime upgrades can be made (as required here) both now \
> and in the future without user intervention. 
> Finally, a tiny (93 LOC, including header) executable that wraps the protocol \
> encapsulation class is provided for 100% backwards compatibility of *all* command \
> line options that currently exist. 
> This makes it a 100% drop-in replacement.
> 
> Performance of the affected unit tests is ~2.5x better on my laptop: 7.6 v 3.0 \
> seconds. 
> Known weaknesses:
> 
> * still isn't hardened against multiple processes indexing simultaneously. the \
>                 added sqlite columns are a step in this direction, howeer
> * the DBus signal that was being emitted was removed and has not been replaced with \
> anything yet. That is because the affected class is in baloo-widgets (which happens \
> to be the only user of the relevant *public* class in the baloo file library) and I \
> have not yet worked on that, pending this patch set being approved in principle. I \
> am actually of the suspicion that keeping that data synchronized at all times is \
> not needed .. but if it is, there are much better ways than DBus. Note that the \
> DBus overhead was applied to *every single file indexed* even if Dolphin (the only \
> user of this feature) wasn't running or had the metadata frame open. 
> The base idea behind this patch is to make file indexing as blazing fast as \
> possible, which means using as few resources, CPU & memory, as possible. That ought \
> to be the #1 priority for Baloo when it comes to indexing files, both to keep it \
> away from user's notice on desktop class hardware as well as extend its relevance \
> to smaller devices. The code in this patch is at least as clear (imho more so, \
> actually) than what it replaces and was written for maintainability. 
> 
> Diffs
> -----
> 
> src/file/extractor/app.cpp PRE-CREATION 
> src/file/extractor/baloo_file_extractor.h a8cff51ae763c36fb2934c80c04cbf0f710a01c7 
> src/file/extractor/baloo_file_extractor.cpp \
> b2e7f97f2f0537807c314e2aca008d67711a69ce  src/file/extractor/extractorworker.cpp \
> e710908dc8b02b7edc9f40e44b9d5d5883986846  src/file/extractor/main.cpp \
> 606a1eb9d42b7079fee497a3fdffbe2327512c13  src/file/fileindexingjob.h PRE-CREATION 
> src/file/fileindexingjob.cpp PRE-CREATION 
> src/file/fileindexingqueue.h 5b1c612feb84046acf36b3dec3bf890da2cbf650 
> src/file/fileindexingqueue.cpp c6479cd25308bee9bb6fe6e6791391e6ca83bddf 
> src/file/indexingqueue.cpp c8ae4d9d1436ebe222a4758bd5ce551e091ec714 
> src/file/lib/CMakeLists.txt fb158a5f00f07ab48de963bf72924c83fc1c6673 
> src/file/lib/extractorclient.h 1e2ae09c7bd0068ca185b814c5c4996c0b563856 
> src/file/lib/extractorclient.cpp 221b90abe1899c27c03cf77e279fb71afce80b4d 
> src/file/lib/filemapping.cpp 5769fe23576c341917e8f37f7dd2743dd446a6e1 
> src/file/main.cpp 41855215c7b550c403d5ef523d29f9b73db58cb0 
> src/file/metadatamover.cpp 5117687c1247275235c3830b9d2234bfff353748 
> src/file/util.h a4cc2423fd46cda7146b02c572cc3f4ccd069e3e 
> src/file/util.cpp 2f7ac6de0dddc7d71c2b36bb4fe88677bfe52336 
> src/file/autotest/fileindexingjob/fileindexingjobtest.h PRE-CREATION 
> src/file/autotest/fileindexingjob/fileindexingjobtest.cpp PRE-CREATION 
> src/file/autotest/scheduler/schedulertest.cpp \
> b91a481138382b609068b4737c654ac5182c37ec  src/file/basicindexingjob.h \
> 3d2fb339e38be862332ddca45b4926f8000fa0c1  src/file/basicindexingjob.cpp \
> 8d70fa2b1ca737eaa8d2e717a0eb2dbf430fac84  src/file/basicindexingqueue.cpp \
> 8484c9b6868890cae6c2e19b29ea92d500aac03e  src/file/commitqueue.cpp \
> b26f73cfdbc390cceef01aa6abd2b5af6f9689b2  src/file/database.h \
> 6548de7dbce003112dad8c58c39c624dd806a261  src/file/database.cpp \
> 5c9ccdc442d230bdc5dc9edbe5bc595537d08a8f  src/file/extractor/CMakeLists.txt \
> 7cf3478f547f7fbcdf8908a3471b50fab09b5220  src/file/extractor/app.h \
>                 8059a16e8ec2e2a84d1a3b7eb1dd8cfc2a3136de 
> .gitignore 1377554ebea6f98a2c748183bc5a96852af12ac2 
> docs/file_extractor.md 3d830e2313bd35994f9fd1d6b52df422d0556fd8 
> src/file/CMakeLists.txt 1a9e6bf8f2c70b096fbb2c6c9eb78feec9d06622 
> src/file/autotest/CMakeLists.txt ccb8575fa8c07f2b136d259ed63d43a09ad75a12 
> src/file/autotest/basicindexingjobtest.h PRE-CREATION 
> src/file/autotest/basicindexingjobtest.cpp PRE-CREATION 
> src/file/autotest/fileindexingjob/CMakeLists.txt PRE-CREATION 
> src/file/autotest/fileindexingjob/extractor.cpp PRE-CREATION 
> 
> Diff: https://git.reviewboard.kde.org/r/120438/diff/
> 
> 
> Testing
> -------
> 
> Indexed files, ran testsuite with 0 failures.
> 
> 
> Thanks,
> 
> Aaron J. Seigo
> 
> 


--===============2813361265626602854==
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: 7bit




<html>
 <body>
  <div style="font-family: Verdana, Arial, Helvetica, Sans-Serif;">
   <table bgcolor="#f9f3c9" width="100%" cellpadding="12" style="border: 1px #c9c399 \
solid; border-radius: 6px; -moz-border-radius: 6px; -webkit-border-radius: 6px;">  \
<tr>  <td>
      This is an automatically generated e-mail. To reply, visit:
      <a href="https://git.reviewboard.kde.org/r/120438/">https://git.reviewboard.kde.org/r/120438/</a>
  </td>
    </tr>
   </table>
   <br />





<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: \
10px;">  <p style="margin-top: 0;">On September 30th, 2014, 4:25 p.m. UTC, <b>Vishesh \
Handa</b> wrote:</p>  <blockquote style="margin-left: 1em; border-left: 2px solid \
#d0d0d0; padding-left: 10px;">  <pre style="white-space: pre-wrap; white-space: \
-moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: \
break-word;"><p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Hey Aaron. Thanks for putting this up for review. \
Could you please split this patch into multiple reviews? I see a mix of this patch, \
xapian-&gt;sqlite changes, new API, coding styles changes, and some improvements in \
the FileMapping class.</p> <p style="padding: 0;text-rendering: inherit;margin: \
0;line-height: inherit;white-space: inherit;">The indexing progress seems to be now \
tracked in sqlite instead of Xapian. You mentioned it is more performant. Do you know \
why that is? Also, I'd love to know how these benchmarks were measured. I don't see \
any tests for benchmarks. Perhaps you forgot to add them to this patch?</p> <p \
style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
inherit;">And finally, from what I understand this removed the need for the whole \
binary search when a file fails to index. Could you please add a test in order to \
make sure the extractor crashing is handled properly? Some tests for the file \
indexing queue would also be needed.</p> <p style="padding: 0;text-rendering: \
inherit;margin: 0;line-height: inherit;white-space: inherit;">How do you handle when \
a file takes too long to index? The FileIndexingJob had a timer for 5 \
minutes.</p></pre>  </blockquote>







</blockquote>

<pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: \
-pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;"><p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">Sorry \
I haven't responded sooner but I was away for a week and have been in a job \
transition since then, and that's kept me occupied the last while. I'll hopefully \
have some time again to hack on personal / fun things opening up by week's end.</p> \
<p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">"Could you please split this patch into multiple \
reviews? I see a mix of this patch, xapian-&gt;sqlite changes, new API, coding styles \
changes, and some improvements in the FileMapping class."</p> <p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">The \
code style fixes are limited two small places. The only one that is entirely separate \
from functional changes is in indexingqueue.cpp. If you really want those bits \
reverted, I can do so, but I'm not going to submit a separate review for them on \
their own. Realistically, it's 6 curly braces (by my count) ... </p> <p \
style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
inherit;">The rest of the changes belong together. The xapian-&gt;sqlite changes are \
necessary for the cross-process queue syncronization, the new API is the \
cross-process communication and the other changes all support those changes. Example: \
the sqlite db schema versioning is needed to perform the upgrade to the schema; \
without that ther would be code there to check if there are specific columns or not, \
and that would then get replaced by a more sane versioning (such as what is in this \
patch). There's really not much to break up further.</p> <p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">"The \
indexing progress seems to be now tracked in sqlite instead of Xapian."</p> <p \
style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
inherit;">Correct. The booleans stored in the xapian db as Z1 and Z2 were pretty \
obscure (what does 'Z1' mean?) and the sqlite db already has a listing of the files \
being indexed so is a natural place to track progress asyncronously between processes \
without requiring the xapian db to commit continuously.</p> <p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">"You \
mentioned it is more performant. Do you know why that is?"</p> <p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">Yes \
(see above)</p> <p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">"Also, I'd love to know how these benchmarks were \
measured."</p> <p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">The schedulertest. Handy as both an integration test \
and a small benchmark. Extending it to create more than 20 files per run is pretty \
simple.</p> <p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">"Could you please add a test in order to make sure the \
extractor crashing is handled properly?"</p> <p style="padding: 0;text-rendering: \
inherit;margin: 0;line-height: inherit;white-space: inherit;">Sure. (Figuring out how \
to simulate it crashing, since it really shouldn't ever do that, will be fun and \
interesting ... but I'm sure it can be done)</p> <p style="padding: 0;text-rendering: \
inherit;margin: 0;line-height: inherit;white-space: inherit;">"Some tests for the \
file indexing queue would also be needed."</p> <p style="padding: 0;text-rendering: \
inherit;margin: 0;line-height: inherit;white-space: inherit;">What needs testing \
beyond what schedulertest already tests?</p> <p style="padding: 0;text-rendering: \
inherit;margin: 0;line-height: inherit;white-space: inherit;">"How do you handle when \
a file takes too long to index?"</p> <p style="padding: 0;text-rendering: \
inherit;margin: 0;line-height: inherit;white-space: inherit;">It doesn't currently; \
I'll add that.</p></pre> <br />







<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: \
10px;">  <p style="margin-top: 0;">On September 30th, 2014, 4:25 p.m. UTC, <b>Vishesh \
Handa</b> wrote:</p>  <blockquote style="margin-left: 1em; border-left: 2px solid \
#d0d0d0; padding-left: 10px;">  


<table width="100%" border="0" bgcolor="white" style="border: 1px solid #C0C0C0; \
border-collapse: collapse; margin: 2px padding: 2px;">  <thead>
  <tr>
   <th colspan="4" bgcolor="#F0F0F0" style="border-bottom: 1px solid #C0C0C0; \
font-size: 9pt; padding: 4px 8px; text-align: left;">  <a \
href="https://git.reviewboard.kde.org/r/120438/diff/1/?file=315778#file315778line1" \
style="color: black; font-weight: bold; text-decoration: underline;">.gitignore</a>  \
<span style="font-weight: normal;">

     (Diff revision 1)

    </span>
   </th>
  </tr>
 </thead>



 
 

 <tbody>

  <tr>
    <th bgcolor="#b1ebb0" style="border-right: 1px solid #C0C0C0;" \
align="right"><font size="2"></font></th>  <td bgcolor="#c5ffc4" width="50%"><pre \
style="font-size: 8pt; line-height: 140%; margin: 0; "></pre></td>  <th \
bgcolor="#b1ebb0" style="border-left: 1px solid #C0C0C0; border-right: 1px solid \
#C0C0C0;" align="right"><font size="2">1</font></th>  <td bgcolor="#c5ffc4" \
width="50%"><pre style="font-size: 8pt; line-height: 140%; margin: 0; \
">*.swp</pre></td>  </tr>

 </tbody>

</table>

  <pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: \
-pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;"><p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">Is \
this required?</p></pre>  </blockquote>





</blockquote>
<pre style="margin-left: 1em; white-space: pre-wrap; white-space: -moz-pre-wrap; \
white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;"><p \
style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
inherit;">no; doesn't have to be there. happy to leave it out.</p></pre> <br />

<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: \
10px;">  <p style="margin-top: 0;">On September 30th, 2014, 4:25 p.m. UTC, <b>Vishesh \
Handa</b> wrote:</p>  <blockquote style="margin-left: 1em; border-left: 2px solid \
#d0d0d0; padding-left: 10px;">  


<table width="100%" border="0" bgcolor="white" style="border: 1px solid #C0C0C0; \
border-collapse: collapse; margin: 2px padding: 2px;">  <thead>
  <tr>
   <th colspan="4" bgcolor="#F0F0F0" style="border-bottom: 1px solid #C0C0C0; \
font-size: 9pt; padding: 4px 8px; text-align: left;">  <a \
href="https://git.reviewboard.kde.org/r/120438/diff/1/?file=315808#file315808line30" \
style="color: black; font-weight: bold; text-decoration: \
underline;">src/file/lib/extractorclient.h</a>  <span style="font-weight: normal;">

     (Diff revision 1)

    </span>
   </th>
  </tr>
 </thead>



 
 

 <tbody>

  <tr>
    <th bgcolor="#b1ebb0" style="border-right: 1px solid #C0C0C0;" \
align="right"><font size="2"></font></th>  <td bgcolor="#c5ffc4" width="50%"><pre \
style="font-size: 8pt; line-height: 140%; margin: 0; "></pre></td>  <th \
bgcolor="#b1ebb0" style="border-left: 1px solid #C0C0C0; border-right: 1px solid \
#C0C0C0;" align="right"><font size="2">30</font></th>  <td bgcolor="#c5ffc4" \
width="50%"><pre style="font-size: 8pt; line-height: 140%; margin: 0; "><span \
class="n">class</span> <span class="n">BALOO_FILE_EXPORT</span> <span \
class="n">ExtractorClient</span> <span class="o">:</span> <span \
class="n">public</span> <span class="n">QObject</span></pre></td>  </tr>

 </tbody>

</table>

  <pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: \
-pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;"><p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">This \
add new public API.</p></pre>  </blockquote>





</blockquote>
<pre style="margin-left: 1em; white-space: pre-wrap; white-space: -moz-pre-wrap; \
white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;"><p \
style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
inherit;">Which makes sense if it is used by the Baloo widgets repository, unless \
they can share code privately. Otherwise, yes, it's fine to make this private if \
nothing else is going to use it.</p></pre> <br />

<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: \
10px;">  <p style="margin-top: 0;">On September 30th, 2014, 4:25 p.m. UTC, <b>Vishesh \
Handa</b> wrote:</p>  <blockquote style="margin-left: 1em; border-left: 2px solid \
#d0d0d0; padding-left: 10px;">  


<table width="100%" border="0" bgcolor="white" style="border: 1px solid #C0C0C0; \
border-collapse: collapse; margin: 2px padding: 2px;">  <thead>
  <tr>
   <th colspan="4" bgcolor="#F0F0F0" style="border-bottom: 1px solid #C0C0C0; \
font-size: 9pt; padding: 4px 8px; text-align: left;">  <a \
href="https://git.reviewboard.kde.org/r/120438/diff/1/?file=315811#file315811line89" \
style="color: black; font-weight: bold; text-decoration: \
underline;">src/file/main.cpp</a>  <span style="font-weight: normal;">

     (Diff revision 1)

    </span>
   </th>
  </tr>
 </thead>

 <tbody style="background-color: #e4d9cb; padding: 4px 8px; text-align: center;">
  <tr>

   <td colspan="4"><pre style="font-size: 8pt; line-height: 140%; margin: 0; ">int \
main(int argc, char** argv)</pre></td>

  </tr>
 </tbody>



 
 

 <tbody>

  <tr>
    <th bgcolor="#ebb1ba" style="border-right: 1px solid #C0C0C0;" \
align="right"><font size="2">89</font></th>  <td bgcolor="#ffc5ce" width="50%"><pre \
style="font-size: 8pt; line-height: 140%; margin: 0; ">    <span \
class="n">db</span><span class="p">.</span><span class="n">sqlDatabase</span><span \
class="p">().</span><span class="n">transaction</span><span \
class="p">();</span></pre></td>  <th bgcolor="#ebb1ba" style="border-left: 1px solid \
#C0C0C0; border-right: 1px solid #C0C0C0;" align="right"><font size="2"></font></th>  \
<td bgcolor="#ffc5ce" width="50%"><pre style="font-size: 8pt; line-height: 140%; \
margin: 0; "></pre></td>  </tr>

 </tbody>

</table>

  <pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: \
-pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;"><p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">No \
more transactions?</p></pre>  </blockquote>





</blockquote>
<pre style="margin-left: 1em; white-space: pre-wrap; white-space: -moz-pre-wrap; \
white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;"><p \
style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
inherit;">I dropped transactions so that the db can be used as a syncronization point \
for in-progress-indexing (which files are done, which are merely scheduled, etc.) \
Looking at the code, I could find nothing that gets added to the sqlite db which \
requires rollback as things are now committed one at a time. Using the sqlite db as a \
sync point means that processes can't lock the db for long periods of time which is \
exactly what was happening with the use of transactions.</p> <p style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: inherit;">I was \
concerned about performance, but it seems to not be an issue.</p></pre> <br />




<p>- Aaron J.</p>


<br />
<p>On September 30th, 2014, 12:47 p.m. UTC, Aaron J. Seigo wrote:</p>









<table bgcolor="#fefadf" width="100%" cellspacing="0" cellpadding="12" style="border: \
1px #888a85 solid; border-radius: 6px; -moz-border-radius: 6px; \
-webkit-border-radius: 6px;">  <tr>
  <td>

<div>Review request for Baloo.</div>
<div>By Aaron J. Seigo.</div>


<p style="color: grey;"><i>Updated Sept. 30, 2014, 12:47 p.m.</i></p>









<div style="margin-top: 1.5em;">
 <b style="color: #575012; font-size: 10pt;">Repository: </b>
baloo
</div>


<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Description </h1>
 <table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" \
style="border: 1px solid #b8b5a0">  <tr>
  <td>
   <pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: \
-moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: \
break-word;"><p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Current Baloo's file indexer takes batches of files by \
its command line and processes them async. There are several iterations over and \
creation of various collection classes. This patch set does away with all of \
this.</p> <p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Instead, the worker process now is directed over stdin \
and responds over stdout (old school UNIX!). This is done using a very simple \
protocol which is fully documented in docs/ and is encapsulated in a convenience \
class so as to make future maintenance relatively painless in the advent that the \
protocol is adjusted or an entirely new system is put inplace. IOW, the next me \
should have an easier time improving on what is there. This allows the worker process \
to be long(er)-lived and moves more of the relevant bookkeeping to the same place \
(the worker process). The resulting code is about the same size but cleaner and much \
more performant.</p> <p style="padding: 0;text-rendering: inherit;margin: \
0;line-height: inherit;white-space: inherit;">The tracking of file progress is also \
moved to the (less expensive, easier to update incrementally) sqlite db where file \
data is already kept. It also versions the sqlite schema so that runtime upgrades can \
be made (as required here) both now and in the future without user intervention.</p> \
<p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Finally, a tiny (93 LOC, including header) executable \
that wraps the protocol encapsulation class is provided for 100% backwards \
compatibility of <em style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">all</em> command line options that currently exist.</p> \
<p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">This makes it a 100% drop-in replacement.</p> <p \
style="padding: 0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
inherit;">Performance of the affected unit tests is ~2.5x better on my laptop: 7.6 v \
3.0 seconds.</p> <p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Known weaknesses:</p> <ul style="padding: \
0;text-rendering: inherit;margin: 0 0 0 1em;line-height: inherit;white-space: \
normal;"> <li style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">still isn't hardened against multiple processes \
indexing simultaneously. the added sqlite columns are a step in this direction, \
howeer</li> <li style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">the DBus signal that was being emitted was removed and \
has not been replaced with anything yet. That is because the affected class is in \
baloo-widgets (which happens to be the only user of the relevant <em style="padding: \
0;text-rendering: inherit;margin: 0;line-height: inherit;white-space: \
normal;">public</em> class in the baloo file library) and I have not yet worked on \
that, pending this patch set being approved in principle. I am actually of the \
suspicion that keeping that data synchronized at all times is not needed .. but if it \
is, there are much better ways than DBus. Note that the DBus overhead was applied to \
<em style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: normal;">every single file indexed</em> even if Dolphin (the \
only user of this feature) wasn't running or had the metadata frame open.</li> </ul>
<p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">The base idea behind this patch is to make file \
indexing as blazing fast as possible, which means using as few resources, CPU &amp; \
memory, as possible. That ought to be the #1 priority for Baloo when it comes to \
indexing files, both to keep it away from user's notice on desktop class hardware as \
well as extend its relevance to smaller devices. The code in this patch is at least \
as clear (imho more so, actually) than what it replaces and was written for \
maintainability.</p></pre>  </td>
 </tr>
</table>


<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Testing </h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: \
1px solid #b8b5a0">  <tr>
  <td>
   <pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: \
-moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: \
break-word;"><p style="padding: 0;text-rendering: inherit;margin: 0;line-height: \
inherit;white-space: inherit;">Indexed files, ran testsuite with 0 \
failures.</p></pre>  </td>
 </tr>
</table>


<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Diffs</b> </h1>
<ul style="margin-left: 3em; padding-left: 0;">

 <li>src/file/extractor/app.cpp <span style="color: grey">(PRE-CREATION)</span></li>

 <li>src/file/extractor/baloo_file_extractor.h <span style="color: \
grey">(a8cff51ae763c36fb2934c80c04cbf0f710a01c7)</span></li>

 <li>src/file/extractor/baloo_file_extractor.cpp <span style="color: \
grey">(b2e7f97f2f0537807c314e2aca008d67711a69ce)</span></li>

 <li>src/file/extractor/extractorworker.cpp <span style="color: \
grey">(e710908dc8b02b7edc9f40e44b9d5d5883986846)</span></li>

 <li>src/file/extractor/main.cpp <span style="color: \
grey">(606a1eb9d42b7079fee497a3fdffbe2327512c13)</span></li>

 <li>src/file/fileindexingjob.h <span style="color: grey">(PRE-CREATION)</span></li>

 <li>src/file/fileindexingjob.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/fileindexingqueue.h <span style="color: \
grey">(5b1c612feb84046acf36b3dec3bf890da2cbf650)</span></li>

 <li>src/file/fileindexingqueue.cpp <span style="color: \
grey">(c6479cd25308bee9bb6fe6e6791391e6ca83bddf)</span></li>

 <li>src/file/indexingqueue.cpp <span style="color: \
grey">(c8ae4d9d1436ebe222a4758bd5ce551e091ec714)</span></li>

 <li>src/file/lib/CMakeLists.txt <span style="color: \
grey">(fb158a5f00f07ab48de963bf72924c83fc1c6673)</span></li>

 <li>src/file/lib/extractorclient.h <span style="color: \
grey">(1e2ae09c7bd0068ca185b814c5c4996c0b563856)</span></li>

 <li>src/file/lib/extractorclient.cpp <span style="color: \
grey">(221b90abe1899c27c03cf77e279fb71afce80b4d)</span></li>

 <li>src/file/lib/filemapping.cpp <span style="color: \
grey">(5769fe23576c341917e8f37f7dd2743dd446a6e1)</span></li>

 <li>src/file/main.cpp <span style="color: \
grey">(41855215c7b550c403d5ef523d29f9b73db58cb0)</span></li>

 <li>src/file/metadatamover.cpp <span style="color: \
grey">(5117687c1247275235c3830b9d2234bfff353748)</span></li>

 <li>src/file/util.h <span style="color: \
grey">(a4cc2423fd46cda7146b02c572cc3f4ccd069e3e)</span></li>

 <li>src/file/util.cpp <span style="color: \
grey">(2f7ac6de0dddc7d71c2b36bb4fe88677bfe52336)</span></li>

 <li>src/file/autotest/fileindexingjob/fileindexingjobtest.h <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/fileindexingjob/fileindexingjobtest.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/scheduler/schedulertest.cpp <span style="color: \
grey">(b91a481138382b609068b4737c654ac5182c37ec)</span></li>

 <li>src/file/basicindexingjob.h <span style="color: \
grey">(3d2fb339e38be862332ddca45b4926f8000fa0c1)</span></li>

 <li>src/file/basicindexingjob.cpp <span style="color: \
grey">(8d70fa2b1ca737eaa8d2e717a0eb2dbf430fac84)</span></li>

 <li>src/file/basicindexingqueue.cpp <span style="color: \
grey">(8484c9b6868890cae6c2e19b29ea92d500aac03e)</span></li>

 <li>src/file/commitqueue.cpp <span style="color: \
grey">(b26f73cfdbc390cceef01aa6abd2b5af6f9689b2)</span></li>

 <li>src/file/database.h <span style="color: \
grey">(6548de7dbce003112dad8c58c39c624dd806a261)</span></li>

 <li>src/file/database.cpp <span style="color: \
grey">(5c9ccdc442d230bdc5dc9edbe5bc595537d08a8f)</span></li>

 <li>src/file/extractor/CMakeLists.txt <span style="color: \
grey">(7cf3478f547f7fbcdf8908a3471b50fab09b5220)</span></li>

 <li>src/file/extractor/app.h <span style="color: \
grey">(8059a16e8ec2e2a84d1a3b7eb1dd8cfc2a3136de)</span></li>

 <li>.gitignore <span style="color: \
grey">(1377554ebea6f98a2c748183bc5a96852af12ac2)</span></li>

 <li>docs/file_extractor.md <span style="color: \
grey">(3d830e2313bd35994f9fd1d6b52df422d0556fd8)</span></li>

 <li>src/file/CMakeLists.txt <span style="color: \
grey">(1a9e6bf8f2c70b096fbb2c6c9eb78feec9d06622)</span></li>

 <li>src/file/autotest/CMakeLists.txt <span style="color: \
grey">(ccb8575fa8c07f2b136d259ed63d43a09ad75a12)</span></li>

 <li>src/file/autotest/basicindexingjobtest.h <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/basicindexingjobtest.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/fileindexingjob/CMakeLists.txt <span style="color: \
grey">(PRE-CREATION)</span></li>

 <li>src/file/autotest/fileindexingjob/extractor.cpp <span style="color: \
grey">(PRE-CREATION)</span></li>

</ul>

<p><a href="https://git.reviewboard.kde.org/r/120438/diff/" style="margin-left: \
3em;">View Diff</a></p>






  </td>
 </tr>
</table>








  </div>
 </body>
</html>


--===============2813361265626602854==--



>> Visit http://mail.kde.org/mailman/listinfo/kde-devel#unsub to unsubscribe <<


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic