
List:       kde-devel
Subject:    Re: Review Request 116692: Lower memory usage of akonadi_baloo_indexer with frequent commits
From:       "Christian Mollekopf" <chrigi_1 () fastmail ! fm>
Date:       2014-03-21 10:58:35
Message-ID: 20140321105835.5913.64282 () probe ! kde ! org

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/116692/#review53637
-----------------------------------------------------------


It turned out that most of the memory was used by the ItemFetchJob loading
all items into memory. We've now optimized this, and for me the indexer never
goes beyond ~250MB (initial indexing), and during normal usage stays around
10MB. I made some experiments with notmuch mail (which also uses Xapian),
and it also stayed around 200MB. This could probably be tweaked further by
adjusting XAPIAN_FLUSH_THRESHOLD to lower the number of changes that are
held in memory before a commit, but IMO 250MB for the initial indexing is a
sane default value.

The only optimization that I think would be viable is releasing the memory
again using malloc_trim or the like (as we used to do in the nepomuk indexer).
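
A minimal sketch of what such a release step could look like, assuming glibc, where malloc_trim() asks the allocator to hand free heap pages back to the OS. The helper names releaseUnusedMemory and allocateAndFreeBatch are made up for illustration and are not taken from the Baloo or Nepomuk sources. On other C libraries the helper is a no-op.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>
#if defined(__GLIBC__)
#include <malloc.h>   // malloc_trim() is a glibc extension
#endif

// Hypothetical helper: ask the allocator to return free heap memory to the OS.
// Returns true only if glibc reports that memory was actually released.
bool releaseUnusedMemory()
{
#if defined(__GLIBC__)
    return malloc_trim(0) == 1;
#else
    return false;     // assumption: nothing comparable available here
#endif
}

// Stand-in for an indexing batch: allocates many small blocks and frees
// them again on return, which is the pattern that leaves free-but-unreturned
// heap memory behind.
std::size_t allocateAndFreeBatch(std::size_t count, std::size_t blockSize)
{
    assert(blockSize > 0);
    std::vector<std::vector<char>> blocks;
    blocks.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        blocks.emplace_back(blockSize, 'x');
    }
    return blocks.size();   // all blocks are freed when the vector goes out of scope
}
```

An indexer would call releaseUnusedMemory() after a batch has been committed. Whether anything is actually returned depends on how fragmented the heap is, so this is a mitigation rather than a guarantee.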

So have the recent fixes also fixed the memory consumption for you, or do
you still think this patch should go in?

- Christian Mollekopf


On March 10, 2014, 11:12 a.m., Aaron J. Seigo wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/116692/
> -----------------------------------------------------------
> 
> (Updated March 10, 2014, 11:12 a.m.)
> 
> 
> Review request for Akonadi and Baloo.
> 
> 
> Repository: baloo
> 
> 
> Description
> -------
> 
> Baloo is using Xapian for storing processed results from data fed to it
> by akonadi; in doing so it processes all the data it is sent to index and
> only once this is complete is the data committed to the Xapian database.
> From http://xapian.org/docs/apidoc/html/classXapian_1_1WritableDatabase.html#acbea2163142de795024880a7123bc693
> we see: "For efficiency reasons, when performing multiple updates to a
> database it is best (indeed, almost essential) to make as many
> modifications as memory will permit in a single pass through the
> database. To ensure this, Xapian batches up modifications." This means
> that *all* the data to be stored in the Xapian database first ends up in
> RAM. When indexing large mailboxes (or any other large chunk of data)
> this results in a very large amount of memory allocation. On one test of
> 100k mails in a maildir folder this resulted in 1.5GB of RAM used. In
> normal daily usage with maildir I find that it easily balloons to several
> hundred megabytes within days. This makes the Baloo indexer unusable on
> systems with smaller amounts of memory (e.g. mobile devices, which
> typically have only 512MB-2GB of RAM).
> 
> Making this even worse is that the indexer is long-lived *and* the
> default glibc allocator is unable to return the used memory back to the
> OS (probably due to memory fragmentation, though I have not confirmed
> this). Use of other allocators shows the temporary ballooning of memory
> during processing, but once that is done the memory is released and
> returned back to the OS. As such, this is not a memory leak, but it
> behaves like one on systems with the default glibc allocator, with
> akonadi_baloo_indexer taking increasingly large amounts of memory on the
> system that never get returned to the OS. (This is actually how I noticed
> the problem in the first place.)
> 
> The approach used to address this problem is to periodically commit data
> to the Xapian database. This happens uniformly and transparently to the
> AbstractIndexer subclasses. The exact behavior is controlled by the
> s_maxUncommittedItems constant, which is set arbitrarily to 100: after an
> indexer hits 100 uncommitted changes, the results are committed
> immediately. Caveats:
> 
> * This is not a guaranteed fix for the memory fragmentation issue
> experienced with glibc: it is still possible for the memory to grow
> slowly over time as each smaller commit leaves some % of un-releasable
> memory due to fragmentation. It has helped with day to day usage here,
> but in the "100k mails in a maildir structure" test memory did still
> balloon upwards.
> 
> * It makes indexing non-atomic from akonadi's perspective: data fed to
> akonadi_baloo_indexer to be indexed may show up in chunks and even, in
> the case of a crash of the indexer, be only partially added to the
> database.
> 
> Alternative approaches (not necessarily mutually exclusive to this patch
> or each other):
> 
> * send smaller data sets from akonadi to akonadi_baloo_indexer for
> processing. This would allow akonadi_baloo_indexer to retain the atomic
> commit approach while avoiding the worst of the Xapian memory usage; it
> would not address the issue of memory fragmentation
> * restart akonadi_baloo_indexer process from time to time; this would
> resolve the fragmentation-over-time issue but not the massive memory
> usage due to atomically indexing large datasets
> * improve Xapian's chert backend (to become default in 1.4) to not
> fragment memory so much; this would not address the issue of massive
> memory usage due to atomically indexing large datasets
> * use an allocator other than glibc's; this would not address the issue
> of massive memory usage due to atomically indexing large datasets
> 
> Diffs
> -----
> 
> src/pim/agent/emailindexer.cpp 05f80cf 
> src/pim/agent/abstractindexer.h 8ae6f5c 
> src/pim/agent/abstractindexer.cpp fa9e96f 
> src/pim/agent/akonotesindexer.h 83f36b7 
> src/pim/agent/akonotesindexer.cpp ac3e66c 
> src/pim/agent/contactindexer.h 49dfdeb 
> src/pim/agent/contactindexer.cpp a5a6865 
> src/pim/agent/emailindexer.h 9a5e5cf 
> 
> Diff: https://git.reviewboard.kde.org/r/116692/diff/
> 
> 
> Testing
> -------
> 
> I have been running with the patch for a couple of days and one other
> person on IRC has tested an earlier (but functionally equivalent)
> version. Rather than reaching the common 250MB+ during regular usage it
> now idles at ~20MB (up from ~7MB when first started; so some
> fragmentation remains as noted in the description, but with far better
> long-term results).
> 
> Thanks,
> 
> Aaron J. Seigo
> 
> 
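
The periodic-commit approach described in the quoted review request can be sketched as follows. This is an illustration only, not the code from the patch: FakeDatabase is a stand-in so that the sketch compiles without Xapian (with the real library the corresponding calls would be Xapian::WritableDatabase::add_document() and commit()), and BatchingIndexer is a made-up name. Only the s_maxUncommittedItems constant and its value of 100 come from the request.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a writable Xapian database; models only the two operations
// the sketch needs. Pending documents live in memory until commit().
class FakeDatabase {
public:
    void addDocument(const std::string& doc) { m_pending.push_back(doc); }

    void commit()
    {
        m_committed += m_pending.size();
        m_pending.clear();
        ++m_commitCount;
    }

    std::size_t pending() const { return m_pending.size(); }
    std::size_t committed() const { return m_committed; }
    std::size_t commitCount() const { return m_commitCount; }

private:
    std::vector<std::string> m_pending;
    std::size_t m_committed = 0;
    std::size_t m_commitCount = 0;
};

// Hypothetical wrapper: commits as soon as the number of uncommitted
// items reaches the threshold, so pending data never grows unbounded.
class BatchingIndexer {
public:
    // Value taken from the review request, where it is called arbitrary.
    static constexpr std::size_t s_maxUncommittedItems = 100;

    explicit BatchingIndexer(FakeDatabase& db) : m_db(db) {}

    void index(const std::string& doc)
    {
        m_db.addDocument(doc);
        if (++m_uncommitted >= s_maxUncommittedItems) {
            flush();
        }
    }

    // Commits whatever is still pending; call once at the end of a run.
    void flush()
    {
        if (m_uncommitted == 0) {
            return;
        }
        m_db.commit();
        m_uncommitted = 0;
    }

private:
    FakeDatabase& m_db;
    std::size_t m_uncommitted = 0;
};
```

Feeding 250 items through index() triggers two commits of 100 items each and leaves 50 pending until flush() is called. That intermediate state is the non-atomicity the request lists as a caveat.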




