List: kde-devel
Subject: Re: Review Request 116692: Lower memory usage of akonadi_baloo_indexer with frequent commits
From: "Christian Mollekopf" <chrigi_1 () fastmail ! fm>
Date: 2014-03-21 10:58:35
Message-ID: 20140321105835.5913.64282 () probe ! kde ! org
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/116692/#review53637
-----------------------------------------------------------
It turned out that most of the memory was used by the ItemFetchJob loading all items into
memory. We've now optimized this, and for me the indexer never goes beyond ~250MB (initial
indexing) and stays around 10MB during normal usage. I made some experiments with notmuch mail
(which also uses Xapian), and it also stayed around 200MB. This could probably be tweaked
further by adjusting XAPIAN_FLUSH_THRESHOLD to lower the number of changes held in memory
between commits, but IMO 250MB for the initial indexing is a sane default.
The only optimization that I think would be viable is releasing the memory again using
malloc_free or similar (as we used to do in the Nepomuk indexer).
So have the recent fixes also addressed the memory consumption for you, or do you still think
this patch should go in?
- Christian Mollekopf
On March 10, 2014, 11:12 a.m., Aaron J. Seigo wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/116692/
> -----------------------------------------------------------
>
> (Updated March 10, 2014, 11:12 a.m.)
>
>
> Review request for Akonadi and Baloo.
>
>
> Repository: baloo
>
>
> Description
> -------
>
> Baloo is using Xapian for storing processed results from data fed to it by akonadi; in doing
> so it processes all the data it is sent to index, and only once this is complete is the data
> committed to the Xapian database. From
> http://xapian.org/docs/apidoc/html/classXapian_1_1WritableDatabase.html#acbea2163142de795024880a7123bc693
> we see: "For efficiency reasons, when performing multiple updates to a database it is best
> (indeed, almost essential) to make as many modifications as memory will permit in a single
> pass through the database. To ensure this, Xapian batches up modifications." This means that
> *all* the data to be stored in the Xapian database first ends up in RAM. When indexing large
> mailboxes (or any other large chunk of data) this results in a very large amount of memory
> allocation. On one test of 100k mails in a maildir folder this resulted in 1.5GB of RAM used.
> In normal daily usage with maildir I find that it easily balloons to several hundred
> megabytes within days. This makes the Baloo indexer unusable on systems with smaller amounts
> of memory (e.g. mobile devices, which typically have only 512MB-2GB of RAM).
>
> Making this even worse is that the indexer is both long-lived *and* the default glibc
> allocator is unable to return the used memory back to the OS (probably due to memory
> fragmentation, though I have not confirmed this). Use of other allocators shows the temporary
> ballooning of memory during processing, but once that is done the memory is released and
> returned to the OS. As such, this is not a memory leak... but it behaves like one on systems
> with the default glibc allocator, with akonadi_baloo_indexer taking increasingly large
> amounts of memory that never get returned to the OS. (This is actually how I noticed the
> problem in the first place.)
> The approach used to address this problem is to periodically commit data to the Xapian
> database. This happens uniformly and transparently to the AbstractIndexer subclasses. The
> exact behavior is controlled by the s_maxUncommittedItems constant, which is set arbitrarily
> to 100: after an indexer hits 100 uncommitted changes, the results are committed immediately.
> Caveats:
> * This is not a guaranteed fix for the memory fragmentation issue experienced with glibc: it
> is still possible for memory to grow slowly over time, as each smaller commit leaves some
> percentage of un-releasable memory due to fragmentation. It has helped with day-to-day usage
> here, but in the "100k mails in a maildir structure" test, memory did still balloon upwards.
> * It makes indexing non-atomic from akonadi's perspective: data fed to akonadi_baloo_indexer
> to be indexed may show up in chunks and even, in the case of a crash of the indexer, be only
> partially added to the database.
> Alternative approaches (not necessarily mutually exclusive to this patch or each other):
>
> * Send smaller data sets from akonadi to akonadi_baloo_indexer for processing. This would
> allow akonadi_baloo_indexer to retain the atomic commit approach while avoiding the worst of
> the Xapian memory usage; it would not address the issue of memory fragmentation.
> * Restart the akonadi_baloo_indexer process from time to time; this would resolve the
> fragmentation-over-time issue but not the massive memory usage due to atomically indexing
> large datasets.
> * Improve Xapian's chert backend (to become the default in 1.4) to not fragment memory so
> much; this would not address the issue of massive memory usage due to atomically indexing
> large datasets.
> * Use an allocator other than glibc's; this would not address the issue of massive memory
> usage due to atomically indexing large datasets.
>
> Diffs
> -----
>
> src/pim/agent/emailindexer.cpp 05f80cf
> src/pim/agent/abstractindexer.h 8ae6f5c
> src/pim/agent/abstractindexer.cpp fa9e96f
> src/pim/agent/akonotesindexer.h 83f36b7
> src/pim/agent/akonotesindexer.cpp ac3e66c
> src/pim/agent/contactindexer.h 49dfdeb
> src/pim/agent/contactindexer.cpp a5a6865
> src/pim/agent/emailindexer.h 9a5e5cf
>
> Diff: https://git.reviewboard.kde.org/r/116692/diff/
>
>
> Testing
> -------
>
> I have been running with the patch for a couple of days, and one other person on IRC has
> tested an earlier (but functionally equivalent) version. Rather than reaching the common
> 250MB+ during regular usage, it now idles at ~20MB (up from ~7MB when first started; so some
> fragmentation remains, as noted in the description, but with far better long-term results).
>
> Thanks,
>
> Aaron J. Seigo
>
>