From kde-devel Fri Feb 27 14:08:02 2015 From: "Aaron J. Seigo" Date: Fri, 27 Feb 2015 14:08:02 +0000 To: kde-devel Subject: Re: Review Request 120438: improve baloo file indexing overhead by ~2.5x Message-Id: <20150227140802.12883.85283 () probe ! kde ! org> X-MARC-Message: https://marc.info/?l=kde-devel&m=142504611129259 MIME-Version: 1 Content-Type: multipart/mixed; boundary="--===============8330820678844667509==" --===============8330820678844667509== Content-Type: multipart/alternative; boundary="===============6708298779760466460==" --===============6708298779760466460== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://git.reviewboard.kde.org/r/120438/ ----------------------------------------------------------- (Updated Feb. 27, 2015, 2:08 p.m.) Status ------ This change has been discarded. Review request for Baloo. Repository: baloo Description ------- Current Baloo's file indexer takes batches of files by its command line and processes them async. There are several iterations over and creation of various collection classes. This patch set does away with all of this. Instead, the worker process now is directed over stdin and responds over stdout (old school UNIX!). This is done using a very simple protocol which is fully documented in docs/ and is encapsulated in a convenience class so as to make future maintenance relatively painless in the advent that the protocol is adjusted or an entirely new system is put inplace. IOW, the next me should have an easier time improving on what is there. This allows the worker process to be long(er)-lived and moves more of the relevant bookkeeping to the same place (the worker process). The resulting code is about the same size but cleaner and much more performant. The tracking of file progress is also moved to the (less expensive, easier to update incrementally) sqlite db where file data is already kept. It also versions the sqlite schema so that runtime upgrades can be made (as required here) both now and in the future without user intervention. Finally, a tiny (93 LOC, including header) executable that wraps the protocol encapsulation class is provided for 100% backwards compatibility of *all* command line options that currently exist. This makes it a 100% drop-in replacement. Performance of the affected unit tests is ~2.5x better on my laptop: 7.6 v 3.0 seconds. Known weaknesses: * still isn't hardened against multiple processes indexing simultaneously. the added sqlite columns are a step in this direction, howeer * the DBus signal that was being emitted was removed and has not been replaced with anything yet. That is because the affected class is in baloo-widgets (which happens to be the only user of the relevant *public* class in the baloo file library) and I have not yet worked on that, pending this patch set being approved in principle. I am actually of the suspicion that keeping that data synchronized at all times is not needed .. but if it is, there are much better ways than DBus. Note that the DBus overhead was applied to *every single file indexed* even if Dolphin (the only user of this feature) wasn't running or had the metadata frame open. The base idea behind this patch is to make file indexing as blazing fast as possible, which means using as few resources, CPU & memory, as possible. That ought to be the #1 priority for Baloo when it comes to indexing files, both to keep it away from user's notice on desktop class hardware as well as extend its relevance to smaller devices. The code in this patch is at least as clear (imho more so, actually) than what it replaces and was written for maintainability. Diffs ----- src/file/extractor/app.cpp PRE-CREATION src/file/extractor/baloo_file_extractor.h a8cff51ae763c36fb2934c80c04cbf0f710a01c7 src/file/extractor/baloo_file_extractor.cpp b2e7f97f2f0537807c314e2aca008d67711a69ce src/file/extractor/extractorworker.cpp e710908dc8b02b7edc9f40e44b9d5d5883986846 src/file/extractor/main.cpp 606a1eb9d42b7079fee497a3fdffbe2327512c13 src/file/fileindexingjob.h PRE-CREATION src/file/fileindexingjob.cpp PRE-CREATION src/file/fileindexingqueue.h 5b1c612feb84046acf36b3dec3bf890da2cbf650 src/file/fileindexingqueue.cpp c6479cd25308bee9bb6fe6e6791391e6ca83bddf src/file/indexingqueue.cpp c8ae4d9d1436ebe222a4758bd5ce551e091ec714 src/file/lib/CMakeLists.txt fb158a5f00f07ab48de963bf72924c83fc1c6673 src/file/lib/extractorclient.h 1e2ae09c7bd0068ca185b814c5c4996c0b563856 src/file/lib/extractorclient.cpp 221b90abe1899c27c03cf77e279fb71afce80b4d src/file/lib/filemapping.cpp 5769fe23576c341917e8f37f7dd2743dd446a6e1 src/file/main.cpp 41855215c7b550c403d5ef523d29f9b73db58cb0 src/file/metadatamover.cpp 5117687c1247275235c3830b9d2234bfff353748 src/file/util.h a4cc2423fd46cda7146b02c572cc3f4ccd069e3e src/file/util.cpp 2f7ac6de0dddc7d71c2b36bb4fe88677bfe52336 src/file/autotest/fileindexingjob/fileindexingjobtest.h PRE-CREATION src/file/autotest/fileindexingjob/fileindexingjobtest.cpp PRE-CREATION src/file/autotest/scheduler/schedulertest.cpp b91a481138382b609068b4737c654ac5182c37ec src/file/basicindexingjob.h 3d2fb339e38be862332ddca45b4926f8000fa0c1 src/file/basicindexingjob.cpp 8d70fa2b1ca737eaa8d2e717a0eb2dbf430fac84 src/file/basicindexingqueue.cpp 8484c9b6868890cae6c2e19b29ea92d500aac03e src/file/commitqueue.cpp b26f73cfdbc390cceef01aa6abd2b5af6f9689b2 src/file/database.h 6548de7dbce003112dad8c58c39c624dd806a261 src/file/database.cpp 5c9ccdc442d230bdc5dc9edbe5bc595537d08a8f src/file/extractor/CMakeLists.txt 7cf3478f547f7fbcdf8908a3471b50fab09b5220 src/file/extractor/app.h 8059a16e8ec2e2a84d1a3b7eb1dd8cfc2a3136de .gitignore 1377554ebea6f98a2c748183bc5a96852af12ac2 docs/file_extractor.md 3d830e2313bd35994f9fd1d6b52df422d0556fd8 src/file/CMakeLists.txt 1a9e6bf8f2c70b096fbb2c6c9eb78feec9d06622 src/file/autotest/CMakeLists.txt ccb8575fa8c07f2b136d259ed63d43a09ad75a12 src/file/autotest/basicindexingjobtest.h PRE-CREATION src/file/autotest/basicindexingjobtest.cpp PRE-CREATION src/file/autotest/fileindexingjob/CMakeLists.txt PRE-CREATION src/file/autotest/fileindexingjob/extractor.cpp PRE-CREATION Diff: https://git.reviewboard.kde.org/r/120438/diff/ Testing ------- Indexed files, ran testsuite with 0 failures. Thanks, Aaron J. Seigo --===============6708298779760466460== MIME-Version: 1.0 Content-Type: text/html; charset="utf-8" Content-Transfer-Encoding: 7bit
This is an automatically generated e-mail. To reply, visit: https://git.reviewboard.kde.org/r/120438/

This change has been discarded.


Review request for Baloo.
By Aaron J. Seigo.

Updated Feb. 27, 2015, 2:08 p.m.

Repository: baloo

Description

Current Baloo's file indexer takes batches of files by its command line and processes them async. There are several iterations over and creation of various collection classes. This patch set does away with all of this.

Instead, the worker process now is directed over stdin and responds over stdout (old school UNIX!). This is done using a very simple protocol which is fully documented in docs/ and is encapsulated in a convenience class so as to make future maintenance relatively painless in the advent that the protocol is adjusted or an entirely new system is put inplace. IOW, the next me should have an easier time improving on what is there. This allows the worker process to be long(er)-lived and moves more of the relevant bookkeeping to the same place (the worker process). The resulting code is about the same size but cleaner and much more performant.

The tracking of file progress is also moved to the (less expensive, easier to update incrementally) sqlite db where file data is already kept. It also versions the sqlite schema so that runtime upgrades can be made (as required here) both now and in the future without user intervention.

Finally, a tiny (93 LOC, including header) executable that wraps the protocol encapsulation class is provided for 100% backwards compatibility of all command line options that currently exist.

This makes it a 100% drop-in replacement.

Performance of the affected unit tests is ~2.5x better on my laptop: 7.6 v 3.0 seconds.

Known weaknesses:

  • still isn't hardened against multiple processes indexing simultaneously. the added sqlite columns are a step in this direction, howeer
  • the DBus signal that was being emitted was removed and has not been replaced with anything yet. That is because the affected class is in baloo-widgets (which happens to be the only user of the relevant public class in the baloo file library) and I have not yet worked on that, pending this patch set being approved in principle. I am actually of the suspicion that keeping that data synchronized at all times is not needed .. but if it is, there are much better ways than DBus. Note that the DBus overhead was applied to every single file indexed even if Dolphin (the only user of this feature) wasn't running or had the metadata frame open.

The base idea behind this patch is to make file indexing as blazing fast as possible, which means using as few resources, CPU & memory, as possible. That ought to be the #1 priority for Baloo when it comes to indexing files, both to keep it away from user's notice on desktop class hardware as well as extend its relevance to smaller devices. The code in this patch is at least as clear (imho more so, actually) than what it replaces and was written for maintainability.

Testing

Indexed files, ran testsuite with 0 failures.

Diffs

  • src/file/extractor/app.cpp (PRE-CREATION)
  • src/file/extractor/baloo_file_extractor.h (a8cff51ae763c36fb2934c80c04cbf0f710a01c7)
  • src/file/extractor/baloo_file_extractor.cpp (b2e7f97f2f0537807c314e2aca008d67711a69ce)
  • src/file/extractor/extractorworker.cpp (e710908dc8b02b7edc9f40e44b9d5d5883986846)
  • src/file/extractor/main.cpp (606a1eb9d42b7079fee497a3fdffbe2327512c13)
  • src/file/fileindexingjob.h (PRE-CREATION)
  • src/file/fileindexingjob.cpp (PRE-CREATION)
  • src/file/fileindexingqueue.h (5b1c612feb84046acf36b3dec3bf890da2cbf650)
  • src/file/fileindexingqueue.cpp (c6479cd25308bee9bb6fe6e6791391e6ca83bddf)
  • src/file/indexingqueue.cpp (c8ae4d9d1436ebe222a4758bd5ce551e091ec714)
  • src/file/lib/CMakeLists.txt (fb158a5f00f07ab48de963bf72924c83fc1c6673)
  • src/file/lib/extractorclient.h (1e2ae09c7bd0068ca185b814c5c4996c0b563856)
  • src/file/lib/extractorclient.cpp (221b90abe1899c27c03cf77e279fb71afce80b4d)
  • src/file/lib/filemapping.cpp (5769fe23576c341917e8f37f7dd2743dd446a6e1)
  • src/file/main.cpp (41855215c7b550c403d5ef523d29f9b73db58cb0)
  • src/file/metadatamover.cpp (5117687c1247275235c3830b9d2234bfff353748)
  • src/file/util.h (a4cc2423fd46cda7146b02c572cc3f4ccd069e3e)
  • src/file/util.cpp (2f7ac6de0dddc7d71c2b36bb4fe88677bfe52336)
  • src/file/autotest/fileindexingjob/fileindexingjobtest.h (PRE-CREATION)
  • src/file/autotest/fileindexingjob/fileindexingjobtest.cpp (PRE-CREATION)
  • src/file/autotest/scheduler/schedulertest.cpp (b91a481138382b609068b4737c654ac5182c37ec)
  • src/file/basicindexingjob.h (3d2fb339e38be862332ddca45b4926f8000fa0c1)
  • src/file/basicindexingjob.cpp (8d70fa2b1ca737eaa8d2e717a0eb2dbf430fac84)
  • src/file/basicindexingqueue.cpp (8484c9b6868890cae6c2e19b29ea92d500aac03e)
  • src/file/commitqueue.cpp (b26f73cfdbc390cceef01aa6abd2b5af6f9689b2)
  • src/file/database.h (6548de7dbce003112dad8c58c39c624dd806a261)
  • src/file/database.cpp (5c9ccdc442d230bdc5dc9edbe5bc595537d08a8f)
  • src/file/extractor/CMakeLists.txt (7cf3478f547f7fbcdf8908a3471b50fab09b5220)
  • src/file/extractor/app.h (8059a16e8ec2e2a84d1a3b7eb1dd8cfc2a3136de)
  • .gitignore (1377554ebea6f98a2c748183bc5a96852af12ac2)
  • docs/file_extractor.md (3d830e2313bd35994f9fd1d6b52df422d0556fd8)
  • src/file/CMakeLists.txt (1a9e6bf8f2c70b096fbb2c6c9eb78feec9d06622)
  • src/file/autotest/CMakeLists.txt (ccb8575fa8c07f2b136d259ed63d43a09ad75a12)
  • src/file/autotest/basicindexingjobtest.h (PRE-CREATION)
  • src/file/autotest/basicindexingjobtest.cpp (PRE-CREATION)
  • src/file/autotest/fileindexingjob/CMakeLists.txt (PRE-CREATION)
  • src/file/autotest/fileindexingjob/extractor.cpp (PRE-CREATION)

View Diff

--===============6708298779760466460==-- --===============8330820678844667509== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Content-Disposition: inline Cj4+IFZpc2l0IGh0dHA6Ly9tYWlsLmtkZS5vcmcvbWFpbG1hbi9saXN0aW5mby9rZGUtZGV2ZWwj dW5zdWIgdG8gdW5zdWJzY3JpYmUgPDwK --===============8330820678844667509==--