From kde-pim Thu Dec 18 13:34:58 2014
From: Daniel Vrátil
Date: Thu, 18 Dec 2014 13:34:58 +0000
To: kde-pim
Subject: Re: [Kde-pim] akonadinext update: entity processing pipelines in resources
Message-Id: <1442037.6kDXILcMDX () thor>
X-MARC-Message: https://marc.info/?l=kde-pim&m=141890971926015

On Thursday, December 18, 2014 09:55:30 AM Aaron J. Seigo wrote:
> On Wednesday, December 17, 2014 14.49:53 you wrote:
> > On Wednesday 17 December 2014 11:39:10 Aaron J. Seigo wrote:
> > > currently, pipelines are just a simple one-after-the-other processing
> > > affair. It is set up already for asynchronous processing, however.
> > > Eventually I would like to allow filters to note that they can be
> > > parallelized, should be run in a separate thread, ... mostly so that
> > > we can increase throughput.
> >
> > This, imo, will kill user-configuration. You do not want to burden the
> > user with a GUI where he can define dependencies etc. pp.
>
> Yes, that would make no sense. So that's probably not what I'm proposing :)
>
> > Also, I cannot think of any common use-case of mail filtering that could
> > be parallelized for a single mail:
>
> Christian already pointed to it, but:
>
> 	https://community.kde.org/KDE_PIM/Akonadi_Next/Terminology
>
> Mail filtering is a specific use case, but the abstract concept is
> "processing an entity for content". Evidently the word "filter" is causing
> confusion, and that's perhaps understandable since the word has meaning in
> the scope of email. (.. and of course, Akonadi is not, strictly speaking,
> even an email system; it's a system that can be used to manage email stores
> ..)
>
> Better suggestions for the word "filter" are welcome. We are early enough
> that we can change these terms.
>
> > What other, _common_ use case do you think of that would benefit from the
> > additional design overhead?
>
> The point of having pipelines is to ensure all post-delivery processing is
> done before clients start showing (wrong) data. Filters that move an email
> between folders, for instance, should be run *before* showing the email in
> the wrong folder in the client.
>
> So, real world use cases:
>
> 1. a mail filter that moves an email to a folder
> 2. a scam detector (currently this lives in libmessageviewer!)
> 3. full text indexer
> 4. threading agent (relies on knowing which folder it is in)
> 5. a mail filter that flags mails from your boss as important
> 6. an event checker that flags conflicts between incoming events and
>    existing ones
>
> 1, 2, 3, 4 and 6 do not modify the entity itself; they touch indexes, but
> not the entity. Number 5 does.
>
> Number 1 needs to be run before numbers 3 and 4, but can be run in parallel
> with 2 and 5 (which also need to be run before 3 and 4). 3 and 4 can be run
> in parallel. 6 may run on emails and on calendar events, does not touch the
> entities, and nothing depends on its output.
>
> the graph that comes from that is self-evident once all the information is
> known .. but that's the trick: making sure each element can provide enough
> useful, machine-processable information to know what the graph should be.
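Just to make sure I read "machine-processable information" correctly, here is
a rough sketch - all names made up, nothing like this exists in Akonadi Next -
of preprocessors declaring which aspects of an entity they need and which they
provide, and of a scheduler deriving the execution stages from those
declarations alone:

// Hypothetical sketch, not Akonadi Next API: preprocessors declare the aspects
// of an entity they need and the aspects they provide; the execution graph is
// derived from those declarations, never configured by hand.

#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <vector>

struct Preprocessor {
    std::string name;
    std::set<std::string> needs;      // aspects that must be settled before it runs
    std::set<std::string> provides;   // aspects it settles
};

// Group preprocessors into stages: everything inside one stage could run in
// parallel, the stages themselves run one after the other.
std::vector<std::vector<std::string>> buildStages(std::vector<Preprocessor> procs)
{
    std::vector<std::vector<std::string>> stages;
    std::set<std::string> available;                // aspects settled so far
    while (!procs.empty()) {
        std::vector<std::string> stage;
        std::set<std::string> produced;
        for (auto it = procs.begin(); it != procs.end();) {
            const bool ready = std::includes(available.begin(), available.end(),
                                             it->needs.begin(), it->needs.end());
            if (ready) {
                stage.push_back(it->name);
                produced.insert(it->provides.begin(), it->provides.end());
                it = procs.erase(it);
            } else {
                ++it;
            }
        }
        if (stage.empty())
            break;                                   // dependency cannot be satisfied
        available.insert(produced.begin(), produced.end());
        stages.push_back(std::move(stage));
    }
    return stages;
}

int main()
{
    // Use cases 1-5 from the list above, expressed as declarations:
    const std::vector<Preprocessor> procs = {
        { "move-to-folder",  {},           { "folder" } },    // #1
        { "scam-detector",   {},           {} },              // #2
        { "important-flag",  {},           { "flags" } },     // #5
        { "fulltext-index",  { "folder" }, {} },              // #3
        { "threading-agent", { "folder" }, {} },              // #4
    };
    for (const auto &stage : buildStages(procs)) {
        std::cout << "stage:";
        for (const auto &name : stage)
            std::cout << ' ' << name;
        std::cout << '\n';
    }
}

With those declarations the first stage comes out as 1, 2 and 5 and the second
stage as 3 and 4, which matches the graph described above; the point is that
nobody has to configure that ordering by hand.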
> as for user configuration, they may wish to not have scam detection on
> (e.g.). with that off, the set of filters that are run changes (in this
> case #2 is just not run at all) and the graph changes as a result as well.
>
> additionally, 1 and 5 are obviously generated from user configuration. the
> user won't know that, but that is what will be happening: their filters will
> be creating nodes in the pipeline.
>
> as for why to parallelize, that's simple: throughput.

I think we should think here about what the scope of parallelization should
be: do we want to run a single email instance through multiple filters in
parallel, or do we want to process multiple emails at once in parallel
pipelines?

I think that trying to run multiple filters on one email in parallel does not
make much sense. Unless you have real hard numbers to back this up, the
performance gain simply does not outweigh the complexity of the code needed
to manage the filter graph (to detect which filters can be executed in
parallel, and when). This will not improve the throughput.

On the other hand, we really want to be able to process multiple emails in
parallel - for instance during sync. Having 4 or so identical pipelines
running in threads and distributing incoming emails between them evenly would
be a massive performance boost IMO. It would also reduce the complexity of
the filter-management code, as you would have only 3 types of filters:

* Pre-pipeline filters - filters that each entity has to pass before entering
  the pipeline. There is only one instance of each filter, and it is not
  parallelized, so this has to be a super-fast filter. I listed it mostly
  just so that the list is complete. The only case I can think of is
  balancing the incoming entities between the parallelized pipelines.

* Pipeline filters - the filters are simply chained (= pipeline). There are
  multiple instances of the pipeline, and each instance has its own thread.
  This handles indexing, mail filtering, etc.

* Post-pipeline filters - same as pre-pipeline filters, just executed after
  the entity leaves the pipeline. Could be the threading filter, for example.

All you need to specify for each filter is its type (Pre, Pipeline, Post) and
its weight, which enforces the order of the filters in the chain (e.g. the
mail filter filter - see why I prefer "preprocessor" to "filter" here? :D -
should run before the indexer, etc.).
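To make that a bit more concrete, a minimal sketch of the filter classes I
have in mind - purely hypothetical types and names, none of this exists in
Akonadi Next - where a filter declares nothing but its type and its weight,
and each of the N identical pipeline instances simply runs its Pipeline-type
filters in weight order (threads and the queue between the balancing step and
the pipelines are left out to keep it short):

// Hypothetical sketch of the Pre/Pipeline/Post classification with weight
// ordering; illustration only, not an actual Akonadi Next API.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

struct Entity { std::string id; };                   // stand-in for a mail/event

class Filter {
public:
    enum Type { PrePipeline, Pipeline, PostPipeline };
    Filter(Type type, int weight) : m_type(type), m_weight(weight) {}
    virtual ~Filter() = default;
    virtual void process(Entity &entity) = 0;
    Type type() const { return m_type; }
    int weight() const { return m_weight; }           // lower weight runs earlier
private:
    Type m_type;
    int m_weight;
};

struct MailFilter : Filter {
    MailFilter() : Filter(Pipeline, 10) {}
    void process(Entity &e) override { std::printf("filtering %s\n", e.id.c_str()); }
};

struct Indexer : Filter {
    Indexer() : Filter(Pipeline, 20) {}               // higher weight -> runs after the mail filter
    void process(Entity &e) override { std::printf("indexing %s\n", e.id.c_str()); }
};

// One of the N identical pipeline instances; each would live in its own thread
// and simply run its Pipeline-type filters in weight order.
class PipelineInstance {
public:
    void addFilter(std::shared_ptr<Filter> f)
    {
        m_filters.push_back(std::move(f));
        std::sort(m_filters.begin(), m_filters.end(),
                  [](const auto &a, const auto &b) { return a->weight() < b->weight(); });
    }
    void run(Entity &entity)
    {
        for (const auto &f : m_filters)
            f->process(entity);
    }
private:
    std::vector<std::shared_ptr<Filter>> m_filters;
};

int main()
{
    // The pre-pipeline "balancing" step: hand incoming entities to the
    // pipelines round-robin (the real thing would enqueue to their threads).
    std::vector<PipelineInstance> pipelines(4);
    for (auto &p : pipelines) {
        p.addFilter(std::make_shared<Indexer>());
        p.addFilter(std::make_shared<MailFilter>());
    }
    std::vector<Entity> incoming = { { "mail-1" }, { "mail-2" }, { "mail-3" } };
    std::size_t next = 0;
    for (auto &entity : incoming) {
        pipelines[next].run(entity);
        next = (next + 1) % pipelines.size();
    }
}

The only ordering knob is the weight; compared to a full dependency graph that
keeps the filter-management code trivial.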
Dan

> as you note, we should be able to parallelize processing of individual
> emails, but even then only to an extent. the threading agent is much
> simpler if it is only ever processing one email at a time, so maybe we
> never want it to be running in parallel, while the scam detector perhaps
> ought to be running in as many individual pipelines as possible at once.
>
> additionally, some processes take more time than others and block yet
> others. running 1, 2, 3 and 6 in parallel will get us to 4 and 5 that much
> faster. throughput, plain and simple.
>
> we are thinking about all of these issues with datasets of 100s of 1000s of
> folders / emails in a single collection in mind. Kolab Systems has clients
> with exactly such data sets, in fact.
>
> hope that helps clear up some things. if not, keep asking :)

--
Daniel Vrátil | dvratil@redhat.com | dvratil on #kde-devel, #kontact, #akonadi
Software Engineer - KDE Desktop Team, Red Hat Inc.
GPG Key: 0xC59D614F6F4AE348
Fingerprint: 4EC1 86E3 C54E 0B39 5FDD B5FB C59D 614F 6F4A E348