From kde-pim Thu Dec 18 13:34:58 2014
From: Daniel Vrátil
Date: Thu, 18 Dec 2014 13:34:58 +0000
To: kde-pim
Subject: Re: [Kde-pim] akonadinext update: entity processing pipelines in resources
Message-Id: <1442037.6kDXILcMDX () thor>
X-MARC-Message: https://marc.info/?l=kde-pim&m=141890971926015

On Thursday, December 18, 2014 09:55:30 AM Aaron J. Seigo wrote:
> On Wednesday, December 17, 2014 14.49:53 you wrote:
> > On Wednesday 17 December 2014 11:39:10 Aaron J. Seigo wrote:
> > > currently, pipelines are just a simple one-after-the-other processing
> > > affair. It is set up already for asynchronous processing, however.
> > > Eventually I would like to allow filters to note that they can be
> > > parallelized, should be run in a separate thread, ... mostly so that
> > > we can increase throughput.
> >
> > This, imo, will kill user-configuration. You do not want to burden the
> > user with a GUI where he can define dependencies etc. pp.
>
> Yes, that would make no sense. So that's probably not what I'm proposing :)
>
> > Also, I cannot think of any common use-case of mail filtering that could
> > be parallelized for a single mail:
>
> Christian already pointed to it, but:
>
> 	https://community.kde.org/KDE_PIM/Akonadi_Next/Terminology
>
> Mail filtering is a specific use case, but the abstract concept is
> "processing an entity for content". Evidently the word "filter" is causing
> confusion, and that's perhaps understandable since the word has meaning in
> the scope of email. (.. and of course, Akonadi is not, strictly speaking,
> even an email system; it's a system that can be used to manage email stores
> ..)
>
> Better suggestions for the word "filter" are welcome. We are early enough
> that we can change these terms.
>
> > What other, _common_ use case do you think of that would benefit from the
> > additional design overhead?
>
> The point of having pipelines is to ensure all post-delivery processing is
> done before clients start showing (wrong) data. Filters that move an email
> between folders, for instance, should be run *before* showing the email in
> the wrong folder in the client.
>
> So, real world use cases:
>
> 1. a mail filter that moves an email to a folder
> 2. a scam detector (currently this lives in libmessageviewer!)
> 3. full text indexer
> 4. threading agent (relies on knowing which folder it is in)
> 5. a mail filter that flags mails from your boss as important
> 6. an event checker that flags conflicts between incoming events and
>    existing ones
>
> 1, 2, 3, 4 and 6 do not modify the entity itself; they touch indexes, but
> not the entity. Number 5 does.
>
> Number 1 needs to be run before numbers 3 and 4, but can be run in parallel
> with 2 and 5 (which also need to be run before 3 and 4). 3 and 4 can be run
> in parallel. 6 may run on emails and on calendar events, does not touch the
> entities, and nothing depends on its output.
>
> the graph that comes from that is self-evident once all the information is
> known .. but that's the trick: making sure each element can provide enough
> useful, machine-processable information to know what the graph should be.
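Just to make sure I read "machine-processable information" correctly, here is
a rough sketch - all names made up, nothing like this exists in Akonadi Next -
of preprocessors declaring which aspects of an entity they need and which they
provide, and of a scheduler deriving the execution stages from those
declarations alone:

// Hypothetical sketch, not Akonadi Next API: preprocessors declare the aspects
// of an entity they need and the aspects they provide; the execution graph is
// derived from those declarations, never configured by hand.

#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <vector>

struct Preprocessor {
    std::string name;
    std::set<std::string> needs;      // aspects that must be settled before it runs
    std::set<std::string> provides;   // aspects it settles
};

// Group preprocessors into stages: everything inside one stage could run in
// parallel, the stages themselves run one after the other.
std::vector<std::vector<std::string>> buildStages(std::vector<Preprocessor> procs)
{
    std::vector<std::vector<std::string>> stages;
    std::set<std::string> available;                // aspects settled so far
    while (!procs.empty()) {
        std::vector<std::string> stage;
        std::set<std::string> produced;
        for (auto it = procs.begin(); it != procs.end();) {
            const bool ready = std::includes(available.begin(), available.end(),
                                             it->needs.begin(), it->needs.end());
            if (ready) {
                stage.push_back(it->name);
                produced.insert(it->provides.begin(), it->provides.end());
                it = procs.erase(it);
            } else {
                ++it;
            }
        }
        if (stage.empty())
            break;                                   // dependency cannot be satisfied
        available.insert(produced.begin(), produced.end());
        stages.push_back(std::move(stage));
    }
    return stages;
}

int main()
{
    // Use cases 1-5 from the list above, expressed as declarations:
    const std::vector<Preprocessor> procs = {
        { "move-to-folder",  {},           { "folder" } },    // #1
        { "scam-detector",   {},           {} },              // #2
        { "important-flag",  {},           { "flags" } },     // #5
        { "fulltext-index",  { "folder" }, {} },              // #3
        { "threading-agent", { "folder" }, {} },              // #4
    };
    for (const auto &stage : buildStages(procs)) {
        std::cout << "stage:";
        for (const auto &name : stage)
            std::cout << ' ' << name;
        std::cout << '\n';
    }
}

With those declarations the first stage comes out as 1, 2 and 5 and the second
stage as 3 and 4, which matches the graph described above; the point is that
nobody has to configure that ordering by hand.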
> as for user configuration, they may wish to not have scam detection on
> (e.g.). with that off, the set of filters that are run changes (in this
> case #2 is just not run at all) and the graph changes as a result as well.
>
> additionally, 1 and 5 are obviously generated from user configuration. the
> user won't know that, but that is what will be happening: their filters will
> be creating nodes in the pipeline.
>
> as for why to parallelize, that's simple: throughput.

I think we should think here about what the scope of parallelization should
be: do we want to run a single email instance through multiple filters in
parallel, or do we want to process multiple emails at once in parallel
pipelines?

I think that trying to run multiple filters on one email in parallel does not
make much sense. Unless you have real hard numbers to back this up, the
performance gain simply does not outweigh the complexity of the code needed
to manage the filter graph (to detect which filters can be executed in
parallel, and when). This will not improve the throughput.

On the other hand, we really want to be able to process multiple emails in
parallel - for instance during sync. Having 4 or so identical pipelines
running in threads and distributing incoming emails between them evenly would
be a massive performance boost IMO. It would also reduce the complexity of
the filter-management code, as you would have only 3 types of filters:

* Pre-pipeline filters - filters that each entity has to pass before entering
  the pipeline. There is only one instance of each filter, and it is not
  parallelized, so this has to be a super-fast filter. I listed it mostly
  just so that the list is complete. The only case I can think of is
  balancing the incoming entities between the parallelized pipelines.

* Pipeline filters - the filters are simply chained (= pipeline). There are
  multiple instances of the pipeline, and each instance has its own thread.
  This handles indexing, mail filtering, etc.

* Post-pipeline filters - same as pre-pipeline filters, just executed after
  the entity leaves the pipeline. Could be the threading filter, for example.

All you need to specify for each filter is its type (Pre, Pipeline, Post) and
its weight, which enforces the order of the filters in the chain (e.g. the
mail filter filter - see why I prefer "preprocessor" to "filter" here? :D -
should run before the indexer, etc.).
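To make that a bit more concrete, a minimal sketch of the filter classes I
have in mind - purely hypothetical types and names, none of this exists in
Akonadi Next - where a filter declares nothing but its type and its weight,
and each of the N identical pipeline instances simply runs its Pipeline-type
filters in weight order (threads and the queue between the balancing step and
the pipelines are left out to keep it short):

// Hypothetical sketch of the Pre/Pipeline/Post classification with weight
// ordering; illustration only, not an actual Akonadi Next API.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

struct Entity { std::string id; };                   // stand-in for a mail/event

class Filter {
public:
    enum Type { PrePipeline, Pipeline, PostPipeline };
    Filter(Type type, int weight) : m_type(type), m_weight(weight) {}
    virtual ~Filter() = default;
    virtual void process(Entity &entity) = 0;
    Type type() const { return m_type; }
    int weight() const { return m_weight; }           // lower weight runs earlier
private:
    Type m_type;
    int m_weight;
};

struct MailFilter : Filter {
    MailFilter() : Filter(Pipeline, 10) {}
    void process(Entity &e) override { std::printf("filtering %s\n", e.id.c_str()); }
};

struct Indexer : Filter {
    Indexer() : Filter(Pipeline, 20) {}               // higher weight -> runs after the mail filter
    void process(Entity &e) override { std::printf("indexing %s\n", e.id.c_str()); }
};

// One of the N identical pipeline instances; each would live in its own thread
// and simply run its Pipeline-type filters in weight order.
class PipelineInstance {
public:
    void addFilter(std::shared_ptr<Filter> f)
    {
        m_filters.push_back(std::move(f));
        std::sort(m_filters.begin(), m_filters.end(),
                  [](const auto &a, const auto &b) { return a->weight() < b->weight(); });
    }
    void run(Entity &entity)
    {
        for (const auto &f : m_filters)
            f->process(entity);
    }
private:
    std::vector<std::shared_ptr<Filter>> m_filters;
};

int main()
{
    // The pre-pipeline "balancing" step: hand incoming entities to the
    // pipelines round-robin (the real thing would enqueue to their threads).
    std::vector<PipelineInstance> pipelines(4);
    for (auto &p : pipelines) {
        p.addFilter(std::make_shared<Indexer>());
        p.addFilter(std::make_shared<MailFilter>());
    }
    std::vector<Entity> incoming = { { "mail-1" }, { "mail-2" }, { "mail-3" } };
    std::size_t next = 0;
    for (auto &entity : incoming) {
        pipelines[next].run(entity);
        next = (next + 1) % pipelines.size();
    }
}

The only ordering knob is the weight; compared to a full dependency graph that
keeps the filter-management code trivial.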
Dan

> as you note, we should be able to parallelize processing of individual
> emails, but even then only to an extent. the threading agent is much
> simpler if it is only ever processing one email at a time, so maybe we
> never want it to be running in parallel, while the scam detector perhaps
> ought to be running in as many individual pipelines as possible at once.
>
> additionally, some processes take more time than others and block yet
> others. running 1, 2, 3 and 6 in parallel will get us to 4 and 5 that much
> faster. throughput, plain and simple.
>
> we are thinking about all of these issues with datasets of 100s of 1000s of
> folders / emails in a single collection in mind. Kolab Systems has clients
> with exactly such data sets, in fact.
>
> hope that helps clear up some things. if not, keep asking :)

--
Daniel Vrátil | dvratil@redhat.com | dvratil on #kde-devel, #kontact, #akonadi
Software Engineer - KDE Desktop Team, Red Hat Inc.
GPG Key: 0xC59D614F6F4AE348
Fingerprint: 4EC1 86E3 C54E 0B39 5FDD B5FB C59D 614F 6F4A E348