From kde-pim  Wed Dec 07 20:50:33 2011
From: Ingo =?utf-8?q?Kl=C3=B6cker?= <kloecker () kde ! org>
Date: Wed, 07 Dec 2011 20:50:33 +0000
To: kde-pim
Subject: Re: [Kde-pim] Character encoding handling in KMail/KMime
Message-Id: <201112072150.34232 () thufir ! ingo-kloecker ! de>
X-MARC-Message: https://marc.info/?l=kde-pim&m=132329114704367
MIME-Version: 1
Content-Type: multipart/mixed; boundary="--===============1760408112398616319=="

--===============1760408112398616319==
Content-type: multipart/signed; boundary=nextPart1547773.n96Ss6qnlu;
 protocol="application/pgp-signature"; micalg=pgp-sha1
Content-transfer-encoding: 7bit

--nextPart1547773.n96Ss6qnlu
Content-Type: Text/Plain;
  charset="utf-8"
Content-Transfer-Encoding: quoted-printable

On Tuesday 29 November 2011, Andras Mantia wrote:
> Ingo Klöcker wrote:
> > On Monday 28 November 2011, Andras Mantia wrote:
> >> Hi,
> >>
> >> on the weekend I fixed a problem introduced with the HTML reply
> >> branch, where the characters in the reply were wrongly encoded.
> >> While trying to find my way around the code and see where the
> >> strings get broken, I realized that what we do there with the
> >> character encodings is just insane. The messages are encoded and
> >> decoded often (and here I refer both to charset encoding and to
> >> string encoding for transfer), while the message does not leave
> >> KMail code (or its libraries, like composer, templateparser).
> >> This of course also includes a lot of QString->QByteArray and
> >> QByteArray->QString conversions. In my opinion there should be
> >> exactly two conversions: when the message is read into KMime and
> >> when it is written out from KMime (to disk, sent through the
> >> network or in some other way). For all the rest it should be
> >> just Unicode in a QString.
> >>
> >> Of course this is an intrusive change and KDE 5 (or whatever
> >> comes next) material. What do you think of it? Does it even make
> >> sense to go in this direction? Any better ideas?
> >
> > This is difficult to answer because it depends on what the
> > representation of the message is used for. Of course, it makes
> > sense to reduce the conversions as much as possible, but I don't
> > think that using a Unicode representation of the whole message
> > makes sense. Do you mean the message text?
>
> I was mainly referring to the message body, but user-editable
> headers like from/to/subject also have the same issue.
>
> > Actually, I think the message as a whole should never be serialized
> > as Unicode string because it is never needed in this form.
> > (Correct me if I'm wrong.) The message should either be in KMime
> > or it should be serialized into a QByteArray (for storage on disk
> > and sending). Apart from this only the body and the header content
> > of individual message parts should be converted to Unicode for
> > display, composing and other tasks where Unicode is really needed.
>
> I think inside the app those parts should be Unicode. Here is what
> happens right now when you click on a message to view and reply to
> it:
> - the message viewer gets a KMime::Message (that has a QByteArray
> string and the encoding in the header). The body is converted to a
> Unicode string before it is displayed, based on the encoding in the
> message or the override encoding specified in the settings.
> - when replying, the KMime::Message is passed together with a
> selection to the messagefactory to create the reply message. The
> message is in the original encoding, the selection is in Unicode
> (comes from the viewer).
> - the message is passed to the template parser. This creates an
> object tree parser (OTP) that, as in the case of the viewer,
> converts the message to Unicode. It then creates the reply message
> content as a KMime::Content, where the reply (Unicode) string is
> converted to the charset selected as the default for the composer
> and put into the KMime::Content.
> - if "force original charset" is used, the message is converted
> back to Unicode, the original charset applied, and the result
> saved back into the message.
> - the message is passed to the composer window: this creates an
> OTP that again converts the message to Unicode for displaying.
> - I cannot find right now where this happens, but at one point the
> text from the editor (which is a QString AFAIK) is put back into
> the KMime::Message in an encoded form and finally sent through the
> network.

Yeah. That's mostly as it was in KMail1.


> This might make sense (individual parts always get a KMime::Message
> that is a real representation of an email, so it is not always
> unicode), but I find it to be:
> - suboptimal (too many conversions)
> - fragile. The encoding can go wrong in any place and it is hard to
> find out where this happened. The last bug was that the template
> parser saved back the data as unicode with a non-unicode encoding
> header. Then furthermore when this was converted to unicode, a
> double conversion was performed.

Well, the advantage is that there is a well-defined interface between
KMail's different components. This interface is KMime.
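
To make the double conversion from the bug quoted above concrete, here
is a minimal sketch. It is not KMail or KMime code: latin1ToUtf8() is a
hypothetical helper, and ISO-8859-1 is just an assumed example of a
"non-unicode encoding header". The helper does what a decoder does when
it trusts such a header: it treats every byte as one Latin-1 character.

```cpp
#include <string>

// Hypothetical helper, not KMime API: interpret each input byte as an
// ISO-8859-1 character and return the resulting text encoded as UTF-8.
std::string latin1ToUtf8(const std::string &bytes)
{
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            // ASCII is identical in ISO-8859-1 and UTF-8.
            out += static_cast<char>(c);
        } else {
            // Code points U+0080..U+00FF need two bytes in UTF-8.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

If the body already contains UTF-8 bytes (for example 0xC3 0xB6 for
"ö") but the header still claims ISO-8859-1, each of those two bytes is
converted separately, and the reader ends up with two characters
("Ã¶") where there should be one.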


> My rough idea is that KMime itself stores every string as Unicode
> and applies the encoding only through some special methods, like:
> - setBodyFromBytearray() - applies the encoding in the header and
> stores the result internally as Unicode
> - decodedMessage() (or named something like that) - returns a
> QByteArray in the right encoding, as specified in the header. This
> would return the whole assembled message.
> - we could have similar convenience methods to get the
> body/headers/message in the original encoding, similar to how we
> now have asUnicodeString() and as7BitString().
>
> So the difference is in how KMime stores the message internally.
>
> I hope it is now clearer what my idea is. Do you still think
> this is bad?

Yes. It will require KMime to treat text/* parts differently from
non-text parts. I think this will make KMime unnecessarily complex.
Moreover, this would require the content to be decoded each time the
KMime structure of a message is created. This could easily become a more
serious performance problem than the current one.

OTOH, maybe it can be solved without making KMime (the API) much more
complex by making all entities available in an encoded variant and a
decoded variant, both implementing the same interface. Or maybe both
variants are better hidden transparently behind the same interface using
composition. KMime could transparently switch between both internal
variants and keep both or only the most recently used variant in memory
depending on the general memory consumption. This is similar to your
idea, but avoids any unnecessary conversions because all conversions
would be done lazily on demand (as is done now) but with additional
internal caching of the conversion results. The advantages of such a
solution are that the API stays as it is (so no code using it needs to
be changed) and that it can be implemented separately for each entity.
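
A rough sketch of the lazy variant with caching could look like the
following. Everything in it is an assumption made for illustration:
LazyTextBody and its methods are hypothetical names, not existing KMime
API, it uses plain standard C++ instead of QString/QByteArray, and it
only handles ISO-8859-1 so that the decoding step stays trivial.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical class for illustration only; it is not part of KMime.
// The body is kept in its encoded form (ISO-8859-1 bytes here) and the
// decoded form is created on first use and cached afterwards.
class LazyTextBody
{
public:
    explicit LazyTextBody(std::string encoded)
        : m_encoded(std::move(encoded))
    {
    }

    // Encoded variant: always available, no conversion needed.
    const std::string &encoded() const { return m_encoded; }

    // Decoded variant: converted on demand, then served from the cache.
    const std::u32string &decoded() const
    {
        if (!m_cacheValid) {
            m_decoded.clear();
            for (unsigned char c : m_encoded) {
                // ISO-8859-1 bytes map 1:1 to Unicode code points.
                m_decoded += static_cast<char32_t>(c);
            }
            m_cacheValid = true;
            ++m_conversions;
        }
        return m_decoded;
    }

    // Drop the decoded variant, e.g. when memory is tight. It is
    // rebuilt by the next call to decoded().
    void dropCache()
    {
        m_decoded.clear();
        m_cacheValid = false;
    }

    // Number of conversions actually performed (for demonstration).
    std::size_t conversions() const { return m_conversions; }

private:
    std::string m_encoded;
    mutable std::u32string m_decoded;
    mutable bool m_cacheValid = false;
    mutable std::size_t m_conversions = 0;
};
```

Calling decoded() several times in a row performs only one conversion,
and nothing is converted at all for a body that is only ever written
back out in its encoded form. A real implementation would also have to
invalidate the cache whenever the encoded data or the charset changes.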

All of this is just brainstorming. I didn't have a closer look at KMime
to see whether this is a feasible approach.


Regards,
Ingo

--nextPart1547773.n96Ss6qnlu
Content-Type: application/pgp-signature; name=signature.asc 
Content-Description: This is a digitally signed message part.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.16 (GNU/Linux)

iEYEABECAAYFAk7f0ZoACgkQGnR+RTDgudhzYACgyrJY707FG5uj4sn9uXOpvdZc
eW4AnRDFv9NM15S2UE6nINo31VsZNYpd
=45Jm
-----END PGP SIGNATURE-----

--nextPart1547773.n96Ss6qnlu--

--===============1760408112398616319==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
KDE PIM mailing list kde-pim@kde.org
https://mail.kde.org/mailman/listinfo/kde-pim
KDE PIM home page at http://pim.kde.org/
--===============1760408112398616319==--