From kde-pim Wed Dec 07 20:50:33 2011
From: Ingo Klöcker
Date: Wed, 07 Dec 2011 20:50:33 +0000
To: kde-pim
Subject: Re: [Kde-pim] Character encoding handling in KMail/KMime
Message-Id: <201112072150.34232 () thufir ! ingo-kloecker ! de>
X-MARC-Message: https://marc.info/?l=kde-pim&m=132329114704367
MIME-Version: 1
Content-Type: multipart/mixed; boundary="--===============1760408112398616319=="

--===============1760408112398616319==
Content-type: multipart/signed; boundary=nextPart1547773.n96Ss6qnlu; protocol="application/pgp-signature"; micalg=pgp-sha1
Content-transfer-encoding: 7bit

--nextPart1547773.n96Ss6qnlu
Content-Type: Text/Plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

On Tuesday 29 November 2011, Andras Mantia wrote:
> Ingo Klöcker wrote:
> > On Monday 28 November 2011, Andras Mantia wrote:
> >> Hi,
> >>
> >> on the weekend I fixed a problem introduced with the HTML reply
> >> branch, where the characters in the reply were wrongly encoded.
> >> While trying to get around the code and see where the strings
> >> get broken I realized that it is just insane what we do there
> >> with the character encodings. The messages are encoded and decoded
> >> often (and now I refer both to charset encoding and string
> >> encoding for transfer), while the message does not leave KMail
> >> (or its libraries, like composer, templateparser) code. This also
> >> includes of course a lot of QString->QByteArray and
> >> QByteArray->QString conversions. In my opinion there should be
> >> exactly two conversions: when the message is read into KMime and
> >> when it is written out from KMime (either to disk, sent through
> >> the network or in some other way). For all the rest it should be
> >> just unicode in a QString.
> >>
> >> Of course this is an intrusive change and KDE 5 (or whatever
> >> comes next) material. What do you think of it? Does it even make
> >> sense to go in this direction? Any better ideas?
> >
> > This is difficult to answer because it depends on what the
> > representation of the message is used for. Of course, it makes
> > sense to reduce the conversions as much as possible, but I don't
> > think that using a Unicode representation of the whole message
> > makes sense. Do you mean the message text?
>
> I was mainly referring to the message body, but user editable headers,
> like from/to/subject also have the same issue.
>
> > Actually, I think the message as a whole should never be serialized
> > as Unicode string because it is never needed in this form.
> > (Correct me if I'm wrong.) The message should either be in KMime
> > or it should be serialized into a QByteArray (for storage on disk
> > and sending). Apart from this only the body and the header content
> > of individual message parts should be converted to Unicode for
> > display, composing and other tasks where Unicode is really needed.
>
> I think inside the app those parts should be unicode. Here is what
> happens right now when you click on a message to view and reply to
> it:
> - the message viewer gets a KMime::Message (that has a
> QByteArray string and the encoding in the header). The body is
> converted to a unicode string before it is displayed, based on the
> encoding in the message or the override encoding specified in the
> settings.
> - when replying, the KMime::Message is passed together with a
> selection to the messagefactory to create the reply message. The
> message is in the original encoding, the selection is in unicode
> (comes from the viewer).
> - the message is passed to the template
> parser. This creates an object tree parser that, as in the case of
> the viewer, converts the message to unicode. It then creates the
> reply message content as a KMime::Content, where the reply (unicode)
> string is converted to the charset selected to be the default for
> the composer and put into the KMime::Content.
> - if "force original charset" is used, the message is converted
> back to unicode, the original charset applied, saved back into the
> message
> - the message is passed to the composer window: this creates
> an OTP, that again converts the message to unicode for displaying
> - I cannot find right now where this happens, but at one point the
> text from the editor (which is a QString AFAIK) is put back to the
> KMime::Message in an encoded form and finally sent through the
> network.

Yeah. That's mostly as it was in KMail1.

> This might make sense (individual parts always get a KMime::Message
> that is a real representation of an email, so it is not always
> unicode), but I find it to be:
> - suboptimal (too many conversions)
> - fragile. The encoding can go wrong in any place and it is hard to
> find out where this happened. The last bug was that the template
> parser saved back the data as unicode with a non-unicode encoding
> header. Then furthermore when this was converted to unicode, a
> double conversion was performed.

Well, the advantage is that there is a well-defined interface between
KMail's different components. This interface is KMime.

> My raw idea is that KMime itself stores every string as unicode and
> applies the encoding only through some special methods, like:
> - setBodyFromBytearray() - applies the encoding in the header and
> stores inside as unicode
> - decodedMessage() (or named something like that) - returns a
> QByteArray in the right encoding, specified in the header. This
> would return the whole assembled message.
> - we could have similar convenience methods to get the
> body/headers/message in the original encoding, similar to how we now
> have asUnicodeString() and as7BitString().
>
> So the difference is how KMime stores the message internally.
>
> I hope it is now clearer what my idea is. Do you still think
> this is bad?

Yes. It will require KMime to treat text/* parts differently from
non-text parts.
I think this will make KMime unnecessarily complex. Moreover, this would
require the content to be decoded each time the KMime-structure of a
message is created. This could easily become a more serious performance
problem than the current one.

OTOH, maybe it can be solved without making KMime (the API) much more
complex by making all entities available in an encoded variant and a
decoded variant both implementing the same interface. Or maybe both
variants are better hidden transparently behind the same interface using
composition. KMime could transparently switch between both internal
variants and keep both or only the most recently used variant in memory
depending on the general memory consumption. This is similar to your
idea, but avoids any unnecessary conversions because all conversions
would be done lazily on demand (as it's done now) but with additional
internal caching of the conversion results. The advantages of such a
solution are that the API stays as it is (so no code using it needs to
be changed) and that it can be implemented separately for each entity.

All of this is just brainstorming. I didn't have a closer look at KMime
to see whether this is a feasible approach.


Regards,
Ingo

--nextPart1547773.n96Ss6qnlu
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.16 (GNU/Linux)

iEYEABECAAYFAk7f0ZoACgkQGnR+RTDgudhzYACgyrJY707FG5uj4sn9uXOpvdZc
eW4AnRDFv9NM15S2UE6nINo31VsZNYpd
=45Jm
-----END PGP SIGNATURE-----

--nextPart1547773.n96Ss6qnlu--

--===============1760408112398616319==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
KDE PIM mailing list kde-pim@kde.org
https://mail.kde.org/mailman/listinfo/kde-pim
KDE PIM home page at http://pim.kde.org/

--===============1760408112398616319==--
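
The lazy conversion with internal caching that the reply above brainstorms can be sketched for a single text body as follows. This is only an illustration in Python, not KMime code: `LazyTextBody`, `text()`, `set_text()`, `encoded()` and the `conversions` counter are hypothetical names, and Python's built-in codecs stand in for whatever charset handling KMime would do between QByteArray and QString.

```python
from typing import Optional


class LazyTextBody:
    """Hypothetical text part that holds an encoded and a decoded variant.

    Invariant: at least one of the two variants is always present.
    """

    def __init__(self, raw: bytes, charset: str) -> None:
        self._charset = charset
        self._encoded: Optional[bytes] = raw  # bytes as read from disk/network
        self._decoded: Optional[str] = None   # Unicode text, filled in on demand
        self.conversions = 0                  # number of real charset conversions

    def text(self) -> str:
        """Return the Unicode text, decoding only on first use."""
        if self._decoded is None:
            self._decoded = self._encoded.decode(self._charset)
            self.conversions += 1
        return self._decoded

    def set_text(self, text: str) -> None:
        """Replace the body (e.g. from an editor) and drop the stale bytes."""
        self._decoded = text
        self._encoded = None

    def encoded(self) -> bytes:
        """Return the bytes in the declared charset, encoding only if stale."""
        if self._encoded is None:
            self._encoded = self._decoded.encode(self._charset)
            self.conversions += 1
        return self._encoded


body = LazyTextBody("Klöcker".encode("iso-8859-1"), "iso-8859-1")
body.text()
body.text()            # served from the cache, no second decode
wire = body.encoded()  # original bytes are still valid, no encode needed
```

Viewing, quoting and re-serialising an unmodified body costs one conversion here instead of one per component. Dropping one of the two cached variants under memory pressure, as the reply suggests, is left out of this sketch.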