From kopete-devel  Mon Jan 26 20:29:11 2004
From: Thiago Macieira <thiagom () mail ! com>
Date: Mon, 26 Jan 2004 20:29:11 +0000
To: kopete-devel
Subject: [Kopete-devel] [Bug 72917] UTF8 and other cause XML parsing errors,
Message-Id: <20040126202911.1612.qmail () ktown ! kde ! org>
X-MARC-Message: https://marc.info/?l=kopete-devel&m=107514895822960

http://bugs.kde.org/show_bug.cgi?id=72917      
thiagom@mail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thiagom@mail.com



------- Additional Comments From thiagom@mail.com  2004-01-26 21:29 -------
> I don't know KDE/QT that well. How come utf cannot fail. Utf-8 is designed
> so that it is unlikely that a non-utf-8 string can be recognized as utf-8. If
> the UTF-decoder cannot fail, then what does it do when it encounters an
> illegal sequence? 
 
That's actually the very reason we're getting this problem, Robin.

As of Qt 3.2.x, TrollTech changed its UTF-8 decoder in response to a bug report from us. The original problem was that files whose names or paths could not be encoded in the user's selected locale could neither be opened by KDE applications nor renamed in Konqueror. We had proposed a solution, but TrollTech chose instead to accept any input as valid UTF-8: when the decoder sees an invalid sequence, it encodes each offending byte as a pair of UTF-16 surrogates. The encoder then restores the original byte when converting back.

This makes ToUTF8(FromUTF8(any_string)) == any_string hold in every case. The side effect: Qt accepts Latin1 and other kinds of strings as valid UTF-8, but other programs do not (our XML parser being one of them).
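
To make the round trip concrete, here is a rough sketch in plain C++ (no Qt) of a lenient decoder and its matching encoder. It is not Qt's code. The function names and the exact escape scheme (high surrogate 0xDBFF followed by 0xDE00 + byte) are assumptions chosen for illustration; this message only says that a pair of surrogates is used.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumed escape scheme (illustrative only): an undecodable byte b is stored
// as the surrogate pair (0xDBFF, 0xDE00 + b).
static const uint16_t kEscapeHigh = 0xDBFF;
static const uint16_t kEscapeLowBase = 0xDE00;

// Lenient decoder: UTF-8 bytes -> UTF-16 code units. It never fails.
std::vector<uint16_t> fromUtf8Lenient(const std::string &in)
{
    static const uint32_t minCp[5] = {0, 0, 0x80, 0x800, 0x10000};
    std::vector<uint16_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        uint32_t cp = 0;
        std::size_t len = 0;
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and out-of-range values. Also
        // reject the code points whose UTF-16 form collides with the escape
        // pairs, so that the round trip stays exact.
        if (ok && (cp < minCp[len] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF) ||
                   (cp >= 0x10FE00 && cp <= 0x10FEFF)))
            ok = false;
        if (!ok) {
            out.push_back(kEscapeHigh);
            out.push_back(static_cast<uint16_t>(kEscapeLowBase + b));
            i += 1;  // escape one byte, then resynchronise on the next
            continue;
        }
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out.push_back(static_cast<uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<uint16_t>(cp));
        }
        i += len;
    }
    return out;
}

// Encoder: UTF-16 code units -> UTF-8 bytes; escape pairs become raw bytes.
std::string toUtf8(const std::vector<uint16_t> &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            const uint16_t lo = in[++i];
            if (cp == kEscapeHigh && lo >= kEscapeLowBase &&
                lo <= kEscapeLowBase + 0xFF) {
                out += static_cast<char>(lo - kEscapeLowBase);  // original byte
                continue;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this scheme a Latin1 string such as "caf\xE9" decodes without error and encodes back to the same four bytes, which is exactly why a "did decoding fail?" check can never trigger.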

> Perhaps Latin9 (ISO-8859-15) should be attempted instead of Latin1. The
> difference is that a few characters that were "never" used were replaced by
> some that actually are used. 

That's ok in principle, but not so from the technical point of view. The Latin1-to-Unicode conversion is very simple and fast, since all 256 Latin1 codepoints map 1:1 to the first 256 Unicode codepoints. For Latin9 and any other encoding, a non-trivial conversion through table lookups must be performed.
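
A small sketch of the difference, in plain C++ rather than Qt. The function names are made up for illustration. Latin9 happens to differ from Latin1 in only eight positions, so a switch is enough here; a general 8-bit codec would need a full 256-entry table.

```cpp
#include <string>

// Latin1: the byte value is the Unicode code point, so no lookup is needed.
std::u32string latin1ToUnicode(const std::string &in)
{
    std::u32string out;
    out.reserve(in.size());
    for (unsigned char b : in)
        out.push_back(b);
    return out;
}

// Latin9 (ISO-8859-15): eight positions differ from Latin1 and need a lookup.
char32_t latin9ToCodePoint(unsigned char b)
{
    switch (b) {
    case 0xA4: return 0x20AC; // EURO SIGN
    case 0xA6: return 0x0160; // LATIN CAPITAL LETTER S WITH CARON
    case 0xA8: return 0x0161; // LATIN SMALL LETTER S WITH CARON
    case 0xB4: return 0x017D; // LATIN CAPITAL LETTER Z WITH CARON
    case 0xB8: return 0x017E; // LATIN SMALL LETTER Z WITH CARON
    case 0xBC: return 0x0152; // LATIN CAPITAL LIGATURE OE
    case 0xBD: return 0x0153; // LATIN SMALL LIGATURE OE
    case 0xBE: return 0x0178; // LATIN CAPITAL LETTER Y WITH DIAERESIS
    default:   return b;      // everything else matches Latin1
    }
}

std::u32string latin9ToUnicode(const std::string &in)
{
    std::u32string out;
    out.reserve(in.size());
    for (unsigned char b : in)
        out.push_back(latin9ToCodePoint(b));
    return out;
}
```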

> if( userCodec == QTextCodec::codecForName("utf") ) 
 
Please don't write that. It forces QTextCodec to perform a codec lookup by name internally. Instead, use userCodec->mibEnum() == 106 to detect the UTF-8 codec.

A couple more opinions from me:
- trying UTF-8 before the user's locale:
Makes sense, since we may catch UTF-8 being used. The probability that valid text written in another encoding also happens to be valid UTF-8 is very low.

- the user-selected codec fails decoding:
Decode as Latin1, but let the user know about this fact (a non-intrusive warning or a "bug" icon like Konqueror's for JavaScript errors).

Important: KStringHandler::isUtf8 rejects control characters, including ASCII 3 used by mIRC-colouring in IRC.
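
The order proposed above (strict UTF-8 first, then the user's codec, then Latin1 with a warning) could be sketched roughly as follows. This is plain C++, not Kopete or Qt code; every name in it (decodeStrictUtf8, decodeMessage, Decoder, DecodeResult) is hypothetical. The validator here has to be a strict one written for the purpose, because Qt's own decoder never reports failure, and it deliberately lets control characters such as ASCII 3 through.

```cpp
#include <cstddef>
#include <functional>
#include <string>

enum class DecodedAs { Utf8, UserCodec, Latin1Fallback };

struct DecodeResult {
    std::u32string text;
    DecodedAs how;
    bool warnUser; // true when we had to guess Latin1
};

// Hypothetical stand-in for the user's codec: returns false if it cannot decode.
using Decoder = std::function<bool(const std::string &, std::u32string &)>;

// Strict UTF-8: rejects malformed, truncated and overlong sequences, but
// accepts control characters (e.g. ASCII 3 from mIRC colour codes).
bool decodeStrictUtf8(const std::string &in, std::u32string &out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp, min;
        std::size_t len;
        if (b < 0x80)                { cp = b;        len = 1; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; min = 0x10000; }
        else return false;
        if (i + len > in.size()) return false;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        out.push_back(cp);
        i += len;
    }
    return true;
}

DecodeResult decodeMessage(const std::string &raw, const Decoder &userCodec)
{
    DecodeResult r{std::u32string(), DecodedAs::Utf8, false};
    if (decodeStrictUtf8(raw, r.text))
        return r;
    if (userCodec && userCodec(raw, r.text)) {
        r.how = DecodedAs::UserCodec;
        return r;
    }
    // Last resort: Latin1 cannot fail, but the result is only a guess.
    r.text.clear();
    for (unsigned char b : raw)
        r.text.push_back(b);
    r.how = DecodedAs::Latin1Fallback;
    r.warnUser = true;
    return r;
}
```

The warnUser flag is where the non-intrusive warning or icon would hook in; how it is shown is left to the UI.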
_______________________________________________
Kopete-devel mailing list
Kopete-devel@kde.org
https://mail.kde.org/mailman/listinfo/kopete-devel