
List:       kopete-devel
Subject:    [Kopete-devel] [Bug 72917] UTF8 and other cause XML parsing errors,
From:       Thiago Macieira <thiagom () mail ! com>
Date:       2004-01-26 20:29:11
Message-ID: 20040126202911.1612.qmail () ktown ! kde ! org

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
      
http://bugs.kde.org/show_bug.cgi?id=72917      
thiagom@mail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thiagom@mail.com



------- Additional Comments From thiagom@mail.com  2004-01-26 21:29 -------
> I don't know KDE/QT that well. How come utf cannot fail. Utf-8 is designed
> so that it is unlikely that a non-utf-8 string can recognized as utf-8. If
> the UTF-decoder cannot fail, then what does it do when it encounters an
> illegal sequence? 
 
That's actually the very reason we're getting this problem, Robin.

In Qt 3.2.x, Trolltech modified its UTF-8 decoder in response to a bug report
from us. The original problem was that files whose names or paths could not be
encoded in the user's selected locale could be neither opened by KDE
applications nor renamed in Konqueror. We had proposed a solution, but Trolltech
chose instead to accept any input as valid UTF-8: when the decoder sees an
invalid sequence, it encodes the offending bytes as a pair of UTF-16 surrogates.
The encoder then restores the original bytes.

This makes ToUTF8(FromUTF8(any_string)) == any_string hold in every case. The
side effect is that Latin1 and other non-UTF-8 strings are accepted by Qt as
valid UTF-8, while other programs (our XML parser being one of them) reject
them.
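
To illustrate what "other programs don't accept them" means, here is a minimal
sketch of a strict UTF-8 check. It is neither Qt's code nor our XML parser's;
isStrictUtf8 is a hypothetical name and the sketch uses only the C++ standard
library. It rejects raw Latin1 bytes that a lenient decoder would let through.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true only if `s` is well-formed UTF-8
// (no overlong forms, no surrogates, nothing above U+10FFFF).
bool isStrictUtf8(const std::string &s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;   // continuation bytes expected after the lead byte
        unsigned long cp;    // code point being assembled
        unsigned long min;   // smallest code point allowed for this length
        if (c < 0x80)                { extra = 0; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;           // stray continuation or invalid lead byte

        if (i + extra >= n) return false;  // sequence truncated at end of input

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

With this check, "café" encoded as UTF-8 (bytes C3 A9 for the last letter)
passes, while the same word in Latin1 (a single byte E9) fails.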

> Perhaps Latin9 (ISO-8859-15) should be attempted instead of Latin1. The
> difference is that a few characters that were "never" used were replaces by
> some that actually are used. 

That's fine in principle, but less so from a technical point of view. The
Latin1-to-Unicode conversion is very simple and fast, since all 256 Latin1
code points map 1:1 to the first 256 Unicode code points. For Latin9 and any
other encoding, a non-trivial conversion through table lookups must be
performed.
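
A minimal sketch of the difference, using only the C++ standard library. The
function and struct names are made up for illustration and this is not how
QTextCodec is implemented. The eight overrides are the positions where
ISO-8859-15 differs from ISO-8859-1.

```cpp
// Latin1 (ISO-8859-1): the byte value is the Unicode code point.
unsigned int latin1ToUnicode(unsigned char byte)
{
    return byte;
}

// Latin9 (ISO-8859-15): needs a lookup table, built once on first use.
struct Latin9Table {
    unsigned short map[256];
    Latin9Table()
    {
        for (int i = 0; i < 256; ++i)
            map[i] = static_cast<unsigned short>(i);
        map[0xA4] = 0x20AC; // EURO SIGN
        map[0xA6] = 0x0160; // LATIN CAPITAL LETTER S WITH CARON
        map[0xA8] = 0x0161; // LATIN SMALL LETTER S WITH CARON
        map[0xB4] = 0x017D; // LATIN CAPITAL LETTER Z WITH CARON
        map[0xB8] = 0x017E; // LATIN SMALL LETTER Z WITH CARON
        map[0xBC] = 0x0152; // LATIN CAPITAL LIGATURE OE
        map[0xBD] = 0x0153; // LATIN SMALL LIGATURE OE
        map[0xBE] = 0x0178; // LATIN CAPITAL LETTER Y WITH DIAERESIS
    }
};

unsigned int latin9ToUnicode(unsigned char byte)
{
    static const Latin9Table table;
    return table.map[byte];
}
```

For Latin1 the conversion is a plain widening of each byte. For Latin9 every
byte goes through the table, even though only eight entries differ.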

> if( userCodec == QTextCodec::codecForName("utf") ) 
 
Please don't write that: it makes QTextCodec perform a codec lookup
internally. Instead, use userCodec->mibEnum() == 106 to detect the UTF-8
codec.

A couple more opinions from me:
- trying UTF-8 before the user's locale:
This makes sense, since it catches UTF-8 when it is in use. The probability
that text written in another encoding also happens to be valid UTF-8 is very
low.

- the user-selected codec fails decoding:
Decode as Latin1, but let the user know about this fact (a non-intrusive
warning or a "bug" icon like Konqueror's for JavaScript errors).
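
The order suggested in the two points above can be sketched as follows. This
is only an outline under assumptions: DecodedAs, Validator and chooseDecoding
are hypothetical names, and the two checks are passed in as plain function
pointers rather than real QTextCodec objects.

```cpp
#include <string>

// Which decoding the caller should apply to the raw bytes.
enum DecodedAs { AsUtf8, AsUserCodec, AsLatin1Fallback };

// A strict check: returns true if the bytes are valid in that encoding.
typedef bool (*Validator)(const std::string &);

DecodedAs chooseDecoding(const std::string &raw,
                         Validator strictUtf8Accepts,
                         Validator userCodecAccepts)
{
    if (strictUtf8Accepts(raw))
        return AsUtf8;           // UTF-8 first: false positives are unlikely
    if (userCodecAccepts(raw))
        return AsUserCodec;      // then the codec the user selected
    return AsLatin1Fallback;     // last resort; the caller should warn the user
}
```

The Latin1 fallback can never fail, so the caller always gets some text to
display, and the return value tells it when to show the warning.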

Important: KStringHandler::isUtf8 rejects control characters, including ASCII 3
used by mIRC-colouring in IRC.
_______________________________________________
Kopete-devel mailing list
Kopete-devel@kde.org
https://mail.kde.org/mailman/listinfo/kopete-devel

