
List:       kopete-devel
Subject:    [Kopete-devel] [Bug 72917] UTF8 and other cause XML parsing errors,
From:       Thiago Macieira <thiagom () mail ! com>
Date:       2004-01-26 20:29:11
Message-ID: 20040126202911.1612.qmail () ktown ! kde ! org

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
      
http://bugs.kde.org/show_bug.cgi?id=72917      
thiagom@mail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thiagom@mail.com



------- Additional Comments From thiagom@mail.com  2004-01-26 21:29 -------
> I don't know KDE/QT that well. How come utf cannot fail. Utf-8 is designed
> so that it is unlikely that a non-utf-8 string can recognized as utf-8. If
> the UTF-decoder cannot fail, then what does it do when it encounters an
> illegal sequence? 
 
That's actually the very reason we're getting this problem, Robin.

As of Qt 3.2.x, Trolltech changed its UTF-8 decoder in response to a bug
report from us. The original problem was that files whose names or paths
could not be encoded in the user's selected locale could neither be opened
by KDE applications nor renamed in Konqueror. We had proposed a solution,
but Trolltech chose instead to accept any input as valid UTF-8: when the
decoder sees an invalid sequence, it encodes the offending byte as a pair
of UTF-16 surrogates. The reverse conversion then restores the original
byte.

This makes ToUTF8(FromUTF8(any_string)) == any_string hold in every case.
The side effect: Latin1 and other non-UTF-8 strings are accepted by Qt as
valid UTF-8, but other programs (our XML parser being one of them) reject
them.
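To make the round trip concrete, here is a minimal sketch in plain C++ (no
Qt). The function names and the exact escape values (0xDBFF followed by
0xDE00 + byte) are assumptions for illustration and are not taken from Qt's
sources. For brevity it only decodes 1- to 3-byte sequences; anything else,
including 4-byte sequences, is escaped byte by byte, so the round trip still
holds.

```cpp
#include <cstddef>
#include <string>

// Assumed escape scheme (illustrative): an undecodable byte b is stored as
// the two UTF-16 code units 0xDBFF and 0xDE00 + b.
const char16_t kEscapeHigh = 0xDBFF;
const char16_t kEscapeLowBase = 0xDE00;

// Decode one well-formed 1- to 3-byte UTF-8 sequence starting at s[i].
// Returns the number of bytes consumed, or 0 if the bytes are not accepted.
static std::size_t decodeOne(const std::string& s, std::size_t i, char16_t& out) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) { out = b0; return 1; }
    // Payload bits of the k-th continuation byte, or -1 if missing/invalid.
    auto cont = [&](std::size_t k) -> int {
        if (i + k >= s.size()) return -1;
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        return (b & 0xC0) == 0x80 ? (b & 0x3F) : -1;
    };
    if (b0 >= 0xC2 && b0 <= 0xDF) {  // 2-byte form; 0xC0/0xC1 are overlong
        const int c1 = cont(1);
        if (c1 < 0) return 0;
        out = static_cast<char16_t>(((b0 & 0x1Fu) << 6) | static_cast<unsigned>(c1));
        return 2;
    }
    if (b0 >= 0xE0 && b0 <= 0xEF) {  // 3-byte form
        const int c1 = cont(1);
        const int c2 = cont(2);
        if (c1 < 0 || c2 < 0) return 0;
        const unsigned cp = ((b0 & 0x0Fu) << 12) |
                            (static_cast<unsigned>(c1) << 6) |
                            static_cast<unsigned>(c2);
        // Reject overlong forms and encoded surrogates.
        if (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
        out = static_cast<char16_t>(cp);
        return 3;
    }
    return 0;
}

// Never fails: bytes that do not decode are escaped instead of rejected.
std::u16string lenientFromUtf8(const std::string& bytes) {
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        char16_t unit = 0;
        const std::size_t n = decodeOne(bytes, i, unit);
        if (n == 0) {
            out.push_back(kEscapeHigh);
            out.push_back(static_cast<char16_t>(
                kEscapeLowBase + static_cast<unsigned char>(bytes[i])));
            i += 1;
        } else {
            out.push_back(unit);
            i += n;
        }
    }
    return out;
}

// Inverse of lenientFromUtf8: escape pairs become the original byte again.
// Only meant for strings produced by lenientFromUtf8.
std::string toUtf8(const std::u16string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t u = text[i];
        if (u == kEscapeHigh && i + 1 < text.size() &&
            text[i + 1] >= kEscapeLowBase && text[i + 1] <= kEscapeLowBase + 0xFF) {
            out.push_back(static_cast<char>(text[i + 1] - kEscapeLowBase));
            ++i;
        } else if (u < 0x80) {
            out.push_back(static_cast<char>(u));
        } else if (u < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (u >> 6)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (u >> 12)));
            out.push_back(static_cast<char>(0x80 | ((u >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        }
    }
    return out;
}
```

With this sketch, the Latin1 bytes "caf\xE9" decode without error to five
code units and re-encode to the same four bytes. The catch is the same one
described above: the caller never learns that the input was not UTF-8.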

> Perhaps Latin9 (ISO-8859-15) should be attempted instead of Latin1. The
> difference is that a few characters that were "never" used were replaces by
> some that actually are used. 

That's fine in principle, but less so from a technical point of view. The
Latin1-to-Unicode conversion is very simple and fast, since all 256 Latin1
code points map 1:1 to the first 256 Unicode code points. For Latin9 and
any other encoding, a non-trivial conversion through table lookups must be
performed.
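A rough sketch of the difference, in plain C++ rather than Qt (the function
names are made up for illustration). Latin1 needs only a widening cast per
byte, while Latin9 needs a lookup because eight positions differ from
Latin1:

```cpp
#include <array>
#include <cstddef>
#include <string>

// Latin1: each byte value already is the Unicode code point.
std::u16string latin1ToUnicode(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) out.push_back(static_cast<char16_t>(b));
    return out;
}

// Latin9 (ISO-8859-15): start from the identity mapping, then patch the
// eight positions that differ from Latin1.
static std::array<char16_t, 256> makeLatin9Table() {
    std::array<char16_t, 256> table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = static_cast<char16_t>(i);
    }
    table[0xA4] = 0x20AC;  // euro sign
    table[0xA6] = 0x0160;  // S with caron
    table[0xA8] = 0x0161;  // s with caron
    table[0xB4] = 0x017D;  // Z with caron
    table[0xB8] = 0x017E;  // z with caron
    table[0xBC] = 0x0152;  // OE ligature
    table[0xBD] = 0x0153;  // oe ligature
    table[0xBE] = 0x0178;  // Y with diaeresis
    return table;
}

std::u16string latin9ToUnicode(const std::string& bytes) {
    static const std::array<char16_t, 256> table = makeLatin9Table();
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) out.push_back(table[b]);
    return out;
}
```

For Latin9 the table is nearly the identity, but a general codec has to go
through a lookup like this for every byte.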

> if( userCodec == QTextCodec::codecForName("utf") ) 
 
Please don't write that. It forces QTextCodec to do a codec lookup
internally. Instead, use userCodec->mibEnum() == 106 to detect the UTF-8
codec.

A couple more opinions from me:
- trying UTF-8 before the user's locale:
Makes sense, since we may catch UTF-8 being used. The probability that
valid text written in another encoding also happens to be valid UTF-8 is
very low.

- the user-selected codec fails decoding:
Decode as Latin1, but let the user know about it (a non-intrusive warning
or a "bug" icon like Konqueror's for JavaScript errors).
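The order suggested in the two points above could be sketched like this in
plain C++. Everything here is hypothetical: the names, the Codec callback
standing in for the user-selected codec, and the strict validator, which
must be one that can actually fail. Unlike KStringHandler::isUtf8, this
validator accepts control characters such as 0x03.

```cpp
#include <cstddef>
#include <functional>
#include <string>

enum class DecodedAs { Utf8, UserCodec, Latin1Fallback };

struct Decoded {
    std::u32string text;
    DecodedAs how = DecodedAs::Latin1Fallback;
};

// Stand-in for the user-selected codec: returns false if it cannot decode.
using Codec = std::function<bool(const std::string&, std::u32string&)>;

// Strict UTF-8 decoding: rejects truncated, overlong and surrogate forms.
bool strictUtf8Decode(const std::string& s, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        char32_t cp = 0;
        char32_t minCp = 0;
        if (b0 < 0x80)                { extra = 0; cp = b0;        minCp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; minCp = 0x10000; }
        else return false;                         // stray continuation or bad lead byte
        if (s.size() - i <= extra) return false;   // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;                          // overlong, out of range, or surrogate
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}

// Try UTF-8 first, then the user's codec, then fall back to Latin1.
Decoded decodeIncoming(const std::string& bytes, const Codec& userCodec) {
    Decoded result;
    if (strictUtf8Decode(bytes, result.text)) {
        result.how = DecodedAs::Utf8;
        return result;
    }
    if (userCodec && userCodec(bytes, result.text)) {
        result.how = DecodedAs::UserCodec;
        return result;
    }
    // Latin1 cannot fail: every byte maps to the same code point.
    result.text.clear();
    for (unsigned char b : bytes) result.text.push_back(b);
    result.how = DecodedAs::Latin1Fallback;
    return result;
}
```

The caller would show the warning or icon whenever `how` comes back as
`DecodedAs::Latin1Fallback`.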

Important: KStringHandler::isUtf8 rejects control characters, including
ASCII 3, which mIRC uses for colour codes in IRC.
_______________________________________________
Kopete-devel mailing list
Kopete-devel@kde.org
https://mail.kde.org/mailman/listinfo/kopete-devel

