List: kopete-devel
Subject: Re: [Kopete-devel] [Bug 72917] UTF8 and other cause XML parsing
From: Thiago Macieira <thiago.macieira () kdemail ! net>
Date: 2004-01-23 19:38:44
Message-ID: 200401231738.51582.thiago.macieira () kdemail ! net
Martijn Klingens wrote:
>On Friday 23 January 2004 19:48, Thiago Macieira wrote:
>How about the heuristics I just suggested, but instead of using
> Jason's code as final resort like I proposed use ...
Your heuristics:
>- Decode as utf8. If isUtf8() is available, use it and continue if it fails.
>  Otherwise we have to assume it's utf8 and continue at the XSLT part below.
>
>- Decode as latin1.
>
>- If both utf8 and latin1 fail, try local8Bit IF AND ONLY IF the local
> encoding is neither utf8 nor latin1.
Latin 1 decoding can't fail, so this will never be reached.
>
>- When all these failed, use your code that replaces invalid chars with
>  question marks. Since we're doing it *here* that means the whole dreaded
>  'should never happen' XML error indeed no longer happens at all.
What are "invalid characters" here? Those I listed below?
>> - finally, clean up U+0000 to U+001F (excepting, maybe, newlines,
>> etc.) (note: no locale nor UTF-8 if preferred is given)
>
>... this and ...
>
>> Finally, instead of QChar('?'), I'd recommend QChar::replacement.
>
>... this?
Finally, I agree with you that the UTF-8 decoding has to be
special-cased, if it's one of the locale or preferred codecs.
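
To make the special-casing concrete, here is a minimal sketch, in plain standard C++ without Qt, of the kind of structural check a function such as isUtf8() has to perform. looksLikeUtf8 is a hypothetical name; this is not the KStringHandler implementation, only an illustration of rejecting input that is not well-formed UTF-8 before it reaches a decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Qt or kdelibs): returns true only if
// `bytes` is well-formed UTF-8. It walks the input once and allocates nothing.
bool looksLikeUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {          // plain ASCII byte
            ++i;
            continue;
        }

        std::size_t extra = 0;      // continuation bytes expected after `lead`
        unsigned long minCp = 0;    // smallest code point allowed at this length
        unsigned long cp = 0;       // code point being assembled
        if ((lead & 0xE0) == 0xC0)      { extra = 1; minCp = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; minCp = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; minCp = 0x10000; cp = lead & 0x07; }
        else
            return false;           // stray continuation byte or invalid lead byte

        if (n - i <= extra)
            return false;           // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;       // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        // reject overlong forms, UTF-16 surrogates and values past U+10FFFF
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += extra + 1;
    }
    return true;
}
```

With a check like this, a Latin-1 encoded "café" (final byte 0xE9) is rejected as UTF-8, so the caller can move on to the next codec instead of accepting a mis-decoded string.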
In any event, I believe Kopete as a whole would benefit from a global
function in libkopete that does the decoding and cleaning up of the
input.
It would be something like (based off Jason's suggestion):
QString KopeteMessage::decodeString(QCString input,
                                    QTextCodec *preferredCodec = 0L)
{
    QString result;

    if (preferredCodec)
        result = decodeStringCodec(input, preferredCodec);
    else
    {
        result = decodeStringCodec(input, QTextCodec::codecForName("utf-8"));
        // UTF-8 failed: try the locale codec next
        if (result.isNull())
            result = decodeStringCodec(input, QTextCodec::codecForLocale());
    }

    // fallback, only if nothing above succeeded
    // Latin-1 can't fail decoding
    if (result.isNull())
        result = decodeStringCodec(input,
                                   QTextCodec::codecForName("latin-1"));

    return result;
}

/*private*/
QString KopeteMessage::decodeStringCodec(QCString input, QTextCodec *codec)
{
    // sanity check
    if (codec == 0L)
        return QString::null;

    // special case: UTF-8
    // KStringHandler::isUtf8 doesn't allocate memory to validate the input,
    // whereas the code below (doing encoding and decoding) is much more
    // expensive.
    // Besides, this works around a Qt "misfeature" that allows the UTF-8
    // decoder to decode any arbitrary input.
    if (codec->mibEnum() == 106 && !KStringHandler::isUtf8(input))
        return QString::null;

    QString unicode = codec->toUnicode(input);
    if (input != codec->fromUnicode(unicode))
        // decoding didn't work
        return QString::null;

    // now clean up: replace control characters (U+0000 to U+001F),
    // except for the line breakers, with QChar::replacement
    for (uint i = 0; i < unicode.length(); i++)
        if (unicode[i] < ' ' &&
            unicode[i] != '\r' && unicode[i] != '\n')
            unicode[i] = QChar::replacement;

    return unicode;
}
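
Two points relied on above, that Latin-1 decoding cannot fail and that the cleanup only touches U+0000 to U+001F apart from CR and LF, can be checked in isolation. What follows is a minimal sketch in standard C++ without Qt. The names decodeLatin1 and replaceControlChars are hypothetical, std::u32string stands in for QString, and U+FFFD stands in for QChar::replacement.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: Latin-1 maps byte 0xNN to code point U+00NN,
// so every possible input decodes and there is no failure path.
std::u32string decodeLatin1(const std::string &bytes)
{
    std::u32string out;
    out.reserve(bytes.size());
    for (char c : bytes)
        out.push_back(static_cast<char32_t>(static_cast<unsigned char>(c)));
    return out;
}

// Hypothetical helper: replace control characters below U+0020,
// except CR and LF, with U+FFFD (the Unicode replacement character).
std::u32string replaceControlChars(std::u32string text)
{
    const char32_t kReplacement = 0xFFFD;
    for (char32_t &ch : text)
        if (ch < U' ' && ch != U'\r' && ch != U'\n')
            ch = kReplacement;
    return text;
}
```

Note that with the condition written this way a tab (U+0009) is replaced as well, since only '\r' and '\n' are exempt.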
--
Thiago Macieira - Registered Linux user #65028
thiagom (AT) mail (dot) com
ICQ UIN: 1967141 PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358
_______________________________________________
Kopete-devel mailing list
Kopete-devel@kde.org
https://mail.kde.org/mailman/listinfo/kopete-devel