List:       kopete-devel
Subject:    Re: [Kopete-devel] [Bug 72917] UTF8 and other cause XML parsing
From:       Thiago Macieira <thiago.macieira () kdemail ! net>
Date:       2004-01-23 19:38:44
Message-ID: 200401231738.51582.thiago.macieira () kdemail ! net


Martijn Klingens wrote:
>On Friday 23 January 2004 19:48, Thiago Macieira wrote:
>How about the heuristics I just suggested, but instead of using
> Jason's code as final resort like I proposed use ...

Your heuristics:
>- Decode as utf8. If isUtf8() is available, use it and continue if it
>  fails.
>  Otherwise we have to assume it's utf8 and continue at the XSLT part
>  below.
>
>- Decode as latin1.
>
>- If both utf8 and latin1 fail, try local8Bit IF AND ONLY IF the local
>  encoding is neither utf8 nor latin1.

Latin 1 decoding can't fail, so this will never be reached.
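
To illustrate why: Latin-1 assigns a character to every one of the 256
byte values, with byte N mapping to code point U+00NN, so a decoder has
no input it could reject. A minimal sketch in plain C++ rather than Qt
(decodeLatin1 is a hypothetical name, not an existing Qt or Kopete
function):

```cpp
#include <string>

// Decode ISO-8859-1 (Latin-1): byte value N becomes code point U+00NN.
// Every possible byte has a mapping, so there is no failure path.
std::u32string decodeLatin1(const std::string &bytes)
{
    std::u32string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)
        out.push_back(static_cast<char32_t>(b));
    return out;
}
```

The function has no error return at all, which is why any step placed
after a Latin-1 attempt is dead code.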

>
>- When all these failed, use your code that replaces invalid chars with
>  question marks. Since we're doing it *here* that means the whole
>  dreaded 'should never happen' XML error indeed no longer happens at all.

What are "invalid characters" here? Those I listed below?

>> - finally, clean up U+0000 to U+001F (excepting, maybe, newlines,
>> etc.) (note: no locale nor UTF-8 if preferred is given)
>
>... this and ...
>
>> Finally, instead of QChar('?'), I'd recommend QChar::replacement.
>
>... this?

Finally, I agree with you that UTF-8 decoding has to be special-cased
whenever UTF-8 is either the locale codec or the preferred codec.
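
For reference, the kind of check that special case relies on can be
sketched in plain C++. This is only an illustration of strict UTF-8
validation, not the actual implementation of KStringHandler::isUtf8;
looksLikeUtf8 is a hypothetical name:

```cpp
#include <cstddef>
#include <string>

// Returns true only if `bytes` is well-formed UTF-8: correct continuation
// bytes, no overlong forms, no surrogates, nothing above U+10FFFF.
bool looksLikeUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;   // number of continuation bytes expected
        unsigned long cp;    // code point being assembled
        unsigned long min;   // smallest code point allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;   // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra)
            return false;    // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;    // overlong, out of range, or surrogate
        i += extra + 1;
    }
    return true;
}
```

A Latin-1 string such as "caf\xE9" is rejected here because 0xE9 looks
like the start of a three-byte sequence that never arrives, which is
what lets the caller move on to the next codec.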

In any event, I believe Kopete as a whole would benefit from a global 
function in libkopete that does the decoding and cleaning up of the 
input.

It would be something like (based off Jason's suggestion):
QString KopeteMessage::decodeString(QCString input,
	QTextCodec *preferredCodec = 0L)
{
	QString result;

	if (preferredCodec)
		result = decodeStringCodec(input, preferredCodec);
	else
	{
		// try UTF-8 first; only if that fails, try the locale codec
		result = decodeStringCodec(input, QTextCodec::codecForName("utf-8"));
		if (result.isNull())
			result = decodeStringCodec(input, QTextCodec::codecForLocale());
	}

	// fallback, used only when everything above failed
	// Latin-1 can't fail decoding
	if (result.isNull())
		result = decodeStringCodec(input,
			QTextCodec::codecForName("latin-1"));

	return result;
}

/*private*/ 
QString KopeteMessage::decodeStringCodec(QCString input,
	QTextCodec *codec)
{
	// sanity check
	if (codec == 0L)
		return QString::null;

	// special case: UTF-8
	// KStringHandler::isUtf8 doesn't allocate memory to validate the input
	// whereas the code below (doing encoding and decoding) is much more
	// expensive
	// Besides, this works around a Qt "misfeature" that allows the UTF-8
	// decoder to decode any arbitrary input.
	if (codec->mibEnum() == 106 && !KStringHandler::isUtf8(input))
		return QString::null;

	QString unicode = codec->toUnicode(input);
	if (input != codec->fromUnicode(unicode))
		// decoding didn't work
		return QString::null;

	// now clean up: remove control characters (U+0000 to U+001F)
	// except for the line breakers
	for (uint i = 0; i < unicode.length(); i++)
		if (unicode[i] < ' ' && 
		    unicode[i] != '\r' && unicode[i] != '\n')
			unicode[i] = QChar::replacement;

	return unicode;
}
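
The clean-up step on its own, sketched in plain C++ so it can be tried
without Qt. scrubControlChars is a hypothetical name, and U+FFFD stands
in for QChar::replacement:

```cpp
#include <string>

// Replace C0 control characters (U+0000..U+001F), except CR and LF,
// with U+FFFD REPLACEMENT CHARACTER.
std::u32string scrubControlChars(std::u32string text)
{
    for (char32_t &ch : text)
        if (ch < U' ' && ch != U'\r' && ch != U'\n')
            ch = U'\uFFFD';
    return text;
}
```

Note that, like the loop above, this also replaces tabs. XML 1.0 permits
U+0009 alongside U+000A and U+000D, so tab could be added to the
exceptions if it should survive.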


-- 
  Thiago Macieira  -  Registered Linux user #65028
   thiagom (AT) mail (dot) com
    ICQ UIN: 1967141   PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358
