From kopete-devel  Mon Jan 26 15:18:02 2004
From: Robin Rosenberg <robin.rosenberg () dewire ! com>
Date: Mon, 26 Jan 2004 15:18:02 +0000
To: kopete-devel
Subject: Re: [Kopete-devel] [Bug 72917] UTF8 and other cause XML parsing
Message-Id: <200401261618.03003.robin.rosenberg () dewire ! com>
X-MARC-Message: https://marc.info/?l=kopete-devel&m=107513070529714

On Friday 23 January 2004 21:40, Jason Keirstead wrote:
> On January 23, 2004 4:20 pm, Martijn Klingens wrote:
[....]
> > - make sure that whenever Utf8 is being used isUtf8() is called first and
> > if it fails forget about using Utf8
> No. See, this is the problem. You are assuming that you should try UTF then if 
> UTF fails then you'll be able to guess something.
> This is backwards. UTF is the only codec that gives no failure; also it's the
> only one we have to scan over *twice* (isUTF8() and then conversion), so it's
> the most expensive. And on top of all this, hardly anyone uses it. So it's the
> most error prone, most expensive, and no one uses it. It *definitely* should
> be the last check.

I don't know KDE/Qt that well. How come UTF cannot fail? UTF-8 is designed so that
it is unlikely that a non-UTF-8 string can be recognized as UTF-8. If the UTF decoder cannot
fail, then what does it do when it encounters an illegal sequence?
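
To illustrate the point about illegal sequences, here is a minimal sketch in
Python (not Qt; the helper name is_utf8 is hypothetical and only mirrors the
isUtf8() check discussed above). It assumes a strict decoder that reports
malformed input, and contrasts it with a lenient one that substitutes U+FFFD
instead of failing:

```python
def is_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence.

    Hypothetical helper for illustration; not the Qt/KDE API.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The same text encoded two ways.
latin1_bytes = "smörgåsbord".encode("latin-1")
utf8_bytes = "smörgåsbord".encode("utf-8")

# Strict validation accepts the UTF-8 bytes and rejects the Latin-1 bytes,
# because the high bytes in the Latin-1 form do not form valid sequences.
utf8_ok = is_utf8(utf8_bytes)
latin1_ok = is_utf8(latin1_bytes)

# A lenient decoder never fails; it replaces each illegal sequence
# with U+FFFD (the replacement character).
lenient = latin1_bytes.decode("utf-8", errors="replace")
```

Pure ASCII input passes the check as well, since ASCII is a subset of UTF-8.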

On the other hand, how could an attempt to decode a string of bytes as ISO Latin-1 fail? A
human user can say that something isn't Latin-1, but the computer cannot unless we
add a user-specified blacklist, which is IMHO overkill.
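
A small sketch of why a Latin-1 decode cannot fail, again in Python rather
than Qt: every one of the 256 byte values maps to a code point, so the decoder
has nothing to reject. The variable names are illustrative only.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as Latin-1 succeeds for all of them; byte N becomes code point N.
decoded = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]

# Consequence: UTF-8 input decoded as Latin-1 "succeeds" too,
# but yields the wrong characters instead of an error.
mojibake = "ö".encode("utf-8").decode("latin-1")
```

So a Latin-1 step can only ever act as a fallback, not as a test.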

> > Not really. Generally contact lists tend to consist of people from mostly
> > the same country. 
> 
> Eh huh? Not from my experience... I have people from here, from Europe, from 
> Asia. Anyways, contact lists don't really have much to do with it, especially 
> on IRC. Anyone could message you from anywhere out of the blue.

I suppose experience can vary here. To me it's either ISO Latin-1 or ASCII that comes
over the wire. With ISO Latin-1 it's usually the same country, or countries that use the
same character set. Nevertheless, UTF-8 will become more and more common in the future.

> My new proposed ordering in pseudo code:

Sounds reasonable.

Perhaps Latin-9 (ISO-8859-15) should be attempted instead of Latin-1. The difference is that a few
characters that were "never" used were replaced by some that actually are used.
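
For reference, the differences can be listed with a short Python sketch that
uses Python's built-in "latin-1" and "iso8859-15" codecs (nothing
Kopete-specific) and compares the two tables byte by byte:

```python
# Collect every byte value that decodes differently in the two codecs.
differences = {}
for value in range(256):
    raw = bytes([value])
    latin1_char = raw.decode("latin-1")
    latin9_char = raw.decode("iso8859-15")
    if latin1_char != latin9_char:
        differences[value] = (latin1_char, latin9_char)

# Print the byte value, then the Latin-1 and Latin-9 characters.
for value, (old, new) in sorted(differences.items()):
    print(f"0x{value:02X}: {old!r} -> {new!r}")
```

Only eight positions differ; the best-known change is 0xA4, which is the
generic currency sign in Latin-1 and the euro sign in Latin-9.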

-- robin