'[Kopete-devel] [Bug 72917] UTF8 and other cause XML parsing errors,'


List:       kopete-devel
Subject:    [Kopete-devel] [Bug 72917] UTF8 and other cause XML parsing errors,
From:       Martijn Klingens <klingens () kde ! org>
Date:       2004-01-26 19:56:19
Message-ID: 20040126195619.14562.qmail () ktown ! kde ! org

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

http://bugs.kde.org/show_bug.cgi?id=72917      

------- Additional Comments From klingens@kde.org  2004-01-26 20:56 -------
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Monday 26 January 2004 15:08, Jason Keirstead wrote:
> > I want first and foremost to have accurate and autodetected conversion.
>
> This is impossible :P

True. But when using the right order (User Pref, Local Encoding, UTF-8, 
Latin1) at least you can make sure the chances of it failing are minimized.
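
To make the order concrete, here is a minimal sketch in Python rather than Qt. 
The function name decode_with_fallback and its parameters are made up for 
illustration and are not Kopete code. It tries each codec strictly and only 
falls through to Latin1 at the very end:

```python
def decode_with_fallback(raw, user_pref=None, local_encoding=None):
    """Try user pref, local encoding, then UTF-8; Latin1 is the last resort."""
    # Skip candidates that were not supplied.
    candidates = [enc for enc in (user_pref, local_encoding, "utf-8") if enc]
    for encoding in candidates:
        try:
            # Strict decoding raises on the first invalid byte sequence.
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Invalid data for this codec, or an unknown codec name.
            continue
    # Latin1 maps every byte to a character, so this step cannot fail.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback("héllo".encode("utf-8"),
                                  user_pref="ascii", local_encoding="ascii")
```

Note that if the user preference or the local encoding is itself Latin1, the 
loop returns on that step and UTF-8 is never tried.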

> > The user's setting should be TRIED first, but not FORCED. If it is broken
> > utf8 we know it will break the parser, it makes no sense to obey the user
> > at all.
>
> So you mean, if the user chose UTF8 then we check isUTF8,  and if it is
> not, then replace with ? characters wherever needed?

Close. First, I would use QChar::replacement like Thiago mentioned instead of 
'?'.
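
For illustration only, the same idea in Python (not the Qt API): U+FFFD is the 
Unicode replacement character, and a lossy UTF-8 decode inserts it where the 
input is invalid, so a real '?' in the text stays distinguishable from 
damage. The helper name decode_utf8_lossy is hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_utf8_lossy(raw):
    """Decode as UTF-8, substituting U+FFFD for each invalid sequence."""
    return raw.decode("utf-8", errors="replace")


# 0xE9 is Latin1 for 'é' but is not valid UTF-8 when followed by a space.
damaged = b"ol\xe9 ok"
text = decode_utf8_lossy(damaged)
```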

Second, instead of doing the replacement if isUtf8 fails I would use Thiago's 
order, which would mean that after a Utf-8 failure latin1 is used.

Arguably it would be better to try Utf-8 BEFORE the local encoding, because a 
Utf-8 failure can be detected, while a local encoding failure cannot be 
detected in all cases (like when the local encoding is Latin1).

> No. See, this is the problem. You are assuming that you should try UTF then
> if UTF fails then you'll be able to guess something.

Exactly.

> This is backwards. UTF is the only codec that gives no failure,

Yes. HOWEVER, Latin1 is even worse, because it CANNOT FAIL. Whatever you feed 
as Latin1, it is BY DEFINITION LEGAL. Thus, you can't do Utf-8 after Latin1, 
it _HAS_ to be done before Latin1.
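
A quick way to see this, sketched in Python instead of Qt (can_decode is a 
throwaway helper, not an existing API): feed every possible byte value to both 
codecs. Latin1 accepts all of them, while a strict UTF-8 decode rejects the 
input.

```python
def can_decode(raw, encoding):
    """Return True if raw is valid in the given encoding under strict decoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


all_bytes = bytes(range(256))  # every byte value from 0x00 to 0xFF

latin1_ok = can_decode(all_bytes, "latin-1")  # Latin1 defines all 256 values
utf8_ok = can_decode(all_bytes, "utf-8")      # 0x80 alone is already invalid
```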

> also it's the only one we have to scan over *twice (isUTF8() and then
> conversion ) so its the most expensive.

Like Thiago said, isUtf8() doesn't copy data and should be fairly inexpensive. 
Also, I would like to see figures for the additional load; I think it is in 
fact pretty much negligible for most uses. After all, QString is one of the 
most heavily optimized Qt classes. Do you have any KCacheGrind logs proving 
me wrong?

> And on top of all this, hardly anyone uses it.

More and more people are starting to use it, especially with ICQ, which also 
needs this code. And, again, Utf-8 HAS to be checked before Latin1, because 
after trying Latin1 you cannot POSSIBLY get a failure.

So whether it "should" be the last check for performance reasons or not, it 
CANNOT be the last check, no matter how much you'd want it.

> There's no point trying local8bit, it's bound to fail.

This too is wrong for most non-western locales. In fact, with ICQ in Russia it 
would be VERY IMPORTANT to have.
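
As a hedged example of why, assume a Russian contact whose client sends 
Windows-1251 (cp1251) text; the choice of codec and the try_decode helper are 
assumptions made for this sketch. The bytes are not valid Utf-8, Latin1 
"succeeds" but yields garbage, and only the local 8-bit codec gives the 
intended text:

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


raw = "Привет".encode("cp1251")  # bytes as a cp1251 client would send them

as_utf8 = try_decode(raw, "utf-8")     # detectable failure
as_latin1 = try_decode(raw, "latin-1")  # no failure, but wrong characters
as_local = try_decode(raw, "cp1251")    # the intended text
```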

> Eh huh? Not from my experience... I have people from here, from Europe,
> from Asia. Anyways, contact lists don't really have much to do with it,
> especially on IRC. Anyone could message you from anywhere out of the blue.

Try thinking outside the IRC box :) (With IRC I tend to agree with the people 
on channels being diverse, although many people I know are only on Dutch 
language IRC channels, and almost all people I know have exclusively Dutch 
people on their contact list. We open source people are quite a different 
breed from the average user base.)