[prev in list] [next in list] [prev in thread] [next in thread]
List: konversation-devel
Subject: Re: [Konversation-devel] Review Request 113823: Revamp decode pipeline to address several logic erro
From: "Bernd Buschinski" <b.buschinski () googlemail ! com>
Date: 2013-11-15 18:49:36
Message-ID: 20131115184936.6507.4497 () vidsolbach ! de
[Download RAW message or body]
[Attachment #2 (multipart/alternative)]
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/113823/#review43747
-----------------------------------------------------------
Ship it!
Ship It!
- Bernd Buschinski
On Nov. 12, 2013, 7:02 p.m., Eike Hein wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/113823/
> -----------------------------------------------------------
>
> (Updated Nov. 12, 2013, 7:02 p.m.)
>
>
> Review request for Konversation and Eli MacKenzie.
>
>
> Repository: konversation
>
>
> Description
> -------
>
> This one's basically for Eli, and there's no way to get around reading the full \
> argument (which at one point does get written up nicely), so I'll just dump the \
> log:
> <eddyb> Sho_: hi, are you on Konversation right now?
> <eddyb> does this look like three weird characters (ignore the quotes)? "?" - but \
> then there's this: ? <Sho_> eddyb: no
> <PovAddict> both look the same to me, some kind of dot in a wider-than-normal \
> character <eddyb> ï ?
> <Sho_> eddyb: http://simplest-image-hosting.net/png-0-im1498
> <eddyb> although it's pointless if you're not on Konversation, that's where the \
> bug shows up <Sho_> The screenshot is from Konversation
> <nemo> Sho_: he's using a build from a few months back apparently
> <eddyb> uhm, no, I said something else
> <nemo> oh?
> <PovAddict> now I did see three different characters
> <eddyb> PovAddict: my second message?
> <PovAddict> http://wstaw.org/m/2013/11/12/plasma-desktopVY4678.png
> Join dhenry_ (dhenry@nat/redhat/x-xuezzbbmjbcdcqsv) has joined this channel.
> Quit dhenry (dhenry@nat/redhat/x-oxfhrtejykuqifva) has left this server (Ping \
> timeout: 272 seconds). <eddyb> 'ï ?'.split('').map(function(x){return \
> x.charCodeAt().toString(16)}) <eddyb> ["ef", "bc", "8e"]
> <eddyb> the codepoints are equal to the bytes in the utf8 representation of that \
> fancier character (U+ff0e) <nemo> fancier character? :)
> Quit zanny (~zanny@64.9.41.220) has left this server (Quit: Konversation \
> terminated!). <eddyb> well, if it's visible at all
> Join Zanny (~zanny@64.9.41.220) has joined this channel.
> <eddyb> Sho_: I was in here back in January, but I'm not sure what happened with \
> this bug. I may be able to find the ranges (not many characters are affected, \
> IIRC), if you need them <Sho_> nemo: He's saying that for some reason U+ff0e \
> (which encoded to Utf-8 is 0xEF 0xBC 0x8E in hex notation) shows as three chars in \
> his Konvi Quit Zanny (~zanny@64.9.41.220) has left this server (Client Quit).
> Join zanny (~zanny@64.9.41.220) has joined this channel.
> Part zanny (~zanny@64.9.41.220) has left this channel.
> <nemo> Sho_: right
> <nemo> Sho_: I was just amused by "fancier"
> <nemo> Sho_: he says it happens if
> <nemo> ï ?
> <nemo> is on its own on a single line, or something
> * nemo shrugs
> <eddyb> Sho_: you don't get those three characters for the lines with only \
> U+ff0e? <PovAddict> [14:55] >> :nemo!nemo@c-68-50-78-21.hsd1.md.comcast.net \
> PRIVMSG #konversation :%EFC%8E%0A <PovAddict> argh
> <PovAddict> [14:55] >> :nemo!nemo@c-68-50-78-21.hsd1.md.comcast.net PRIVMSG \
> #konversation :%EFC%8E%0A <PovAddict> Sho_: what's the keystroke to avoid \
> formatting? ctrl-enter? >.> <Sho_> i think so
> <PovAddict> test
> <PovAddict> test
> <nemo> ew
> <Sho_> eddyb: yeah, i do ... mhh
> <eddyb> btw, my conclusions were that it happens for U+XY0f, where X != 0 and Y \
> has its top bit and at least another bit set Quit MadAGu \
> (~MadAGu@ppp005054075191.access.hol.gr) has left this server (Quit: Battle control \
> Terminated). Join zanny (~zanny@64.9.41.220) has joined this channel.
> <Sho_> I'm rebuilding with a modification, sec
>
> -cycle-
>
> Join You (~sho@kde/hein) have joined the channel #konversation.
> Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it at \
> http://konversation.kde.org || Get the latest sources: \
> http://userbase.kde.org/Konversation/Sources || Wiki: \
> http://userbase.kde.org/Konversation || Report bugs and wishes: \
> https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic The topic was \
> set by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL Channel URL: \
> http://konversation.kde.org/ Join MadAGu \
> (~MadAGu@ppp005054075191.access.hol.gr) has joined this channel. <Sho_> hit me \
> with the testcase, please <nemo> ï ?
> <Sho_> ok, so it's not the unicode sterilizer at least (didn't think so)
> <nemo> commented it out? :)
> <Sho_> early return
> Mode Channel modes: +nt
> Created This channel was created on 26.11.2006 07:42.
> <Sho_> so it must be in the decode pipeline, then
> <eddyb> Sho_: adding another character after it makes it show up proper, that \
> might be important <eddyb> and I'm not sure it works after a longer message ï ?
> <Sho_> it might simply be a failure of the "is this unicode?" heuristic
> <Sho_> if we can't determine we're dealing with unicode, we fall back to latin1
> <Sho_> since we need to deal with mixed-encoding channels
> <eddyb> I think I have it forcing utf8. maybe those characters are blacklisted
> <Sho_> you can't force it unless you modified the source
> <Sho_> brb
>
> -cycle-
>
> Join You (~sho@kde/hein) have joined the channel #konversation.
> Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it at \
> http://konversation.kde.org || Get the latest sources: \
> http://userbase.kde.org/Konversation/Sources || Wiki: \
> http://userbase.kde.org/Konversation || Report bugs and wishes: \
> https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic The topic was \
> set by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL Channel URL: \
> http://konversation.kde.org/ Notify zeigor is online (chat.freenode.org).
> Notify starbuck11 is online (chat.freenode.org).
> <Sho_> hit me
> <eddyb> ?
> <Sho_> there we go
> Mode Channel modes: +nt
> Created This channel was created on 26.11.2006 07:42.
> <Sho_> // if channel encoding is utf-8 and the string is definitely not \
> utf-8 <Sho_> // then try latin-1
> <Sho_> if (codec->mibEnum() == 106)
> <Sho_> codec = QTextCodec::codecForMib( 4 /* iso-8859-1 */ );
> <eddyb> ? ?????
> Quit zanny (~zanny@64.9.41.220) has left this server (Quit: Konversation \
> terminated!). <eddyb> no, those look fine. hmmmm was this fixed?
> Join DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has \
> joined this channel. <eddyb> (playing with another bug. this time, it's something \
> I love) Join zanny (~zanny@64.9.41.220) has joined this channel.
> <Sho_> what was the bug?
> Quit zanny (~zanny@64.9.41.220) has left this server (Client Quit).
> Join zanny (~zanny@64.9.41.220) has joined this channel.
> Quit zanny (~zanny@64.9.41.220) has left this server (Client Quit).
> Join movciari (~movciari@ip-89-102-11-29.net.upcbroadband.cz) has joined this \
> channel. <Sho_> ok, so we have this body of code deciding that 0xEF 0xBC 0x8E \
> can't be utf8: http://pastebin.kde.org/pcqljwaod/npqvb1 <eddyb> Sho_: \
> non-printable unicode ended up in large (like 5 lines tall) brackets and other \
> things <eddyb> Sho_: so http://pastebin.kde.org/pcqljwaod/npqvb1#line-45 ?
> Join atomik (~atomik@109.130.111.188) has joined this channel.
> Quit luiorpe1 (~luis@2001:720:101c:7003:222:43ff:fe44:10a0) has left this \
> server (Quit: Konversation terminated!). <eddyb> maybe it's off by one? I can't \
> tell <eddyb> Sho_: lol, I think all the trail bytes checks should use >= :P
> Quit deltra (~fedex@2001:1388:18c1:e687:e2db:55ff:fea5:2f47) has left this \
> server (Quit: Konversation terminated!). <eddyb> that means there's two byte \
> encodings which can do this. hmmm <Sho_> eddyb: that code was written by \
> Netscape, not us, in our defense ;) <Sho_> brb
>
> -cycle-
>
> Join You (~sho@5.28.125.146) have joined the channel #konversation.
> Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it at \
> http://konversation.kde.org || Get the latest sources: \
> http://userbase.kde.org/Konversation/Sources || Wiki: \
> http://userbase.kde.org/Konversation || Report bugs and wishes: \
> https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic The topic was set \
> by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL Channel URL: \
> http://konversation.kde.org/ Mode Channel modes: +nt
> Created This channel was created on 26.11.2006 07:42.
> <Testkonv> hit me please
> <eddyb> ï ?
> <Testkonv> it's the japanese shit
> <Testkonv> figures
> <eddyb> huh?
> <Testkonv> eddyb: it trips line 17
> <eddyb> and here I was, thinking I found an off-by-one
> <Testkonv> i don't see why there would be an off by one
> <Testkonv> the %0A isn't in there if you thought that
> Quit e2718 (~hdesk@p54849222.dip0.t-ipconnect.de) has left this server (Quit: \
> Konversation terminated!). <eddyb> oh I'm an idiot, the condition is inverted from \
> what I thought <Testkonv> no, the problem is that cartman thought it would be \
> really smart to make a "is this utf8?" function also do "is this a japanese codec?" \
> <eddyb> so the case where i + clen == len still works. okay then, time to dive in \
> the japanese detection, I guess <Testkonv> eddyb: the problem is that the japanese \
> detection is probably right <Testkonv> that byte string might be perfectly valid \
> K_JIS or K_SJIS <eddyb> but I want Unicode - and by forcing I meant I switched from \
> Default (which I now realized would do nothing, because Default is just an alias) \
> Quit DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has left \
> this server (Quit: Konversation terminated!). <Testkonv> yes
> Join DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has joined \
> this channel. <Testkonv> the problem is that our decode pipeline decision matrix is \
> the wrong way around <Testkonv> this is how it works right now:
> <Testkonv> 1. we determine if the byte array can be treated as utf-8, by way of a \
> function that checks if it could possibly *not* be utf-8 by way of sanity checking \
> and a japanese codec guesser <Testkonv> 2. if we think it can be treated as utf-8, \
> we treat it as utf-8 <Testkonv> 3. we determine it can't be treated as utf-8, we \
> look at the user setting <Testkonv> 4. if the user setting is utf-8, we fall back \
> to latin-1 since we already ruled out utf-8 <Testkonv> so basically, imho it should \
> do this instead: <Testkonv> 1. always use the user-setting, unless it's utf-8 and \
> our sanity-checker says it can't be utf-8 <Testkonv> 2. if our sanity-checker says \
> it can't be utf-8, try to guess a japanese codec. if we guess one, use that \
> <Testkonv> 3. otherwise fall back to latin 1 <Testkonv> that encompasses the same \
> feature set - handling of mixed-encoding channels and extra love for japanese \
> chatters - but respects user settings and avoids this clash between the \
> wrongly-used utf8 sanity checker and the japanese codec guesser <Testkonv> now, \
> it's pretty clear why the code wound up like it did <Testkonv> there was a desire \
> to optimize for performance <Testkonv> the idea was "if we can tell this is utf-8, \
> let's not re-encode to utf8" <Testkonv> but that was premature optimization and \
> wromgly implemented, because "is there any reason this can't be utf8?" isn't the \
> same as "this is utf8!" <Testkonv> did everyone fall from their chairs?
> Quit joaoc (~joaoc@131.227.207.40) has left this server (Quit: Konversation \
> terminated!). <Testkonv> eddyb: "Default" for a *channel* actually means "Use \
> what's set in the Identity" btw <eddyb> logic fallacies <3
> <Testkonv> morel ike "premature optimization <3"
> <eddyb> (I was thinking of the part where it's not the same as "this is utf8!")
> <Testkonv> whoever wrote the original code was trying to hard to be able to use \
> QString::fromUtf8() when he thought he could determine it's utf8 <Testkonv> but \
> that's dumb <Testkonv> because the sanity checker does much the same work as the \
> stream encoder anyway <Testkonv> (ok, a ltitle less, sure)
> <Testkonv> alrighty
> <Testkonv> what i'm going to now is put the above into code
> <eddyb> Testkonv: thanks a lot :D
> <Testkonv> then i'll dump that on reviewboard with a log of this chat, because \
> argnel needs to look at it and he's at work <PovAddict> and this is why \
> Konversation is a high-quality award-winning app <Testkonv> heh :)
> Join itaylor57 (~itaylor@pool-71-170-156-139.dllstx.fios.verizon.net) has joined \
> this channel. <psn> Testkonv: well actually the original author is pretty sure \
> the code has changed over the years ;) <psn> not that I can remember exactly what \
> the original code did... all I know is that there sure was no check for japanese \
> codecs :) <Testkonv> psn: nice to see you still lurk about ;)
> Part CounterPillow (~fratti@unaffiliated/counterpillow) has left this channel \
> ("Later folks!"). <psn> that's what I do best... lurking that is :)
> Join genstorm (~andreas@80-121-86-191.adsl.highway.telekom.at) has joined this \
> channel. <Testkonv> ok, running a build
> <Testkonv> eddyb: stand by for testing
> <Testkonv> oh ... wait, compile error
> Join luiorpe1 (~luiorpe1@84.126.68.164.dyn.user.ono.com) has joined this \
> channel. <Testkonv> there we go
>
> -cycle-
>
> Join You (~sho@5.28.125.146) have joined the channel #konversation.
> Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it at \
> http://konversation.kde.org || Get the latest sources: \
> http://userbase.kde.org/Konversation/Sources || Wiki: \
> http://userbase.kde.org/Konversation || Report bugs and wishes: \
> https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic The topic was set \
> by Sho_!~sho@kde/hein on 17.03.2013 03:18. Mode Channel modes: +nt
> Created This channel was created on 26.11.2006 07:42.
> URL Channel URL: http://konversation.kde.org/
> <Testkonv> hit me
> <Testkonv> PovAddict, you maybe?
> <PovAddict> I don't know what I'm supposed to send to trigger it
> <PovAddict> ?
> <PovAddict> that?
> <eddyb> ?
> <eddyb> (to be sure :P)
> <Testkonv> works fine
> Quit mischa (~mischa@84-112-236-94.dynamic.surfer.at) has left this server \
> (Quit: Konversation terminated!). <Testkonv> as an added bonus, we're no longer \
> instanciating and destroying the japanese codec guesser for EVERY LINE <nemo> :)
> <nemo> huh. why is that not a singleton?
> <nemo> seems like the kind of thing that would be a standalone function
> <nemo> wonder what the codec guesser instantiates
> <Testkonv> nemo: it's a static member now
>
>
> Diffs
> -----
>
> src/common.cpp bd84077
> src/irc/server.h be9d72d
> src/irc/server.cpp abb42c2
> src/unicode.cpp 6407b19
>
> Diff: http://git.reviewboard.kde.org/r/113823/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Eike Hein
>
>
[Attachment #5 (text/html)]
<html>
<body>
<div style="font-family: Verdana, Arial, Helvetica, Sans-Serif;">
<table bgcolor="#f9f3c9" width="100%" cellpadding="8" style="border: 1px #c9c399 \
solid;"> <tr>
<td>
This is an automatically generated e-mail. To reply, visit:
<a href="http://git.reviewboard.kde.org/r/113823/">http://git.reviewboard.kde.org/r/113823/</a>
</td>
</tr>
</table>
<br />
<p>Ship it!</p>
<pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: \
-pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">Ship It!</pre> <br />
<p>- Bernd Buschinski</p>
<br />
<p>On November 12th, 2013, 7:02 p.m. UTC, Eike Hein wrote:</p>
<table bgcolor="#fefadf" width="100%" cellspacing="0" cellpadding="8" \
style="background-image: \
url('http://git.reviewboard.kde.org/static/rb/images/review_request_box_top_bg.ab6f3b1072c9.png'); \
background-position: left top; background-repeat: repeat-x; border: 1px black \
solid;"> <tr>
<td>
<div>Review request for Konversation and Eli MacKenzie.</div>
<div>By Eike Hein.</div>
<p style="color: grey;"><i>Updated Nov. 12, 2013, 7:02 p.m.</i></p>
<div style="margin-top: 1.5em;">
<b style="color: #575012; font-size: 10pt;">Repository: </b>
konversation
</div>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Description </h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" \
style="border: 1px solid #b8b5a0"> <tr>
<td>
<pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: \
-moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: \
break-word;">This one's basically for Eli, and there's no way to get around \
reading the full argument (which at one point does get written up nicely), so \
I'll just dump the log:
<eddyb> Sho_: hi, are you on Konversation right now?
<eddyb> does this look like three weird characters (ignore the quotes)? \
"?" - but then there's this: ? <Sho_> eddyb: no
<PovAddict> both look the same to me, some kind of dot in a wider-than-normal \
character <eddyb> ï ?
<Sho_> eddyb: http://simplest-image-hosting.net/png-0-im1498
<eddyb> although it's pointless if you're not on Konversation, \
that's where the bug shows up <Sho_> The screenshot is from Konversation
<nemo> Sho_: he's using a build from a few months back apparently
<eddyb> uhm, no, I said something else
<nemo> oh?
<PovAddict> now I did see three different characters
<eddyb> PovAddict: my second message?
<PovAddict> http://wstaw.org/m/2013/11/12/plasma-desktopVY4678.png
Join dhenry_ (dhenry@nat/redhat/x-xuezzbbmjbcdcqsv) has joined this channel.
Quit dhenry (dhenry@nat/redhat/x-oxfhrtejykuqifva) has left this server (Ping \
timeout: 272 seconds). <eddyb> 'ï \
?'.split('').map(function(x){return x.charCodeAt().toString(16)}) \
<eddyb> ["ef", "bc", "8e"] <eddyb> the \
codepoints are equal to the bytes in the utf8 representation of that fancier \
character (U+ff0e) <nemo> fancier character? :)
Quit zanny (~zanny@64.9.41.220) has left this server (Quit: Konversation \
terminated!). <eddyb> well, if it's visible at all
Join Zanny (~zanny@64.9.41.220) has joined this channel.
<eddyb> Sho_: I was in here back in January, but I'm not sure what \
happened with this bug. I may be able to find the ranges (not many characters are \
affected, IIRC), if you need them <Sho_> nemo: He's saying that for some \
reason U+ff0e (which encoded to Utf-8 is 0xEF 0xBC 0x8E in hex notation) shows as \
three chars in his Konvi Quit Zanny (~zanny@64.9.41.220) has left this server \
(Client Quit). Join zanny (~zanny@64.9.41.220) has joined this channel.
Part zanny (~zanny@64.9.41.220) has left this channel.
<nemo> Sho_: right
<nemo> Sho_: I was just amused by "fancier"
<nemo> Sho_: he says it happens if
<nemo> ï ?
<nemo> is on its own on a single line, or something
* nemo shrugs
<eddyb> Sho_: you don't get those three characters for the lines with \
only U+ff0e? <PovAddict> [14:55] >> \
:nemo!nemo@c-68-50-78-21.hsd1.md.comcast.net PRIVMSG #konversation :%EFC%8E%0A \
<PovAddict> argh <PovAddict> [14:55] >> \
:nemo!nemo@c-68-50-78-21.hsd1.md.comcast.net PRIVMSG #konversation :%EFC%8E%0A \
<PovAddict> Sho_: what's the keystroke to avoid formatting? ctrl-enter? \
>.> <Sho_> i think so
<PovAddict> test
<PovAddict> test
<nemo> ew
<Sho_> eddyb: yeah, i do ... mhh
<eddyb> btw, my conclusions were that it happens for U+XY0f, where X != 0 and \
Y has its top bit and at least another bit set Quit MadAGu \
(~MadAGu@ppp005054075191.access.hol.gr) has left this server (Quit: Battle control \
Terminated). Join zanny (~zanny@64.9.41.220) has joined this channel.
<Sho_> I'm rebuilding with a modification, sec
-cycle-
Join You (~sho@kde/hein) have joined the channel #konversation.
Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it \
at http://konversation.kde.org || Get the latest sources: \
http://userbase.kde.org/Konversation/Sources || Wiki: \
http://userbase.kde.org/Konversation || Report bugs and wishes: \
https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic The topic \
was set by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL Channel URL: \
http://konversation.kde.org/ Join MadAGu (~MadAGu@ppp005054075191.access.hol.gr) \
has joined this channel. <Sho_> hit me with the testcase, please
<nemo> ï ?
<Sho_> ok, so it's not the unicode sterilizer at least (didn't think \
so) <nemo> commented it out? :)
<Sho_> early return
Mode Channel modes: +nt
Created This channel was created on 26.11.2006 07:42.
<Sho_> so it must be in the decode pipeline, then
<eddyb> Sho_: adding another character after it makes it show up proper, that \
might be important <eddyb> and I'm not sure it works after a longer \
message ï ? <Sho_> it might simply be a failure of the "is this \
unicode?" heuristic <Sho_> if we can't determine we're dealing \
with unicode, we fall back to latin1 <Sho_> since we need to deal with \
mixed-encoding channels <eddyb> I think I have it forcing utf8. maybe those \
characters are blacklisted <Sho_> you can't force it unless you modified \
the source <Sho_> brb
-cycle-
Join You (~sho@kde/hein) have joined the channel #konversation.
Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it \
at http://konversation.kde.org || Get the latest sources: \
http://userbase.kde.org/Konversation/Sources || Wiki: \
http://userbase.kde.org/Konversation || Report bugs and wishes: \
https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic The topic \
was set by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL Channel URL: \
http://konversation.kde.org/ Notify zeigor is online (chat.freenode.org).
Notify starbuck11 is online (chat.freenode.org).
<Sho_> hit me
<eddyb> ?
<Sho_> there we go
Mode Channel modes: +nt
Created This channel was created on 26.11.2006 07:42.
<Sho_> // if channel encoding is utf-8 and the string is definitely \
not utf-8 <Sho_> // then try latin-1
<Sho_> if (codec->mibEnum() == 106)
<Sho_> codec = QTextCodec::codecForMib( 4 /* iso-8859-1 */ );
<eddyb> ? ?????
Quit zanny (~zanny@64.9.41.220) has left this server (Quit: Konversation \
terminated!). <eddyb> no, those look fine. hmmmm was this fixed?
Join DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has joined \
this channel. <eddyb> (playing with another bug. this time, it's \
something I love) Join zanny (~zanny@64.9.41.220) has joined this channel.
<Sho_> what was the bug?
Quit zanny (~zanny@64.9.41.220) has left this server (Client Quit).
Join zanny (~zanny@64.9.41.220) has joined this channel.
Quit zanny (~zanny@64.9.41.220) has left this server (Client Quit).
Join movciari (~movciari@ip-89-102-11-29.net.upcbroadband.cz) has joined this \
channel. <Sho_> ok, so we have this body of code deciding that 0xEF 0xBC \
0x8E can't be utf8: http://pastebin.kde.org/pcqljwaod/npqvb1 <eddyb> \
Sho_: non-printable unicode ended up in large (like 5 lines tall) brackets and other \
things <eddyb> Sho_: so http://pastebin.kde.org/pcqljwaod/npqvb1#line-45 ?
Join atomik (~atomik@109.130.111.188) has joined this channel.
Quit luiorpe1 (~luis@2001:720:101c:7003:222:43ff:fe44:10a0) has left this server \
(Quit: Konversation terminated!). <eddyb> maybe it's off by one? I \
can't tell <eddyb> Sho_: lol, I think all the trail bytes checks should \
use >= :P Quit deltra (~fedex@2001:1388:18c1:e687:e2db:55ff:fea5:2f47) has \
left this server (Quit: Konversation terminated!). <eddyb> that means \
there's two byte encodings which can do this. hmmm <Sho_> eddyb: that \
code was written by Netscape, not us, in our defense ;) <Sho_> brb
-cycle-
Join You (~sho@5.28.125.146) have joined the channel #konversation.
Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it at \
http://konversation.kde.org || Get the latest sources: \
http://userbase.kde.org/Konversation/Sources || Wiki: \
http://userbase.kde.org/Konversation || Report bugs and wishes: \
https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic The topic was \
set by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL Channel URL: \
http://konversation.kde.org/ Mode Channel modes: +nt
Created This channel was created on 26.11.2006 07:42.
<Testkonv> hit me please
<eddyb> ï ?
<Testkonv> it's the japanese shit
<Testkonv> figures
<eddyb> huh?
<Testkonv> eddyb: it trips line 17
<eddyb> and here I was, thinking I found an off-by-one
<Testkonv> i don't see why there would be an off by one
<Testkonv> the %0A isn't in there if you thought that
Quit e2718 (~hdesk@p54849222.dip0.t-ipconnect.de) has left this server (Quit: \
Konversation terminated!). <eddyb> oh I'm an idiot, the condition is \
inverted from what I thought <Testkonv> no, the problem is that cartman thought \
it would be really smart to make a "is this utf8?" function also do \
"is this a japanese codec?" <eddyb> so the case where i + clen == len \
still works. okay then, time to dive in the japanese detection, I guess \
<Testkonv> eddyb: the problem is that the japanese detection is probably right \
<Testkonv> that byte string might be perfectly valid K_JIS or K_SJIS \
<eddyb> but I want Unicode - and by forcing I meant I switched from Default \
(which I now realized would do nothing, because Default is just an alias) Quit \
DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has left this server \
(Quit: Konversation terminated!). <Testkonv> yes
Join DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has joined \
this channel. <Testkonv> the problem is that our decode pipeline decision \
matrix is the wrong way around <Testkonv> this is how it works right now:
<Testkonv> 1. we determine if the byte array can be treated as utf-8, by way of \
a function that checks if it could possibly *not* be utf-8 by way of sanity checking \
and a japanese codec guesser <Testkonv> 2. if we think it can be treated as \
utf-8, we treat it as utf-8 <Testkonv> 3. we determine it can't be treated \
as utf-8, we look at the user setting <Testkonv> 4. if the user setting is \
utf-8, we fall back to latin-1 since we already ruled out utf-8 <Testkonv> so \
basically, imho it should do this instead: <Testkonv> 1. always use the \
user-setting, unless it's utf-8 and our sanity-checker says it can't be utf-8 \
<Testkonv> 2. if our sanity-checker says it can't be utf-8, try to guess a \
japanese codec. if we guess one, use that <Testkonv> 3. otherwise fall back to \
latin 1 <Testkonv> that encompasses the same feature set - handling of \
mixed-encoding channels and extra love for japanese chatters - but respects user \
settings and avoids this clash between the wrongly-used utf8 sanity checker and the \
japanese codec guesser <Testkonv> now, it's pretty clear why the code wound \
up like it did <Testkonv> there was a desire to optimize for performance
<Testkonv> the idea was "if we can tell this is utf-8, let's not \
re-encode to utf8" <Testkonv> but that was premature optimization and \
wromgly implemented, because "is there any reason this can't be utf8?" \
isn't the same as "this is utf8!" <Testkonv> did everyone fall \
from their chairs? Quit joaoc (~joaoc@131.227.207.40) has left this server (Quit: \
Konversation terminated!). <Testkonv> eddyb: "Default" for a \
*channel* actually means "Use what's set in the Identity" btw \
<eddyb> logic fallacies <3 <Testkonv> morel ike "premature \
optimization <3" <eddyb> (I was thinking of the part where it's not \
the same as "this is utf8!") <Testkonv> whoever wrote the original \
code was trying to hard to be able to use QString::fromUtf8() when he thought he \
could determine it's utf8 <Testkonv> but that's dumb
<Testkonv> because the sanity checker does much the same work as the stream \
encoder anyway <Testkonv> (ok, a ltitle less, sure)
<Testkonv> alrighty
<Testkonv> what i'm going to now is put the above into code
<eddyb> Testkonv: thanks a lot :D
<Testkonv> then i'll dump that on reviewboard with a log of this chat, \
because argnel needs to look at it and he's at work <PovAddict> and \
this is why Konversation is a high-quality award-winning app <Testkonv> heh :)
Join itaylor57 (~itaylor@pool-71-170-156-139.dllstx.fios.verizon.net) has joined \
this channel. <psn> Testkonv: well actually the original author is pretty \
sure the code has changed over the years ;) <psn> not that I can remember \
exactly what the original code did... all I know is that there sure was no check for \
japanese codecs :) <Testkonv> psn: nice to see you still lurk about ;)
Part CounterPillow (~fratti@unaffiliated/counterpillow) has left this channel \
("Later folks!"). <psn> that's what I do best... lurking that \
is :) Join genstorm (~andreas@80-121-86-191.adsl.highway.telekom.at) has joined \
this channel. <Testkonv> ok, running a build
<Testkonv> eddyb: stand by for testing
<Testkonv> oh ... wait, compile error
Join luiorpe1 (~luiorpe1@84.126.68.164.dyn.user.ono.com) has joined this channel.
<Testkonv> there we go
-cycle-
Join You (~sho@5.28.125.146) have joined the channel #konversation.
Topic The channel topic is "Konversation 1.5-rc1 has been released! Get it at \
http://konversation.kde.org || Get the latest sources: \
http://userbase.kde.org/Konversation/Sources || Wiki: \
http://userbase.kde.org/Konversation || Report bugs and wishes: \
https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic The topic was \
set by Sho_!~sho@kde/hein on 17.03.2013 03:18. Mode Channel modes: +nt
Created This channel was created on 26.11.2006 07:42.
URL Channel URL: http://konversation.kde.org/
<Testkonv> hit me
<Testkonv> PovAddict, you maybe?
<PovAddict> I don't know what I'm supposed to send to trigger it
<PovAddict> ?
<PovAddict> that?
<eddyb> ?
<eddyb> (to be sure :P)
<Testkonv> works fine
Quit mischa (~mischa@84-112-236-94.dynamic.surfer.at) has left this server (Quit: \
Konversation terminated!). <Testkonv> as an added bonus, we're no longer \
instanciating and destroying the japanese codec guesser for EVERY LINE <nemo> \
:) <nemo> huh. why is that not a singleton?
<nemo> seems like the kind of thing that would be a standalone function
<nemo> wonder what the codec guesser instantiates
<Testkonv> nemo: it's a static member now
</pre>
</td>
</tr>
</table>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Diffs</b> </h1>
<ul style="margin-left: 3em; padding-left: 0;">
<li>src/common.cpp <span style="color: grey">(bd84077)</span></li>
<li>src/irc/server.h <span style="color: grey">(be9d72d)</span></li>
<li>src/irc/server.cpp <span style="color: grey">(abb42c2)</span></li>
<li>src/unicode.cpp <span style="color: grey">(6407b19)</span></li>
</ul>
<p><a href="http://git.reviewboard.kde.org/r/113823/diff/" style="margin-left: \
3em;">View Diff</a></p>
</td>
</tr>
</table>
</div>
</body>
</html>
_______________________________________________
Konversation-devel mailing list
Konversation-devel@kde.org
https://mail.kde.org/mailman/listinfo/konversation-devel
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic