[prev in list] [next in list] [prev in thread] [next in thread] 

List:       konversation-devel
Subject:    Re: [Konversation-devel] Review Request 113823: Revamp decode pipeline to address several logic erro
From:       "Bernd Buschinski" <b.buschinski () googlemail ! com>
Date:       2013-11-15 18:49:36
Message-ID: 20131115184936.6507.4497 () vidsolbach ! de
[Download RAW message or body]

[Attachment #2 (multipart/alternative)]


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/113823/#review43747
-----------------------------------------------------------

Ship it!


Ship It!

- Bernd Buschinski


On Nov. 12, 2013, 7:02 p.m., Eike Hein wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/113823/
> -----------------------------------------------------------
> 
> (Updated Nov. 12, 2013, 7:02 p.m.)
> 
> 
> Review request for Konversation and Eli MacKenzie.
> 
> 
> Repository: konversation
> 
> 
> Description
> -------
> 
> This one's basically for Eli, and there's no way to get around reading the full \
> argument (which at one point does get written up nicely), so I'll just dump the \
> log: 
> <eddyb>   Sho_: hi, are you on Konversation right now?
> <eddyb>   does this look like three weird characters (ignore the quotes)? "?" - but \
> then there's this: ? <Sho_>    eddyb: no
> <PovAddict>  both look the same to me, some kind of dot in a wider-than-normal \
> character <eddyb>   ï ?
> <Sho_>    eddyb: http://simplest-image-hosting.net/png-0-im1498
> <eddyb>   although it's pointless if you're not on Konversation, that's where the \
> bug shows up <Sho_>    The screenshot is from Konversation
> <nemo>    Sho_: he's using a build from a few months back apparently
> <eddyb>   uhm, no, I said something else
> <nemo>    oh?
> <PovAddict>  now I did see three different characters
> <eddyb>   PovAddict: my second message?
> <PovAddict>  http://wstaw.org/m/2013/11/12/plasma-desktopVY4678.png
> Join      dhenry_ (dhenry@nat/redhat/x-xuezzbbmjbcdcqsv) has joined this channel.
> Quit      dhenry (dhenry@nat/redhat/x-oxfhrtejykuqifva) has left this server (Ping \
> timeout: 272 seconds). <eddyb>   'ï ?'.split('').map(function(x){return \
> x.charCodeAt().toString(16)}) <eddyb>   ["ef", "bc", "8e"]
> <eddyb>   the codepoints are equal to the bytes in the utf8 representation of that \
> fancier character (U+ff0e) <nemo>    fancier character? :)
> Quit      zanny (~zanny@64.9.41.220) has left this server (Quit: Konversation \
> terminated!). <eddyb>   well, if it's visible at all
> Join      Zanny (~zanny@64.9.41.220) has joined this channel.
> <eddyb>   Sho_: I was in here back in January, but I'm not sure what happened with \
> this bug. I may be able to find the ranges (not many characters are affected, \
> IIRC), if you need them <Sho_>    nemo: He's saying that for some reason U+ff0e \
> (which encoded to Utf-8 is 0xEF 0xBC 0x8E in hex notation) shows as three chars in \
> his Konvi Quit      Zanny (~zanny@64.9.41.220) has left this server (Client Quit).
> Join      zanny (~zanny@64.9.41.220) has joined this channel.
> Part      zanny (~zanny@64.9.41.220) has left this channel.
> <nemo>    Sho_: right
> <nemo>    Sho_: I was just amused by "fancier"
> <nemo>    Sho_: he says it happens if 
> <nemo>    ï ?
> <nemo>    is on its own on a single line, or something
> * nemo shrugs
> <eddyb>   Sho_: you don't get those three characters for the lines with only \
> U+ff0e? <PovAddict>  [14:55] >> :nemo!nemo@c-68-50-78-21.hsd1.md.comcast.net \
> PRIVMSG #konversation :%EFC%8E%0A <PovAddict>  argh
> <PovAddict>  [14:55] >> :nemo!nemo@c-68-50-78-21.hsd1.md.comcast.net PRIVMSG \
> #konversation :%EFC%8E%0A <PovAddict>  Sho_: what's the keystroke to avoid \
> formatting? ctrl-enter? >.> <Sho_>    i think so
> <PovAddict>  test
> <PovAddict>  test
> <nemo>    ew
> <Sho_>    eddyb: yeah, i do ... mhh
> <eddyb>   btw, my conclusions were that it happens for U+XY0f, where X != 0 and Y \
> has its top bit and at least another bit set Quit      MadAGu \
> (~MadAGu@ppp005054075191.access.hol.gr) has left this server (Quit: Battle control \
> Terminated). Join      zanny (~zanny@64.9.41.220) has joined this channel.
> <Sho_>    I'm rebuilding with a modification, sec
> 
> -cycle-
> 
> Join      You (~sho@kde/hein) have joined the channel #konversation.
> Topic     The channel topic is "Konversation 1.5-rc1 has been released! Get it at \
> http://konversation.kde.org || Get the latest sources: \
> http://userbase.kde.org/Konversation/Sources || Wiki: \
> http://userbase.kde.org/Konversation || Report bugs and wishes: \
> https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic     The topic was \
> set by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL       Channel URL: \
> http://konversation.kde.org/ Join      MadAGu \
> (~MadAGu@ppp005054075191.access.hol.gr) has joined this channel. <Sho_>    hit me \
> with the testcase, please <nemo>    ï ?
> <Sho_>    ok, so it's not the unicode sterilizer at least (didn't think so)
> <nemo>    commented it out? :)
> <Sho_>    early return
> Mode      Channel modes: +nt
> Created   This channel was created on 26.11.2006 07:42.
> <Sho_>    so it must be in the decode pipeline, then
> <eddyb>   Sho_: adding another character after it makes it show up proper, that \
> might be important <eddyb>   and I'm not sure it works after a longer message ï ?
> <Sho_>    it might simply be a failure of the "is this unicode?" heuristic
> <Sho_>    if we can't determine we're dealing with unicode, we fall back to latin1
> <Sho_>    since we need to deal with mixed-encoding channels
> <eddyb>   I think I have it forcing utf8. maybe those characters are blacklisted
> <Sho_>    you can't force it unless you modified the source
> <Sho_>    brb
> 
> -cycle-
> 
> Join      You (~sho@kde/hein) have joined the channel #konversation.
> Topic     The channel topic is "Konversation 1.5-rc1 has been released! Get it at \
> http://konversation.kde.org || Get the latest sources: \
> http://userbase.kde.org/Konversation/Sources || Wiki: \
> http://userbase.kde.org/Konversation || Report bugs and wishes: \
> https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic     The topic was \
> set by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL       Channel URL: \
> http://konversation.kde.org/ Notify    zeigor is online (chat.freenode.org).
> Notify    starbuck11 is online (chat.freenode.org).
> <Sho_>    hit me
> <eddyb>   ?
> <Sho_>    there we go
> Mode      Channel modes: +nt
> Created   This channel was created on 26.11.2006 07:42.
> <Sho_>           // if channel encoding is utf-8 and the string is definitely not \
> utf-8 <Sho_>           // then try latin-1
> <Sho_>           if (codec->mibEnum() == 106)
> <Sho_>               codec = QTextCodec::codecForMib( 4 /* iso-8859-1 */ );
> <eddyb>   ? ?????
> Quit      zanny (~zanny@64.9.41.220) has left this server (Quit: Konversation \
> terminated!). <eddyb>   no, those look fine. hmmmm was this fixed?
> Join      DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has \
> joined this channel. <eddyb>   (playing with another bug. this time, it's something \
> I love) Join      zanny (~zanny@64.9.41.220) has joined this channel.
> <Sho_>    what was the bug?
> Quit      zanny (~zanny@64.9.41.220) has left this server (Client Quit).
> Join      zanny (~zanny@64.9.41.220) has joined this channel.
> Quit      zanny (~zanny@64.9.41.220) has left this server (Client Quit).
> Join      movciari (~movciari@ip-89-102-11-29.net.upcbroadband.cz) has joined this \
> channel. <Sho_>    ok, so we have this body of code deciding that 0xEF 0xBC 0x8E \
> can't be utf8: http://pastebin.kde.org/pcqljwaod/npqvb1 <eddyb>   Sho_: \
> non-printable unicode ended up in large (like 5 lines tall) brackets and other \
> things <eddyb>   Sho_: so http://pastebin.kde.org/pcqljwaod/npqvb1#line-45 ?
> Join      atomik (~atomik@109.130.111.188) has joined this channel.
> Quit      luiorpe1 (~luis@2001:720:101c:7003:222:43ff:fe44:10a0) has left this \
> server (Quit: Konversation terminated!). <eddyb>   maybe it's off by one? I can't \
> tell <eddyb>   Sho_: lol, I think all the trail bytes checks should use >= :P
> Quit      deltra (~fedex@2001:1388:18c1:e687:e2db:55ff:fea5:2f47) has left this \
> server (Quit: Konversation terminated!). <eddyb>   that means there's two byte \
> encodings which can do this. hmmm <Sho_>    eddyb: that code was written by \
> Netscape, not us, in our defense ;) <Sho_>    brb
> 
> -cycle-
> 
> Join    You (~sho@5.28.125.146) have joined the channel #konversation.
> Topic   The channel topic is "Konversation 1.5-rc1 has been released! Get it at \
> http://konversation.kde.org || Get the latest sources: \
> http://userbase.kde.org/Konversation/Sources || Wiki: \
> http://userbase.kde.org/Konversation || Report bugs and wishes: \
> https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic   The topic was set \
> by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL     Channel URL: \
> http://konversation.kde.org/ Mode    Channel modes: +nt
> Created This channel was created on 26.11.2006 07:42.
> <Testkonv> hit me please
> <eddyb> ï ?
> <Testkonv> it's the japanese shit
> <Testkonv> figures
> <eddyb> huh?
> <Testkonv> eddyb: it trips line 17
> <eddyb> and here I was, thinking I found an off-by-one
> <Testkonv> i don't see why there would be an off by one
> <Testkonv> the %0A isn't in there if you thought that
> Quit    e2718 (~hdesk@p54849222.dip0.t-ipconnect.de) has left this server (Quit: \
> Konversation terminated!). <eddyb> oh I'm an idiot, the condition is inverted from \
> what I thought <Testkonv> no, the problem is that cartman thought it would be \
> really smart to make a "is this utf8?" function also do "is this a japanese codec?" \
> <eddyb> so the case where i + clen == len still works. okay then, time to dive in \
> the japanese detection, I guess <Testkonv> eddyb: the problem is that the japanese \
> detection is probably right <Testkonv> that byte string might be perfectly valid \
> K_JIS or K_SJIS <eddyb> but I want Unicode - and by forcing I meant I switched from \
> Default (which I now realized would do nothing, because Default is just an alias) \
> Quit    DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has left \
> this server (Quit: Konversation terminated!). <Testkonv> yes
> Join    DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has joined \
> this channel. <Testkonv> the problem is that our decode pipeline decision matrix is \
> the wrong way around <Testkonv> this is how it works right now:
> <Testkonv> 1. we determine if the byte array can be treated as utf-8, by way of a \
> function that checks if it could possibly *not* be utf-8 by way of sanity checking \
> and a japanese codec guesser <Testkonv> 2. if we think it can be treated as utf-8, \
> we treat it as utf-8 <Testkonv> 3. we determine it can't be treated as utf-8, we \
> look at the user setting <Testkonv> 4. if the user setting is utf-8, we fall back \
> to latin-1 since we already ruled out utf-8 <Testkonv> so basically, imho it should \
> do this instead: <Testkonv> 1. always use the user-setting, unless it's utf-8 and \
> our sanity-checker says it can't be utf-8 <Testkonv> 2. if our sanity-checker says \
> it can't be utf-8, try to guess a japanese codec. if we guess one, use that \
> <Testkonv> 3. otherwise  fall back to latin 1 <Testkonv> that encompasses the same \
> feature set - handling of mixed-encoding channels and extra love for japanese \
> chatters - but respects user settings and avoids this clash between the \
> wrongly-used utf8 sanity checker and the japanese codec guesser <Testkonv> now, \
> it's pretty clear why the code wound up like it did <Testkonv> there was a desire \
> to optimize for performance <Testkonv> the idea was "if we can tell this is utf-8, \
> let's not re-encode to utf8" <Testkonv> but that was premature optimization and \
> wromgly implemented, because "is there any reason this can't be utf8?" isn't the \
> same as "this is utf8!" <Testkonv> did everyone fall from their chairs?
> Quit    joaoc (~joaoc@131.227.207.40) has left this server (Quit: Konversation \
> terminated!). <Testkonv> eddyb: "Default" for a *channel* actually means "Use \
> what's set in the Identity" btw <eddyb> logic fallacies <3
> <Testkonv> morel ike "premature optimization <3"
> <eddyb> (I was thinking of the part where it's not the same as "this is utf8!")
> <Testkonv> whoever wrote the original code was trying to hard to be able to use \
> QString::fromUtf8() when he thought he could determine it's utf8 <Testkonv> but \
> that's dumb <Testkonv> because the sanity checker does much the same work as the \
> stream encoder anyway <Testkonv> (ok, a ltitle less, sure)
> <Testkonv> alrighty
> <Testkonv> what i'm going to now is put the above into code
> <eddyb> Testkonv: thanks a lot :D
> <Testkonv> then i'll dump that on reviewboard with a log of this chat, because \
> argnel needs to look at it and he's at work <PovAddict>     and this is why \
> Konversation is a high-quality award-winning app <Testkonv> heh :)
> Join    itaylor57 (~itaylor@pool-71-170-156-139.dllstx.fios.verizon.net) has joined \
> this channel. <psn>   Testkonv: well actually the original author is pretty sure \
> the code has changed over the years ;) <psn>   not that I can remember exactly what \
> the original code did... all I know is that there sure was no check for japanese \
> codecs :) <Testkonv> psn: nice to see you still lurk about ;)
> Part    CounterPillow (~fratti@unaffiliated/counterpillow) has left this channel \
> ("Later folks!"). <psn>   that's what I do best... lurking that is :)
> Join    genstorm (~andreas@80-121-86-191.adsl.highway.telekom.at) has joined this \
> channel. <Testkonv> ok, running a build
> <Testkonv> eddyb: stand by for testing
> <Testkonv> oh ... wait, compile error
> Join    luiorpe1 (~luiorpe1@84.126.68.164.dyn.user.ono.com) has joined this \
> channel. <Testkonv> there we go
> 
> -cycle-
> 
> Join    You (~sho@5.28.125.146) have joined the channel #konversation.
> Topic   The channel topic is "Konversation 1.5-rc1 has been released! Get it at \
> http://konversation.kde.org || Get the latest sources: \
> http://userbase.kde.org/Konversation/Sources || Wiki: \
> http://userbase.kde.org/Konversation || Report bugs and wishes: \
> https://bugs.kde.org/enter_bug.cgi?product=konversation". Topic   The topic was set \
> by Sho_!~sho@kde/hein on 17.03.2013 03:18. Mode    Channel modes: +nt
> Created This channel was created on 26.11.2006 07:42.
> URL     Channel URL: http://konversation.kde.org/
> <Testkonv> hit me
> <Testkonv> PovAddict, you maybe?
> <PovAddict>     I don't know what I'm supposed to send to trigger it
> <PovAddict>     ?
> <PovAddict>     that?
> <eddyb> ?
> <eddyb> (to be sure :P)
> <Testkonv> works fine
> Quit    mischa (~mischa@84-112-236-94.dynamic.surfer.at) has left this server \
> (Quit: Konversation terminated!). <Testkonv> as an added bonus, we're no longer \
> instanciating and destroying the japanese codec guesser for EVERY LINE <nemo>  :)
> <nemo>  huh. why is that not a singleton?
> <nemo>  seems like the kind of thing that would be a standalone function
> <nemo>  wonder what the codec guesser instantiates 
> <Testkonv> nemo: it's a static member now
> 
> 
> Diffs
> -----
> 
> src/common.cpp bd84077 
> src/irc/server.h be9d72d 
> src/irc/server.cpp abb42c2 
> src/unicode.cpp 6407b19 
> 
> Diff: http://git.reviewboard.kde.org/r/113823/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Eike Hein
> 
> 


[Attachment #5 (text/html)]

<html>
 <body>
  <div style="font-family: Verdana, Arial, Helvetica, Sans-Serif;">
   <table bgcolor="#f9f3c9" width="100%" cellpadding="8" style="border: 1px #c9c399 \
solid;">  <tr>
     <td>
      This is an automatically generated e-mail. To reply, visit:
      <a href="http://git.reviewboard.kde.org/r/113823/">http://git.reviewboard.kde.org/r/113823/</a>
  </td>
    </tr>
   </table>
   <br />



 <p>Ship it!</p>



 <pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: \
-pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">Ship It!</pre>  <br />









<p>- Bernd Buschinski</p>


<br />
<p>On November 12th, 2013, 7:02 p.m. UTC, Eike Hein wrote:</p>








<table bgcolor="#fefadf" width="100%" cellspacing="0" cellpadding="8" \
style="background-image: \
url('http://git.reviewboard.kde.org/static/rb/images/review_request_box_top_bg.ab6f3b1072c9.png'); \
background-position: left top; background-repeat: repeat-x; border: 1px black \
solid;">  <tr>
  <td>

<div>Review request for Konversation and Eli MacKenzie.</div>
<div>By Eike Hein.</div>


<p style="color: grey;"><i>Updated Nov. 12, 2013, 7:02 p.m.</i></p>









<div style="margin-top: 1.5em;">
 <b style="color: #575012; font-size: 10pt;">Repository: </b>
konversation
</div>


<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Description </h1>
 <table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" \
style="border: 1px solid #b8b5a0">  <tr>
  <td>
   <pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: \
-moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: \
break-word;">This one&#39;s basically for Eli, and there&#39;s no way to get around \
reading the full argument (which at one point does get written up nicely), so \
I&#39;ll just dump the log:

&lt;eddyb&gt;   Sho_: hi, are you on Konversation right now?
&lt;eddyb&gt;   does this look like three weird characters (ignore the quotes)? \
&quot;?&quot; - but then there&#39;s this: ? &lt;Sho_&gt;    eddyb: no
&lt;PovAddict&gt;  both look the same to me, some kind of dot in a wider-than-normal \
character &lt;eddyb&gt;   ï ?
&lt;Sho_&gt;    eddyb: http://simplest-image-hosting.net/png-0-im1498
&lt;eddyb&gt;   although it&#39;s pointless if you&#39;re not on Konversation, \
that&#39;s where the bug shows up &lt;Sho_&gt;    The screenshot is from Konversation
&lt;nemo&gt;    Sho_: he&#39;s using a build from a few months back apparently
&lt;eddyb&gt;   uhm, no, I said something else
&lt;nemo&gt;    oh?
&lt;PovAddict&gt;  now I did see three different characters
&lt;eddyb&gt;   PovAddict: my second message?
&lt;PovAddict&gt;  http://wstaw.org/m/2013/11/12/plasma-desktopVY4678.png
Join      dhenry_ (dhenry@nat/redhat/x-xuezzbbmjbcdcqsv) has joined this channel.
Quit      dhenry (dhenry@nat/redhat/x-oxfhrtejykuqifva) has left this server (Ping \
timeout: 272 seconds). &lt;eddyb&gt;   &#39;ï \
?&#39;.split(&#39;&#39;).map(function(x){return x.charCodeAt().toString(16)}) \
&lt;eddyb&gt;   [&quot;ef&quot;, &quot;bc&quot;, &quot;8e&quot;] &lt;eddyb&gt;   the \
codepoints are equal to the bytes in the utf8 representation of that fancier \
character (U+ff0e) &lt;nemo&gt;    fancier character? :)
Quit      zanny (~zanny@64.9.41.220) has left this server (Quit: Konversation \
terminated!). &lt;eddyb&gt;   well, if it&#39;s visible at all
Join      Zanny (~zanny@64.9.41.220) has joined this channel.
&lt;eddyb&gt;   Sho_: I was in here back in January, but I&#39;m not sure what \
happened with this bug. I may be able to find the ranges (not many characters are \
affected, IIRC), if you need them &lt;Sho_&gt;    nemo: He&#39;s saying that for some \
reason U+ff0e (which encoded to Utf-8 is 0xEF 0xBC 0x8E in hex notation) shows as \
three chars in his Konvi Quit      Zanny (~zanny@64.9.41.220) has left this server \
(Client Quit). Join      zanny (~zanny@64.9.41.220) has joined this channel.
Part      zanny (~zanny@64.9.41.220) has left this channel.
&lt;nemo&gt;    Sho_: right
&lt;nemo&gt;    Sho_: I was just amused by &quot;fancier&quot;
&lt;nemo&gt;    Sho_: he says it happens if 
&lt;nemo&gt;    ï ?
&lt;nemo&gt;    is on its own on a single line, or something
   * nemo shrugs
&lt;eddyb&gt;   Sho_: you don&#39;t get those three characters for the lines with \
only U+ff0e? &lt;PovAddict&gt;  [14:55] &gt;&gt; \
:nemo!nemo@c-68-50-78-21.hsd1.md.comcast.net PRIVMSG #konversation :%EFC%8E%0A \
&lt;PovAddict&gt;  argh &lt;PovAddict&gt;  [14:55] &gt;&gt; \
:nemo!nemo@c-68-50-78-21.hsd1.md.comcast.net PRIVMSG #konversation :%EFC%8E%0A \
&lt;PovAddict&gt;  Sho_: what&#39;s the keystroke to avoid formatting? ctrl-enter? \
&gt;.&gt; &lt;Sho_&gt;    i think so
&lt;PovAddict&gt;  test
&lt;PovAddict&gt;  test
&lt;nemo&gt;    ew
&lt;Sho_&gt;    eddyb: yeah, i do ... mhh
&lt;eddyb&gt;   btw, my conclusions were that it happens for U+XY0f, where X != 0 and \
Y has its top bit and at least another bit set Quit      MadAGu \
(~MadAGu@ppp005054075191.access.hol.gr) has left this server (Quit: Battle control \
Terminated). Join      zanny (~zanny@64.9.41.220) has joined this channel.
&lt;Sho_&gt;    I&#39;m rebuilding with a modification, sec

-cycle-

Join      You (~sho@kde/hein) have joined the channel #konversation.
Topic     The channel topic is &quot;Konversation 1.5-rc1 has been released! Get it \
at http://konversation.kde.org || Get the latest sources: \
http://userbase.kde.org/Konversation/Sources || Wiki: \
http://userbase.kde.org/Konversation || Report bugs and wishes: \
https://bugs.kde.org/enter_bug.cgi?product=konversation&quot;. Topic     The topic \
was set by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL       Channel URL: \
http://konversation.kde.org/ Join      MadAGu (~MadAGu@ppp005054075191.access.hol.gr) \
has joined this channel. &lt;Sho_&gt;    hit me with the testcase, please
&lt;nemo&gt;    ï ?
&lt;Sho_&gt;    ok, so it&#39;s not the unicode sterilizer at least (didn&#39;t think \
so) &lt;nemo&gt;    commented it out? :)
&lt;Sho_&gt;    early return
Mode      Channel modes: +nt
Created   This channel was created on 26.11.2006 07:42.
&lt;Sho_&gt;    so it must be in the decode pipeline, then
&lt;eddyb&gt;   Sho_: adding another character after it makes it show up proper, that \
might be important &lt;eddyb&gt;   and I&#39;m not sure it works after a longer \
message ï ? &lt;Sho_&gt;    it might simply be a failure of the &quot;is this \
unicode?&quot; heuristic &lt;Sho_&gt;    if we can&#39;t determine we&#39;re dealing \
with unicode, we fall back to latin1 &lt;Sho_&gt;    since we need to deal with \
mixed-encoding channels &lt;eddyb&gt;   I think I have it forcing utf8. maybe those \
characters are blacklisted &lt;Sho_&gt;    you can&#39;t force it unless you modified \
the source &lt;Sho_&gt;    brb

-cycle-

Join      You (~sho@kde/hein) have joined the channel #konversation.
Topic     The channel topic is &quot;Konversation 1.5-rc1 has been released! Get it \
at http://konversation.kde.org || Get the latest sources: \
http://userbase.kde.org/Konversation/Sources || Wiki: \
http://userbase.kde.org/Konversation || Report bugs and wishes: \
https://bugs.kde.org/enter_bug.cgi?product=konversation&quot;. Topic     The topic \
was set by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL       Channel URL: \
http://konversation.kde.org/ Notify    zeigor is online (chat.freenode.org).
Notify    starbuck11 is online (chat.freenode.org).
&lt;Sho_&gt;    hit me
&lt;eddyb&gt;   ?
&lt;Sho_&gt;    there we go
Mode      Channel modes: +nt
Created   This channel was created on 26.11.2006 07:42.
&lt;Sho_&gt;           // if channel encoding is utf-8 and the string is definitely \
not utf-8 &lt;Sho_&gt;           // then try latin-1
&lt;Sho_&gt;           if (codec-&gt;mibEnum() == 106)
&lt;Sho_&gt;               codec = QTextCodec::codecForMib( 4 /* iso-8859-1 */ );
&lt;eddyb&gt;   ? ?????
Quit      zanny (~zanny@64.9.41.220) has left this server (Quit: Konversation \
terminated!). &lt;eddyb&gt;   no, those look fine. hmmmm was this fixed?
Join      DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has joined \
this channel. &lt;eddyb&gt;   (playing with another bug. this time, it&#39;s \
something I love) Join      zanny (~zanny@64.9.41.220) has joined this channel.
&lt;Sho_&gt;    what was the bug?
Quit      zanny (~zanny@64.9.41.220) has left this server (Client Quit).
Join      zanny (~zanny@64.9.41.220) has joined this channel.
Quit      zanny (~zanny@64.9.41.220) has left this server (Client Quit).
Join      movciari (~movciari@ip-89-102-11-29.net.upcbroadband.cz) has joined this \
channel. &lt;Sho_&gt;    ok, so we have this body of code deciding that 0xEF 0xBC \
0x8E can&#39;t be utf8: http://pastebin.kde.org/pcqljwaod/npqvb1 &lt;eddyb&gt;   \
Sho_: non-printable unicode ended up in large (like 5 lines tall) brackets and other \
things &lt;eddyb&gt;   Sho_: so http://pastebin.kde.org/pcqljwaod/npqvb1#line-45 ?
Join      atomik (~atomik@109.130.111.188) has joined this channel.
Quit      luiorpe1 (~luis@2001:720:101c:7003:222:43ff:fe44:10a0) has left this server \
(Quit: Konversation terminated!). &lt;eddyb&gt;   maybe it&#39;s off by one? I \
can&#39;t tell &lt;eddyb&gt;   Sho_: lol, I think all the trail bytes checks should \
use &gt;= :P Quit      deltra (~fedex@2001:1388:18c1:e687:e2db:55ff:fea5:2f47) has \
left this server (Quit: Konversation terminated!). &lt;eddyb&gt;   that means \
there&#39;s two byte encodings which can do this. hmmm &lt;Sho_&gt;    eddyb: that \
code was written by Netscape, not us, in our defense ;) &lt;Sho_&gt;    brb

-cycle-

Join    You (~sho@5.28.125.146) have joined the channel #konversation.
Topic   The channel topic is &quot;Konversation 1.5-rc1 has been released! Get it at \
http://konversation.kde.org || Get the latest sources: \
http://userbase.kde.org/Konversation/Sources || Wiki: \
http://userbase.kde.org/Konversation || Report bugs and wishes: \
https://bugs.kde.org/enter_bug.cgi?product=konversation&quot;. Topic   The topic was \
set by Sho_!~sho@kde/hein on 17.03.2013 03:18. URL     Channel URL: \
http://konversation.kde.org/ Mode    Channel modes: +nt
Created This channel was created on 26.11.2006 07:42.
&lt;Testkonv&gt; hit me please
&lt;eddyb&gt; ï ?
&lt;Testkonv&gt; it&#39;s the japanese shit
&lt;Testkonv&gt; figures
&lt;eddyb&gt; huh?
&lt;Testkonv&gt; eddyb: it trips line 17
&lt;eddyb&gt; and here I was, thinking I found an off-by-one
&lt;Testkonv&gt; i don&#39;t see why there would be an off by one
&lt;Testkonv&gt; the %0A isn&#39;t in there if you thought that
Quit    e2718 (~hdesk@p54849222.dip0.t-ipconnect.de) has left this server (Quit: \
Konversation terminated!). &lt;eddyb&gt; oh I&#39;m an idiot, the condition is \
inverted from what I thought &lt;Testkonv&gt; no, the problem is that cartman thought \
it would be really smart to make a &quot;is this utf8?&quot; function also do \
&quot;is this a japanese codec?&quot; &lt;eddyb&gt; so the case where i + clen == len \
still works. okay then, time to dive in the japanese detection, I guess \
&lt;Testkonv&gt; eddyb: the problem is that the japanese detection is probably right \
&lt;Testkonv&gt; that byte string might be perfectly valid K_JIS or K_SJIS \
&lt;eddyb&gt; but I want Unicode - and by forcing I meant I switched from Default \
(which I now realized would do nothing, because Default is just an alias) Quit    \
DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has left this server \
(Quit: Konversation terminated!). &lt;Testkonv&gt; yes
Join    DaRTo (~dart@182.red-80-26-211.adsl.dynamic.ccgg.telefonica.net) has joined \
this channel. &lt;Testkonv&gt; the problem is that our decode pipeline decision \
matrix is the wrong way around &lt;Testkonv&gt; this is how it works right now:
&lt;Testkonv&gt; 1. we determine if the byte array can be treated as utf-8, by way of \
a function that checks if it could possibly *not* be utf-8 by way of sanity checking \
and a japanese codec guesser &lt;Testkonv&gt; 2. if we think it can be treated as \
utf-8, we treat it as utf-8 &lt;Testkonv&gt; 3. we determine it can&#39;t be treated \
as utf-8, we look at the user setting &lt;Testkonv&gt; 4. if the user setting is \
utf-8, we fall back to latin-1 since we already ruled out utf-8 &lt;Testkonv&gt; so \
basically, imho it should do this instead: &lt;Testkonv&gt; 1. always use the \
user-setting, unless it&#39;s utf-8 and our sanity-checker says it can&#39;t be utf-8 \
&lt;Testkonv&gt; 2. if our sanity-checker says it can&#39;t be utf-8, try to guess a \
japanese codec. if we guess one, use that &lt;Testkonv&gt; 3. otherwise  fall back to \
latin 1 &lt;Testkonv&gt; that encompasses the same feature set - handling of \
mixed-encoding channels and extra love for japanese chatters - but respects user \
settings and avoids this clash between the wrongly-used utf8 sanity checker and the \
japanese codec guesser &lt;Testkonv&gt; now, it&#39;s pretty clear why the code wound \
up like it did &lt;Testkonv&gt; there was a desire to optimize for performance
&lt;Testkonv&gt; the idea was &quot;if we can tell this is utf-8, let&#39;s not \
re-encode to utf8&quot; &lt;Testkonv&gt; but that was premature optimization and \
wromgly implemented, because &quot;is there any reason this can&#39;t be utf8?&quot; \
isn&#39;t the same as &quot;this is utf8!&quot; &lt;Testkonv&gt; did everyone fall \
from their chairs? Quit    joaoc (~joaoc@131.227.207.40) has left this server (Quit: \
Konversation terminated!). &lt;Testkonv&gt; eddyb: &quot;Default&quot; for a \
*channel* actually means &quot;Use what&#39;s set in the Identity&quot; btw \
&lt;eddyb&gt; logic fallacies &lt;3 &lt;Testkonv&gt; morel ike &quot;premature \
optimization &lt;3&quot; &lt;eddyb&gt; (I was thinking of the part where it&#39;s not \
the same as &quot;this is utf8!&quot;) &lt;Testkonv&gt; whoever wrote the original \
code was trying to hard to be able to use QString::fromUtf8() when he thought he \
could determine it&#39;s utf8 &lt;Testkonv&gt; but that&#39;s dumb
&lt;Testkonv&gt; because the sanity checker does much the same work as the stream \
encoder anyway &lt;Testkonv&gt; (ok, a ltitle less, sure)
&lt;Testkonv&gt; alrighty
&lt;Testkonv&gt; what i&#39;m going to now is put the above into code
&lt;eddyb&gt; Testkonv: thanks a lot :D
&lt;Testkonv&gt; then i&#39;ll dump that on reviewboard with a log of this chat, \
because argnel needs to look at it and he&#39;s at work &lt;PovAddict&gt;     and \
this is why Konversation is a high-quality award-winning app &lt;Testkonv&gt; heh :)
Join    itaylor57 (~itaylor@pool-71-170-156-139.dllstx.fios.verizon.net) has joined \
this channel. &lt;psn&gt;   Testkonv: well actually the original author is pretty \
sure the code has changed over the years ;) &lt;psn&gt;   not that I can remember \
exactly what the original code did... all I know is that there sure was no check for \
japanese codecs :) &lt;Testkonv&gt; psn: nice to see you still lurk about ;)
Part    CounterPillow (~fratti@unaffiliated/counterpillow) has left this channel \
(&quot;Later folks!&quot;). &lt;psn&gt;   that&#39;s what I do best... lurking that \
is :) Join    genstorm (~andreas@80-121-86-191.adsl.highway.telekom.at) has joined \
this channel. &lt;Testkonv&gt; ok, running a build
&lt;Testkonv&gt; eddyb: stand by for testing
&lt;Testkonv&gt; oh ... wait, compile error
Join    luiorpe1 (~luiorpe1@84.126.68.164.dyn.user.ono.com) has joined this channel.
&lt;Testkonv&gt; there we go

-cycle-

Join    You (~sho@5.28.125.146) have joined the channel #konversation.
Topic   The channel topic is &quot;Konversation 1.5-rc1 has been released! Get it at \
http://konversation.kde.org || Get the latest sources: \
http://userbase.kde.org/Konversation/Sources || Wiki: \
http://userbase.kde.org/Konversation || Report bugs and wishes: \
https://bugs.kde.org/enter_bug.cgi?product=konversation&quot;. Topic   The topic was \
set by Sho_!~sho@kde/hein on 17.03.2013 03:18. Mode    Channel modes: +nt
Created This channel was created on 26.11.2006 07:42.
URL     Channel URL: http://konversation.kde.org/
&lt;Testkonv&gt; hit me
&lt;Testkonv&gt; PovAddict, you maybe?
&lt;PovAddict&gt;     I don&#39;t know what I&#39;m supposed to send to trigger it
&lt;PovAddict&gt;     ?
&lt;PovAddict&gt;     that?
&lt;eddyb&gt; ?
&lt;eddyb&gt; (to be sure :P)
&lt;Testkonv&gt; works fine
Quit    mischa (~mischa@84-112-236-94.dynamic.surfer.at) has left this server (Quit: \
Konversation terminated!). &lt;Testkonv&gt; as an added bonus, we&#39;re no longer \
instanciating and destroying the japanese codec guesser for EVERY LINE &lt;nemo&gt;  \
:) &lt;nemo&gt;  huh. why is that not a singleton?
&lt;nemo&gt;  seems like the kind of thing that would be a standalone function
&lt;nemo&gt;  wonder what the codec guesser instantiates 
&lt;Testkonv&gt; nemo: it&#39;s a static member now
</pre>
  </td>
 </tr>
</table>



<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Diffs</b> </h1>
<ul style="margin-left: 3em; padding-left: 0;">

 <li>src/common.cpp <span style="color: grey">(bd84077)</span></li>

 <li>src/irc/server.h <span style="color: grey">(be9d72d)</span></li>

 <li>src/irc/server.cpp <span style="color: grey">(abb42c2)</span></li>

 <li>src/unicode.cpp <span style="color: grey">(6407b19)</span></li>

</ul>

<p><a href="http://git.reviewboard.kde.org/r/113823/diff/" style="margin-left: \
3em;">View Diff</a></p>







  </td>
 </tr>
</table>








  </div>
 </body>
</html>



_______________________________________________
Konversation-devel mailing list
Konversation-devel@kde.org
https://mail.kde.org/mailman/listinfo/konversation-devel


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic