

List:       exmh-workers
Subject:    UTF-8 related exmh bugs
From:       Markus Kuhn <Markus.Kuhn () cl ! cam ! ac ! uk>
Date:       2001-05-06 22:10:52

Here are various UTF-8 related bugs that I've found in exmh 2.3:

A) When I reply to a UTF-8 message and quote its content, I get garbled
text displayed in sedit that looks like UTF-8 text decoded as ISO 8859-1
text. If I manually append "; charset=UTF-8" to text/plain in the MIME
header, the resulting message will still be received correctly.
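
The symptom in A) can be reproduced outside exmh. The following is a
minimal sketch in Python (not exmh's Tcl, and not a claim about what
exmh does internally); the sample word is an arbitrary assumption. It
shows what UTF-8 bytes look like when they are decoded as ISO 8859-1,
and that the damage is reversible as long as nothing else touches the
text.

```python
# Illustrative only: the sample text is an assumption, not taken from exmh.
original = "Z\u00fcrich"                  # "Zürich"

# Bytes as they would travel in a UTF-8 message body.
wire = original.encode("utf-8")

# Decoding those bytes with the wrong charset yields the garbled form:
# each byte of the two-byte sequence for U+00FC becomes its own character.
garbled = wire.decode("iso-8859-1")

# Reversing the wrong step recovers the text, because ISO 8859-1 maps
# every byte value to exactly one character.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
```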

B) When I cut&paste non-ASCII characters (e.g., lines from a received
UTF-8 message) into an sedit window, they are displayed correctly.
Apparently, Tk is able to cut&paste arbitrary UTF-8 text between its
widgets. However, when I send off the message, "charset=us-ascii" gets
appended in the MIME header, and in the outgoing message, all the
non-ASCII characters have been replaced by question marks. :-(( If I
manually add "; charset=UTF-8" to the MIME header, the non-ASCII
characters still get replaced by question marks. :-((
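
Question marks of this kind are what a lossy conversion to a smaller
character set typically produces. As a small illustration in Python
(an assumption chosen for demonstration, not a trace of what exmh
actually executes), encoding to US-ASCII with a "replace" error policy
substitutes "?" for every character that has no ASCII equivalent:

```python
# Illustrative only: the sample text is an assumption.
text = "Z\u00fcrich \u20ac"               # "Zürich €"

# errors="replace" silently substitutes "?" for unencodable characters,
# so the original characters cannot be recovered afterwards.
lossy = text.encode("ascii", errors="replace")
```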

I am not sure exactly what is going on, but I suspect that exmh applies
mechanisms that might have been appropriate when Tcl (< 8.1) encoded
strings in some selectable 8-bit encoding, before it switched everything
to Unicode. What should happen today when an sedit message is sent is
the following:

      a) Check which Unicode characters are found in the buffer.

      b) Test if only characters in the range U+0000..U+007F are found,
         and if yes, add charset=US-ASCII to the MIME header and send
         out the buffer unmodified.

      c) Test if only characters in the range U+0000..U+00FF are found,
         and if yes, add charset=ISO-8859-1 to the MIME header and
         send out the buffer after passing it through a UTF-8->ISO-8859-1
         converter.

      d) Optional: Test if only characters in the range of some optionally
         specifiable legacy encoding XXX (like ISO 8859-7) are found, and
         if yes, add the name of this encoding XXX to the MIME header
         and send out the buffer after sending it through a UTF-8->XXX
         converter.

      e) In all other cases, add charset=UTF-8 to the MIME header and send
         out the buffer unmodified.

It seems b) is already implemented, but at the moment nothing detects
that the sedit buffer can be sent without loss of information only as
UTF-8, labels the MIME header as UTF-8 accordingly, and avoids an
unnecessary and destructive conversion before sending.
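
Steps a) to e) above can be sketched as follows. This is a minimal
illustration in Python rather than exmh's Tcl; the function name
choose_charset and the legacy parameter are hypothetical, and the
sketch assumes the buffer is available as a Unicode string. Instead of
scanning code point ranges explicitly, it tries each candidate charset
in order and takes the first one that encodes the text without loss,
which gives the same result for US-ASCII and ISO 8859-1. Note that a
Python string always has to be encoded, whereas a buffer that is
already UTF-8 internally could be sent unmodified in case e).

```python
def choose_charset(text, legacy=()):
    """Pick the narrowest charset that can represent text without loss.

    Hypothetical helper, not part of exmh. Returns a tuple of
    (charset label for the MIME header, encoded message body).
    """
    # Order matters: narrowest first, UTF-8 as the final fallback.
    candidates = ("us-ascii", "iso-8859-1", *legacy, "utf-8")
    for charset in candidates:
        try:
            # Strict encoding raises instead of substituting "?".
            return charset, text.encode(charset)
        except UnicodeEncodeError:
            continue
    # Only reachable for text that is not valid Unicode (lone surrogates).
    raise ValueError("text cannot be encoded in any candidate charset")


# Usage: the label goes into the Content-Type header, the bytes are sent.
label, body = choose_charset("caf\u00e9")
content_type = "text/plain; charset=" + label
```

With legacy=("iso-8859-7",), a message containing only Greek and ASCII
characters would be labelled and encoded as ISO 8859-7 (step d);
without it, the same message falls through to UTF-8 (step e).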

https://sourceforge.net/tracker/?func=detail&aid=421898&group_id=12885&atid=112885

C) When I try to cut&paste from an sedit window to a UTF-8 xterm, I also
get question marks for non-Latin-1 characters.

I suspect that this is because Tk does not yet support the new
UTF8_STRING property type and selection target. It is just like the old
STRING type and target defined in the ICCCM, but uses UTF-8 instead of
ISO 8859-1. This is probably something that will have to be fixed in Tk
itself.

https://sourceforge.net/tracker/?func=detail&aid=418653&group_id=12997&atid=362997

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>