'Re: [Haskell-cafe] Ready for testing: Unicode support for Handle I/O'


List:       haskell-cafe
Subject:    Re: [Haskell-cafe] Ready for testing: Unicode support for Handle I/O
From:       Duncan Coutts <duncan.coutts () worc ! ox ! ac ! uk>
Date:       2009-02-03 22:56:13
Message-ID: 1233701773.26754.590.camel () localhost

On Tue, 2009-02-03 at 11:03 -0600, John Goerzen wrote:

> Will there also be something to handle the UTF-16 BOM marker?  I'm not
> sure what the best API for that is, since it may or may not be present,
> but it should be considered -- and could perhaps help autodetect encoding.

I think someone else mentioned this already, but utf16 (as opposed to
utf16be/le) will use the BOM if it's present.

I'm not quite sure what happens when you switch encoding; presumably
it'll accept and consider a BOM at that point.
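
A minimal sketch of the BOM behaviour, assuming the `utf16` encoding and
`hSetEncoding` as exported from System.IO in released versions of base. The
helper name `roundTripUtf16` and the temp-file name are made up for
illustration. It writes a string with `utf16` and reads it back with the
same encoding; if the BOM is handled as described, the string comes back
without a stray U+FEFF at the front.

```haskell
import System.IO
import System.Directory (getTemporaryDirectory, removeFile)

-- Hypothetical helper: write a String through the generic utf16 encoding,
-- then read it back through utf16 and return what was decoded.
roundTripUtf16 :: String -> IO String
roundTripUtf16 s = do
  tmp <- getTemporaryDirectory
  (path, hOut) <- openTempFile tmp "bom-demo.txt"
  hSetEncoding hOut utf16          -- generic UTF-16, not utf16le/utf16be
  hPutStr hOut s
  hClose hOut

  hIn <- openFile path ReadMode
  hSetEncoding hIn utf16           -- decoder looks for a BOM at the start
  r <- hGetContents hIn
  length r `seq` hClose hIn        -- force the lazy read before closing
  removeFile path
  return r
```

In GHCi, `roundTripUtf16 "hello" >>= print` is enough to try it.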

> > Thanks to suggestions from Duncan Coutts, it's possible to call
> > hSetEncoding even on buffered read Handles, and the right thing
> > happens.  So we can read from text streams that include multiple
> > encodings, such as an HTTP response or email message, without having
> > to turn buffering off (though there is a penalty for switching
> > encodings on a buffered Handle, as the IO system has to do some
> > re-decoding to figure out where it should start reading from again).
> 
> Sounds useful, but is this the bit that causes the 30% performance hit?

No. You only pay that penalty if you switch encoding. The standard case
has no extra cost.
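
Roughly what switching encodings on a buffered read Handle looks like, as a
sketch rather than a description of the implementation. The function name
`readHeaderThenBody` and the two-line file layout are invented for the
example: one Latin-1 header line, then a UTF-8 body line. The second
`hGetLine` only decodes correctly if the IO system re-decodes the bytes
that were already buffered.

```haskell
import System.IO
import System.Directory (getTemporaryDirectory, removeFile)

-- Hypothetical demo: a Latin-1 header line followed by a UTF-8 body line,
-- read from a single buffered Handle with an encoding switch in between.
readHeaderThenBody :: IO (String, String)
readHeaderThenBody = do
  tmp <- getTemporaryDirectory
  -- Write raw bytes: in binary mode each Char below 256 becomes one byte.
  (path, hOut) <- openBinaryTempFile tmp "switch-demo.txt"
  hPutStr hOut "charset: utf-8\n"
  hPutStr hOut "caf\xC3\xA9\n"     -- 0xC3 0xA9 is U+00E9 encoded as UTF-8
  hClose hOut

  hIn <- openFile path ReadMode    -- file Handles are buffered by default
  hSetEncoding hIn latin1
  header <- hGetLine hIn
  hSetEncoding hIn utf8            -- switch mid-stream, buffering stays on
  body <- hGetLine hIn
  hClose hIn
  removeFile path
  return (header, body)
```

The re-decoding penalty applies only at the `hSetEncoding hIn utf8` call; a
Handle that keeps one encoding throughout never takes that path.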

> > Performance is about 30% slower on "hGetContents >>= putStr" than
> > before.  I've profiled it, and about 25% of this is in doing the
> > actual encoding/decoding, the rest is accounted for by the fact that
> > we're shuffling around 32-bit chars rather than bytes in the Handle
> > buffer, so there's not much we can do to improve this.
> 
> Does this mean that if we set the encoding to latin1, that we should see
> performance 5% worse than present?

No, I think that's 30% for latin1. The cost is not really the character
conversion but the copying from a byte buffer via iconv to a char
buffer.

> 30% slower is a big deal, especially since we're not all that speedy now.

Bear in mind that's talking about the [Char] interface, and nobody using
that is expecting great performance. We already have an API for getting
big chunks of bytes out of a Handle; with the new Handle we'll also want
something equivalent for a packed text representation. Hopefully we can
get something nice with the new text package.
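
For comparison, a sketch of the byte-chunk route. It assumes
Data.ByteString's `hGet`/`hPut` from the bytestring package as the chunk
API, which may or may not be the one meant above; `copyBytes` and the 64 KB
chunk size are illustrative choices. Nothing here goes through the Char
buffer, so the decoding and copying cost discussed in this thread does not
apply.

```haskell
import qualified Data.ByteString as B
import System.IO

-- Hypothetical helper: copy everything from one Handle to another in
-- fixed-size byte chunks and return the number of bytes copied.
copyBytes :: Handle -> Handle -> IO Int
copyBytes hIn hOut = go 0
  where
    chunkSize = 64 * 1024          -- arbitrary chunk size for the sketch
    go n = do
      chunk <- B.hGet hIn chunkSize
      if B.null chunk              -- an empty chunk signals end of input
        then return n
        else do
          B.hPut hOut chunk
          go $! n + B.length chunk -- keep the running total strict
```

Calling `copyBytes stdin stdout` gives a byte-level counterpart to
"hGetContents >>= putStr".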

Duncan

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe