List:       xmonad
Subject:    Re: [xmonad] spawn functions are not unicode safe
From:       Gwern Branwen <gwern0 () gmail ! com>
Date:       2009-01-15 17:04:11
Message-ID: cbf55b100901150904v39b159a8nbe485d05f0e25dfd () mail ! gmail ! com

On Thu, Jan 15, 2009 at 11:04 AM, Khudyakov Alexey wrote:
> On Thursday 15 January 2009 16:53:49 Roman Cheplyaka wrote:
>> RFC 3629 [1] states:
>>
>>    o  UTF-8 strings can be fairly reliably recognized as such by a
>>       simple algorithm, i.e., the probability that a string of
>>       characters in any other encoding appears as valid UTF-8 is low,
>>       diminishing with increasing string length.
>>
>> However, no references to the algorithm itself are given.
>>
>> Google brought me this sample algorithm [2].
>> It's probably worth implementing something like that and including it in
>> utf8-string if it's not already there.
>>
>>   1. http://www.ietf.org/rfc/rfc3629.txt
>>   2. http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html
>
> Something like this? (code below) The algorithm is trivial: check for
> impossible byte combinations. If there are no such bytes, pairs, etc., the
> byte sequence is probably a UTF-8 encoded string.
>
> But the problem is not with decoding Unicode strings, i.e. not with
> functions like
> fromUnicode :: [Word8] -> [Char]
> but with encoding strings. A Char represents a Unicode symbol, so
> everything is OK at that point. However, Unix system calls know nothing
> about Unicode and accept (char*), or [Word8] in Haskell terminology.
>
> The conversion from [Char] to [Word8] is the problem. It arises whenever
> Haskell needs to pass a string to the outside world. Currently each Char is
> simply truncated to one byte regardless of its value. That is why the
> `encode' function is needed. executeFile is not the only function affected.
>
>> import Control.Monad
>> import Data.Word
>> import Data.Bits
>> import Data.Maybe
>>
>> is11,is10,is0x :: Word8 -> Bool
>> is11 b = (b `shiftR` 6) == 3
>> is10 b = (b `shiftR` 6) == 2
>> is0x b = b > [...]
>>
>> -- Test whether a pair of bytes is allowed in a UTF-8 encoded string.
>> validPair :: Word8 -> Word8 -> Maybe Word8
>> validPair a b = if (b > [...] (is11 a && (not $ is10 b)))
>>                 then Just b
>>                 else Nothing
>>
>> -- Check whether a sequence of bytes is a UTF-8 encoded string. Note that
>> -- this check is probabilistic. If the function returns False, the string
>> -- is not UTF-8. If it returns True, the string may still fail to decode.
>> isUTF8 :: [Word8] -> Bool
>> isUTF8 = isJust . foldM validPair 0
>>
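
For reference, here is a minimal sketch of the kind of structural UTF-8 check
being discussed, using only base. It is not Alexey's code: the names
continuationCount, isContinuation and looksLikeUtf8 are made up for
illustration, and instead of testing adjacent pairs it counts the
continuation bytes that each lead byte requires. Like the quoted version it
is only a plausibility test; it does not reject every overlong form or
encoded surrogates.

```haskell
import Data.Bits (shiftR)
import Data.Word (Word8)

-- Number of continuation bytes expected after a lead byte, or Nothing
-- if the byte can never start a UTF-8 sequence.
continuationCount :: Word8 -> Maybe Int
continuationCount b
  | b < 0x80               = Just 0   -- 0xxxxxxx: plain ASCII
  | b >= 0xC2 && b <= 0xDF = Just 1   -- 110xxxxx (0xC0/0xC1 are always overlong)
  | b >= 0xE0 && b <= 0xEF = Just 2   -- 1110xxxx
  | b >= 0xF0 && b <= 0xF4 = Just 3   -- 11110xxx, up to U+10FFFF
  | otherwise              = Nothing  -- stray continuation or invalid lead byte

-- 10xxxxxx: a continuation byte.
isContinuation :: Word8 -> Bool
isContinuation b = b `shiftR` 6 == 2

-- True if the bytes have the shape of UTF-8. False means the input is
-- definitely not UTF-8; True means it probably is.
looksLikeUtf8 :: [Word8] -> Bool
looksLikeUtf8 []     = True
looksLikeUtf8 (b:bs) =
  case continuationCount b of
    Nothing -> False
    Just n  ->
      let (conts, rest) = splitAt n bs
      in length conts == n && all isContinuation conts && looksLikeUtf8 rest
```

For example, looksLikeUtf8 accepts the two bytes 0xD0 0x9F (Cyrillic "П") and
rejects a lone 0xE9 (Latin-1 "é"), which is the sort of mismatch the RFC
passage relies on.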

Perhaps we're over-thinking all this. Is it a problem in any way to
run encodeString over a String that is just normal ASCII (that is, no
funky Unicode)?

Eric: could we just mindlessly call encodeString on everything going
into spawn/safeSpawn?
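
To make the question concrete, here is a rough sketch assuming encodeString
behaves like a plain UTF-8 encoder. encodeUtf8 below is a hand-rolled
stand-in written against base only, not the utf8-string function itself, and
truncateToBytes is a hypothetical name modelling the "truncate each Char to
one byte" behaviour Alexey describes. UTF-8 encodes code points below 0x80 as
themselves, so on pure ASCII the two should agree.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hand-rolled UTF-8 encoder (stand-in for encodeString; an assumption,
-- not the library's code). Surrogate code points are not treated specially.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (encodeChar . ord)
  where
    encodeChar :: Int -> [Word8]
    encodeChar c
      | c < 0x80    = [fromIntegral c]                       -- ASCII: unchanged
      | c < 0x800   = [0xC0 .|. hi 6, cont 0]
      | c < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        -- Leading payload bits, shifted down to the low end of the byte.
        hi n   = fromIntegral (c `shiftR` n)
        -- Continuation byte: 10xxxxxx carrying six payload bits.
        cont n = 0x80 .|. (fromIntegral (c `shiftR` n) .&. 0x3F)

-- Model of the current behaviour: keep only the low byte of each Char.
truncateToBytes :: String -> [Word8]
truncateToBytes = map (fromIntegral . ord)
```

On a command line such as "xterm -e mutt" both functions produce the same
bytes, so under that assumption encoding unconditionally is a no-op for
ASCII. They differ only for characters at U+0080 and above, where truncation
loses information.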

--
gwern