MirOS OPTU encoding (was Re: Unicode PUA mapping F000..F7FF and


List:       miros-discuss
Subject:    MirOS OPTU encoding (was Re: Unicode PUA mapping F000..F7FF and
From:       Thorsten Glaser <tg () mirbsd ! de>
Date:       2008-08-03 21:18:29
Message-ID: Pine.BSM.4.64L.0808032103450.27825 () herc ! mirbsd ! org

Dixi:

>This is a request for both LANANA and the ConScript Unicode Registry to
>assign the two ranges to a purpose outlined below.

LANANA might want to include CSUR's ranges in their list in general,
as the CSUR has already included the ranges used by Linux.

>I need a range of 128 codepoints for an application defined below

This was (for reference, so you don't need to grab the older mails)
allocated at U+EF80‥U+EFFF.

>This character set shall be encoded in two different ways. One of these
>is called <xxx>-16 (where <xxx> is the name of the encoding group, which
>we have not yet decided upon, but we will inform you once this is done),

The names are now OPTU-8 and OPTU-16 (MirOS OPTU = Octet Pass-Through
encoding for Unicode).
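
As a rough illustration of the pass-through idea, the sketch below maps
an octet that cannot be decoded onto the allocated range and back. The
one-to-one mapping of octets 0x80..0xFF onto U+EF80..U+EFFF and all
helper names are assumptions made for this example; they are not taken
from the MirOS sources.

```c
#include <assert.h>
#include <stdint.h>

#define OPTU_RAW_FIRST	0xEF80u	/* first codepoint of the allocated range */
#define OPTU_RAW_LAST	0xEFFFu	/* last codepoint of the allocated range */

/* hypothetical helper: map an undecodable octet (0x80..0xFF) into the range */
static uint16_t
raw_octet_to_wide(uint8_t octet)
{
	assert(octet >= 0x80u);
	return ((uint16_t)(OPTU_RAW_FIRST + (octet - 0x80u)));
}

/* hypothetical helper: does this 16-bit unit stand for a raw octet? */
static int
wide_is_raw_octet(uint16_t wc)
{
	return (wc >= OPTU_RAW_FIRST && wc <= OPTU_RAW_LAST);
}

/* hypothetical helper: recover the original octet from the range */
static uint8_t
wide_to_raw_octet(uint16_t wc)
{
	assert(wide_is_raw_octet(wc));
	return ((uint8_t)(wc - OPTU_RAW_FIRST + 0x80u));
}
```

Under these assumptions, a stray 0xC3 octet would travel as U+EFC3 and
come back out as 0xC3, which is what makes the round trip lossless.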

>To avoid wcrtomb(3) to throw EILSEQ

Since wcrtomb(3) has other error conditions as well, I decided we can
throw EILSEQ on 0xFFFE and 0xFFFF input; besides, these are, AFAIK,
not valid Unicode codepoints anyway.
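
A minimal sketch of such a check, assuming a 16-bit input value; the
function name and its int return convention are made up for this
example (the real wcrtomb(3) reports errors as (size_t)-1):

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical pre-check, not the MirOS code: returns 0 if the
 * value may be converted, or -1 with errno set to EILSEQ.
 */
static int
check_wide_input(uint16_t wc)
{
	if (wc == 0xFFFEu || wc == 0xFFFFu) {
		errno = EILSEQ;
		return (-1);
	}
	return (0);
}
```

Every other 16-bit value passes this check; errno is only touched on
the two rejected inputs.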

> […] *and* if we can
>codify this behaviour in wcrtomb(3) and wcsrtombs(3) successfully.

There are now two new functions:
‣ http://cvs.mirbsd.de/src/kern/c/optu16to8.c
‣ http://cvs.mirbsd.de/src/kern/c/optu8to16.c

optu16to8 is equivalent to wcrtomb(3) in MirOS, as we have only one
locale and the function syntax and semantics are 100% the same if a
locale such as "en_US.OPTU-8" is set. (In MirOS, we pretend to use
"en_US.UTF-8" instead, to benefit application support.)

optu8to16 has slightly different semantics, which are outlined in
http://www.mirbsd.org/man3/optu8to16 in the STANDARDS section; our
mbrtowc(3) has been converted to be implemented "on top" of it:
‣ http://cvs.mirbsd.de/src/lib/libc/i18n/mbrtowc.c
The changes were basically to ignore the ‘n’ argument if ‘s’ is
NULL and, if the function returns 0, to check whether ‘*pwc’ is
L'\0'; if it is not, to throw EILSEQ.
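
The two changes can be sketched roughly as follows. This is not the
code from mbrtowc.c: the converter is passed in as a function pointer
whose argument list is assumed to match mbrtowc(3), and the two stub
converters exist only so the sketch can be exercised without optu8to16.

```c
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

/* assumed converter shape: same arguments as mbrtowc(3) */
typedef size_t (*conv_fn)(wchar_t *, const char *, size_t, mbstate_t *);

/* hypothetical wrapper applying the two changes described above */
static size_t
wrap_mbrtowc(conv_fn conv, wchar_t *pwc, const char *s, size_t n,
    mbstate_t *ps)
{
	wchar_t wc = L'\0';
	size_t rv;

	if (s == NULL) {
		/* 'n' is ignored: act like a call with s = "" and n = 1 */
		s = "";
		n = 1;
		pwc = NULL;
	}
	rv = conv(&wc, s, n, ps);
	if (rv == 0 && wc != L'\0') {
		/* 0 without a real NUL: the converter rejected the input */
		errno = EILSEQ;
		return ((size_t)-1);
	}
	if (pwc != NULL && rv != (size_t)-1 && rv != (size_t)-2)
		*pwc = wc;
	return (rv);
}

/* stand-in converter: reports a genuine NUL character */
static size_t
stub_nul(wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
	(void)s; (void)n; (void)ps;
	*pwc = L'\0';
	return (0);
}

/* stand-in converter: returns 0 but stores a non-NUL value */
static size_t
stub_reject(wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
	(void)s; (void)n; (void)ps;
	*pwc = L'?';
	return (0);
}
```

With stub_nul the wrapper returns 0 and stores L'\0'; with stub_reject
it returns (size_t)-1 and sets errno to EILSEQ.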

I had to draw a state diagram, and come up with the idea of rejecting
input (via return (0);), to make sure our 14 bits of mbstate_t are
still enough. They are split into 2+12 bits, and the direction (wctomb
or mbtowc) is not encoded. The wctomb case needed no changes at all;
the mbtowc case needs the previously unused 11 combination in the
2-bit field (00, 01 and 10 were already used) plus 8 bits of the lower
field if that is set. So yes, no ABI change is needed. Our libc
functions like fgetwc have started to be converted to use optu8to16
instead of mbrtowc now (mbsrtowcs will require some more design work).
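
To make the 2+12 split concrete, here is a small packing sketch. It
assumes the 2-bit field sits directly above the 12-bit lower field in
one 16-bit word; the real layout of MirOS mbstate_t and the meaning of
each 2-bit value are not given in this mail, so all names are
illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define STATE_TAG_BITS	2
#define STATE_LOW_BITS	12
#define STATE_TAG_MASK	((1u << STATE_TAG_BITS) - 1u)	/* 0x3 */
#define STATE_LOW_MASK	((1u << STATE_LOW_BITS) - 1u)	/* 0x0FFF */

/* pack a 2-bit tag and a 12-bit payload into 14 bits */
static uint16_t
state_pack(unsigned tag, unsigned low)
{
	assert(tag <= STATE_TAG_MASK);
	assert(low <= STATE_LOW_MASK);
	return ((uint16_t)((tag << STATE_LOW_BITS) | low));
}

/* extract the 2-bit field */
static unsigned
state_tag(uint16_t st)
{
	return ((st >> STATE_LOW_BITS) & STATE_TAG_MASK);
}

/* extract the 12-bit lower field */
static unsigned
state_low(uint16_t st)
{
	return (st & STATE_LOW_MASK);
}
```

In this layout the 11 combination with an 8-bit payload, for example
state_pack(3, 0xC3), still fits into 14 bits, which matches the point
that no wider state is required.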

Thanks for your feedback, especially the hints on sparse arrays
(I had thought of an array of ranges instead) and what all can NOT
be done with UTF-16 (and, as such, CESU-8, OPTU-8 and OPTU-16).

bye,
//mirabilos
-- 
Sometimes they [people] care too much: pretty printers [and syntax highligh-
ting, d.A.] mechanically produce pretty output that accentuates irrelevant
detail in the program, which is as sensible as putting all the prepositions
in English text in bold font.	-- Rob Pike in "Notes on Programming in C"
