List:       perl6-language
Subject:    Re: r27106 - docs/Perl6/Spec
From:       Aaron Crane <perl () aaroncrane ! co ! uk>
Date:       2009-06-18 10:00:31
Message-ID: 20090618100031.GB7675 () aaroncrane ! co ! uk

pugs-commits@feather.perl6.nl writes:
> +The C<utf8> type is derived from C<buf8>, with the additional constraint
> +that it may only contain validly encoded UTF-8.  Likewise, C<utf16> is
> +derived from C<buf16>, and C<utf32> from C<buf32>.

What does "validly encoded UTF-8" mean in this context?  The following
questions come to mind:

1.  Four-byte UTF-8 sequences are enough to handle any Unicode
    character.  Are the obvious five- and six-byte extensions
    permitted?  If so, how about a seven-byte extension (needed to
    allow any 32-bit value to be encoded)?

    Whichever sequence length is chosen, is there an additional
    constraint on the maximum permitted codepoint?  For example,
    four-byte UTF-8 sequences can easily represent values up to
    0x1f_ffff, but Unicode stops at 0x10_ffff.  Or if seven-byte
    sequences are permitted, are codepoints limited to 2**32-1?

2.  Are over-wide encoded sequences (0xC1 0x81 for U+0041, and so on)
    permitted?  (I hope not.)

3.  Are encoded codepoints corresponding to UTF-16 surrogates permitted?

4.  Are noncharacter codepoints (U+FFFE, U+FFFF, etc.) permitted?

5.  Are unallocated codepoints permitted?  If so, that doesn't seem
    very "valid"; but if not, a program's behaviour might change under
    a newer version of Unicode.  Perhaps programs should be given the
    opportunity to declare which Unicode version's list of allocated
    characters they want.

6.  Are values that begin with combining characters permitted?
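
For comparison, here is a minimal sketch of how one existing strict
decoder answers the questions above.  It uses Python 3's built-in
"utf-8" codec purely as a reference point; the helper name
is_strict_utf8 and the sample byte strings are my own illustration,
not anything from the spec, and other decoders may well choose
differently.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python 3's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample inputs, one per question in the list above.
samples = {
    "four-byte U+10FFFF": b"\xf4\x8f\xbf\xbf",
    "four-byte 0x110000 (beyond Unicode)": b"\xf4\x90\x80\x80",
    "five-byte extension (0x200000)": b"\xf8\x88\x80\x80\x80",
    "over-wide U+0041": b"\xc1\x81",
    "surrogate U+D800": b"\xed\xa0\x80",
    "noncharacter U+FFFF": b"\xef\xbf\xbf",
    "leading combining U+0301": b"\xcc\x81",
}

results = {name: is_strict_utf8(raw) for name, raw in samples.items()}

for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

That codec rejects the five-byte form, anything above 0x10FFFF,
over-wide sequences, and encoded surrogates, but accepts noncharacters
and a leading combining character.  Whether C<utf8> should draw the
line in the same place is exactly the open question.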

Of those, question (3) applies to UTF-32, and questions (4), (5), and
(6) to both UTF-16 and UTF-32.  Further, a variant of (1) applies to
UTF-32: are code units greater than 0x10FFFF permitted?
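
The UTF-32 variant of (1) can be sketched the same way.  Again this
only shows what Python 3's "utf-32-le" codec happens to do; the helper
name utf32_accepts is hypothetical.

```python
import struct


def utf32_accepts(code_unit: int) -> bool:
    """Return True if Python 3's UTF-32 decoder accepts a single code unit."""
    raw = struct.pack("<I", code_unit)  # one little-endian 32-bit code unit
    try:
        raw.decode("utf-32-le")
        return True
    except UnicodeDecodeError:
        return False


last_valid = utf32_accepts(0x10FFFF)     # highest Unicode codepoint
first_invalid = utf32_accepts(0x110000)  # one past the Unicode range

print(last_valid, first_invalid)
```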

I assume that the C<utf16> type forbids invalid surrogate sequences.
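
By "invalid surrogate sequences" I mean things like an unpaired high
or low surrogate.  As a rough illustration (Python 3's "utf-16-le"
codec again, with sample code units chosen by me), a well-formed pair
decodes while lone surrogates are refused:

```python
import struct


def utf16_accepts(*code_units: int) -> bool:
    """Return True if Python 3's UTF-16 decoder accepts the code units."""
    raw = struct.pack(f"<{len(code_units)}H", *code_units)
    try:
        raw.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


valid_pair = utf16_accepts(0xD83D, 0xDE00)     # surrogate pair for U+1F600
high_then_bmp = utf16_accepts(0xD800, 0x0041)  # high surrogate, then 'A'
lone_low = utf16_accepts(0xDC00, 0x0041)       # low surrogate with no high
truncated = utf16_accepts(0xD800)              # high surrogate at end of input

print(valid_pair, high_then_bmp, lone_low, truncated)
```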

I'm also tempted to suggest that the type names should be C<utf-8>,
C<utf-16>, C<utf-32>.

-- 
Aaron Crane ** http://aaroncrane.co.uk/