List:       perl5-porters
Subject:    Re: RFC: Processing Unicode non-characters and code points beyond Unicode's
From:       SADAHIRO Tomoyuki <bqw10602 () nifty ! com>
Date:       2010-11-30 14:38:49
Message-ID: 20101130233849.57C3.CB027F2D () nifty ! com

On Thu, 25 Nov 2010 10:23:17 -0700
karl williamson wrote:

> I think we have agreement here, but let me sum up to be sure.
> 
> 1) The current API will change (because it doesn't really have the 
> capability to do things properly) so that by default the internal utf8 
> encoding/decoding functions will allow non-character code points and 
> above-Unicode code points.  The default for surrogates will continue to 
> be that they are not allowed.  It will be possible to specify 
> disallowing non-characters and beyond-Unicode characters by appropriate 
> flags.  (Actually, the current API for utf8n_to_uvuni() always allows 
> above-Unicode code points; I would extend it to allow excluding these.) 
>   Existing macros that match subsets of the non-character code points 
> will be removed and replaced by a single macro with a new name that 
> matches all of them.

I don't object to defining a new flag macro that makes
utf8n_to_uvuni() disallow beyond-Unicode code points (uv >= 0x110000)
and, if necessary, changing the flags passed to utf8n_to_uvuni()
where it is called in the perl core.
However, I think removing any existing macro that has been in place
since perl 5.7.x or around 5.8.0 poses a backward compatibility problem.
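
To make the categories under discussion concrete, here is a minimal
sketch of how the three classes of code point can be told apart. The
helper names are hypothetical and are not taken from perl's utf8.h;
the ranges are the ones Unicode defines (surrogates U+D800..U+DFFF,
non-characters U+FDD0..U+FDEF plus every code point ending in FFFE or
FFFF, and "beyond Unicode" meaning uv >= 0x110000).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of the perl API. */

/* Above the Unicode range: anything past U+10FFFF. */
static bool cp_is_above_unicode(uint64_t uv)
{
    return uv >= 0x110000;
}

/* UTF-16 surrogate range. */
static bool cp_is_surrogate(uint64_t uv)
{
    return uv >= 0xD800 && uv <= 0xDFFF;
}

/* Unicode non-characters: U+FDD0..U+FDEF, and the last two code
 * points of each plane (low 16 bits equal to FFFE or FFFF). */
static bool cp_is_noncharacter(uint64_t uv)
{
    if (uv > 0x10FFFF)
        return false;               /* beyond Unicode is a separate class */
    if (uv >= 0xFDD0 && uv <= 0xFDEF)
        return true;
    return (uv & 0xFFFE) == 0xFFFE; /* matches xxFFFE and xxFFFF */
}
```

A decoder taking "disallow" flags would presumably consult checks of
this kind after computing uv and before returning it.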

Removing an existing macro means that any XS code using that macro
can no longer be built.
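
As a rough illustration of the compatibility concern, one conventional
way to introduce a single new flag without breaking callers is to keep
the old names as aliases. This is only a sketch: every macro name and
value below is hypothetical, not copied from utf8.h, and it is not
something proposed elsewhere in this thread.

```c
/* All names and values are hypothetical, for illustration only. */

/* New single flag covering every non-character code point. */
#define HYP_ALLOW_NONCHAR  0x0040u

/* Old subset names kept as aliases so existing XS code still compiles. */
#define HYP_ALLOW_FFFF     HYP_ALLOW_NONCHAR
#define HYP_ALLOW_FDD0     HYP_ALLOW_NONCHAR

/* Stand-in for legacy caller code that ORs the old flags together. */
static unsigned hyp_legacy_flags(void)
{
    return HYP_ALLOW_FFFF | HYP_ALLOW_FDD0;
}
```

With aliases like these, code written against the old names would keep
building, at the cost of the old names now allowing the whole set
rather than a subset.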

The API doc for utf8n_to_uvuni() in perl 5.12.2 (latest maint)
states (see http://perldoc.perl.org/perlapi.html#utf8n_to_uvuni )

     If s does not point to a well-formed UTF-8 character,
     the behaviour is dependent on the value of flags :
     [snip]
     The flags can also contain various flags to allow
     deviations from the strict UTF-8 encoding (see utf8.h).

     UV utf8n_to_uvuni(const U8 *s, STRLEN curlen,
                                    STRLEN *retlen, U32 flags)

This documentation therefore appears to permit perl users to pass the
macros defined in utf8.h as flags to utf8n_to_uvuni().

Regards,
SADAHIRO Tomoyuki
