'RE: UTF-8 Macros (Was: number of chars in a UTF-8 string'


List:       icu
Subject:    RE: UTF-8 Macros (Was: number of chars in a UTF-8 string
From:       "Carl W. Brown" <cbrown () xnetinc ! com>
Date:       2001-09-29 0:14:23

Markus,

> -----Original Message-----
> From: icu-admin@www-124.southbury.usf.ibm.com
> [mailto:icu-admin@www-124.southbury.usf.ibm.com]On Behalf Of Markus
> Scherer
> Sent: Friday, September 28, 2001 11:35 AM
> To: icu list
> Subject: Re: UTF-8 Macros (Was: number of chars in a UTF-8 string
>
>
> "Carl W. Brown" wrote:
> > What is the minimum-length check?
>
> It's the check that, for example, U+0000 is not encoded as two
> non-zero bytes (c0 80).
> Unicode 3.0.1 and up requires this check.

Take UTF8_FWD_1_SAFE() as an example.  If it did a non-shortest-form check,
what would happen?  It could not move 0 bytes, because then our character
counter would loop forever.  It cannot move 1 byte, because that would place
you in the middle of a character.  It would move 2 bytes, which is exactly
what it would do if it did not check for the shortest form.
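
To make the point concrete, here is a minimal sketch in C.  It is not the
ICU macro; fwd_1() and is_trail() are hypothetical helpers written only for
this illustration.  With or without the shortest-form check, the index
moves past both bytes of c0 80; the check can only set a flag.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an ICU macro): true for a trail byte 10xxxxxx. */
static int is_trail(uint8_t b) { return (b & 0xC0) == 0x80; }

/*
 * Hypothetical forward-by-one step over s[i..len), with i < len.
 * Returns the index just past the sequence that starts at s[i].
 * If check_shortest is nonzero, *non_shortest is set to 1 when the
 * sequence is an overlong (non-shortest) form; otherwise it is 0.
 * The returned index does not depend on check_shortest.
 */
static size_t fwd_1(const uint8_t *s, size_t i, size_t len,
                    int check_shortest, int *non_shortest)
{
    uint8_t lead = s[i++];
    int trails;

    /* Number of trail bytes implied by the lead byte. */
    if (lead < 0xC0)      trails = 0;  /* ASCII, or a stray trail byte */
    else if (lead < 0xE0) trails = 1;
    else if (lead < 0xF0) trails = 2;
    else                  trails = 3;

    *non_shortest = 0;
    if (check_shortest) {
        if (lead == 0xC0 || lead == 0xC1) {
            *non_shortest = 1;         /* overlong 2-byte form, e.g. c0 80 */
        } else if (i < len && is_trail(s[i])) {
            if (lead == 0xE0 && s[i] < 0xA0) *non_shortest = 1;
            if (lead == 0xF0 && s[i] < 0x90) *non_shortest = 1;
        }
    }

    /* Skip the trail bytes that are actually present. */
    while (trails > 0 && i < len && is_trail(s[i])) {
        ++i;
        --trails;
    }
    return i;
}
```

For the bytes c0 80 41, fwd_1() called at index 0 returns 2 in both modes;
only the flag differs.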

>
> > I avoid the SAFE macros because they are so slooooooooooow.
>
> Only for UTF-8, and only because that is sooo complex.
>
> > I can always
> > live with the minor hit of adding a zero index to my pointer, but the safe
> > routines do so much checking that I do not need.
>
> Different people make different choices; again, I wish at this
> point that I had not added the UTF-8/32 macros.
> ICU uses UTF-16, which is efficient.
>
> I don't care too much if one could make different assumptions for
> UTF-8 that would allow faster UTF-8 macros with still a remnant
> of error checking.
> These are truly helpers, and most ICU users won't need them,
> especially now with the new string transformation functions.
>
> > If you keep the data in UTF-8 I would presume that
> > you need a full set of UTF-8 services.
>
> If you keep the data in UTF-8 and process it as such, then you
> are not using ICU, but are using some other library like (g)libc
> with its services.

I support processing done in UTF-8 and use ICU services where appropriate.
For example, if you are assembling text from various UTF-8 sources, it is a
mess to convert everything to UTF-16 and then back.  If you need an
occasional collation, then you need ICU services for two reasons: one, it is
platform independent, and two, it is thread safe.  Most locale implementations
are not thread safe.

> Not for ICU 2.0, but how would people feel about deprecating the UTF-8/32
> macros?

I use UTF8_IS_TRAIL quite a bit.
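
As an illustration of the kind of use I mean, here is a minimal sketch in C
that counts the characters in a UTF-8 buffer by counting the bytes that are
not trail bytes.  It assumes the input is well-formed UTF-8.  The helper
is_trail_byte() is a hypothetical stand-in for UTF8_IS_TRAIL so that the
example compiles without the ICU headers, and count_chars() is likewise a
name made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UTF8_IS_TRAIL: true for bytes 10xxxxxx. */
static int is_trail_byte(uint8_t b) { return (b & 0xC0) == 0x80; }

/*
 * Count characters in s[0..len), assuming well-formed UTF-8.
 * Every character contributes exactly one non-trail byte (its single
 * byte or its lead byte), so counting non-trail bytes is enough.
 */
static size_t count_chars(const uint8_t *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (!is_trail_byte(s[i])) {
            ++n;
        }
    }
    return n;
}
```

No decoding and no per-sequence checking is needed for this, which is why
the unchecked test is cheap.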

Carl

