
List:       xerces-c-dev
Subject:    Re: How do I use Xerces strings?
From:       David Bertoni <dbertoni@apache.org>
Date:       2006-03-09 19:42:35
Message-ID: 4410852B.3020702@apache.org

Steven T. Hatton wrote:
> On Thursday 09 March 2006 12:08, David Bertoni wrote:
> 
>> I don't see how you can get this from the standard.  There is only one
>> mention of Unicode, and UTF-16 does not appear anywhere.  The only thing
>> I see is a statement about ISO/IEC 10646 and the
>> universal-character-name construct.
> <quote 
> url="http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-5DFED1F0">
> The UTF-16 encoding was chosen because of its widespread industry practice. 
> Note that for both HTML and XML, the document character set (and therefore 
> the notation of numeric character references) is based on UCS [ISO/IEC 
> 10646]. A single numeric character reference in a source document may 
> therefore in some cases correspond to two 16-bit units in a DOMString (a high 
> surrogate and a low surrogate).
> </quote>
> 
> <quote url="http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets">
> [Definition: A character is an atomic unit of text as specified by ISO/IEC 
> 10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, 
> and the legal characters of Unicode and ISO/IEC 10646. The versions of these 
> standards cited in A.1 Normative References were current at the time this 
> document was prepared. New characters may be added to these standards by 
> amendments or new editions. Consequently, XML processors MUST accept any 
> character in the range specified for Char.]
> </quote>
> 
> It's not my fault! ;)
> 
>>>> XMLCh is defined to hold UTF-16 code units, which is a much stricter
>>>> requirement than anything the C++ standard says about character sets.
>>> The C++ Standard only specifies character sets.  It does not specify
>>> encodings.
>> That's exactly my point.  And that's why you can't assume that char is
>> encoded in ASCII and wchar_t is encoded in UTF-16.  However, Xerces-C
>> guarantees that XMLCh will contain UTF-16 code units.
> 
> After further investigation and reflection I have come to the conclusion that 
> you're damned if you do, and damned if you don't.  You could convert all your 
> data to the implementation's character encoding when it's read in, and do the 
> reverse when it is stored or transmitted.  Under some circumstances UTF-32 
> might provide some performance advantages.  It would certainly make your 
> data compatible with the facilities provided by the Standard Library.
> 

I'm not sure what the performance advantages of UTF-32 would be over 
UTF-16, unless you are referring to the handling of surrogate pairs.  I 
would imagine that the disadvantage of up to 16 bits of wasted storage 
overhead would overwhelm the advantage gained from avoiding surrogates.
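
To illustrate the surrogate handling, here is a rough sketch of counting
code points in a UTF-16 XMLCh string (the helper name is mine, and it
assumes well-formed input):

#include <cstddef>
#include <xercesc/util/XercesDefs.hpp>

// Count Unicode code points in a null-terminated UTF-16 string.
// The surrogate test inside the loop is the per-unit overhead that
// UTF-32 avoids, since there code points and code units coincide.
std::size_t countCodePoints(const XMLCh* s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; s[i] != 0; ++i)
    {
        // A high surrogate (0xD800-0xDBFF) starts a two-unit pair;
        // skip its trailing low surrogate so the pair counts once.
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && s[i + 1] != 0)
        {
            ++i;
        }
        ++count;
    }
    return count;
}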

Also, can you explain why you believe UTF-32 would provide better 
compatibility with the facilities provided by the standard library?  On 
Windows, this is certainly not the case.  It might provide better 
compatibility on some platforms operating with a locale that encodes 
wchar_t in UTF-32, but that's not very portable.
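
As a quick illustration of that variation, this little program is all it
takes (the output depends entirely on the compiler and target):

#include <iostream>

// wchar_t is 2 bytes on Windows (UTF-16 code units) but typically
// 4 bytes on Linux and many other systems, where the wide encoding
// is often UTF-32.  Code that assumes one encoding isn't portable.
int main()
{
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
    return 0;
}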

> I suspect most Xerces-derived applications will need to do some kind of 
> encoding conversion on I/O.  I know I don't want UTF-16 data stored in files 
> I am likely to want to edit, or otherwise manipulate outside of Xerces.  If 
> everybody played nicely with UTF-16 that would be a different story.
>

Representing the full range of Unicode characters is difficult, no 
matter how you encode them.  I know lots of applications that play 
nicely with UTF-16.  In other cases, UTF-8 is a better choice, since it 
maintains better compatibility with applications that expect ASCII data.

Some applications that use Xerces-C to parse XML files eventually 
re-serialize the data to some encoding they prefer, perhaps even to the 
original encoding.
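
If all you need is to move UTF-16 data into and out of the local code
page, XMLString::transcode() covers that.  A minimal sketch (error
handling omitted):

#include <iostream>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>

using namespace xercesc;

int main()
{
    XMLPlatformUtils::Initialize();

    // Build a UTF-16 string from a local-code-page literal...
    XMLCh* utf16 = XMLString::transcode("hello, world");

    // ...and convert it back for output.  Note that transcode() goes
    // through the local code page, so characters it cannot represent
    // are lost; for a specific target encoding such as UTF-8 you
    // would use a transcoder instead.
    char* local = XMLString::transcode(utf16);
    std::cout << local << std::endl;

    XMLString::release(&local);
    XMLString::release(&utf16);

    XMLPlatformUtils::Terminate();
    return 0;
}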

>>> No.  That is exactly what I am not assuming.  The example I show above
>>> will use whatever encoding my implementation uses for the characters
>>> assigned to the XMLCh constants.  As long as my implementation supports
>>> the character set specified in UTF-16 (actually UCS-2) Xerces should work
>>> using those assignments.
>> Yes, but that's not very portable.  Perhaps you don't support platforms
>> that do not meet this requirement, but Xerces-C does.  By the way, UCS-2
>> support is not good enough for Xerces-C, because XML documents can
>> contain Unicode characters outside the BMP, which are represented as
>> surrogate pairs.
> 
> Yes, I see that now.  I believe a conforming C++ implementation is required 
> to do the same (for the locales it supports).
> 
>> Why would Xerces-C choose an integral type that's larger than 16 bits
>> for its UTF-16 character integral?  If wchar_t is a 32-bit integral,
>> then half of all storage allocated for a UTF-16 string would be wasted.
> 
> Agreed.
> 
>>     Also, Unicode conformance requires that UTF-16 strings use 16-bit
>> code units.
> 
> Well, all that UTF-16 support actually requires is that it's UTF-16 going in, 
> and UTF-16 coming out.

I'm not sure what you mean by this, but my reading of the Unicode 
standard says that UTF-16 sequences are composed of UTF-16 code units, 
and a UTF-16 code unit is defined as a 16-bit unit of storage.  So it 
would not be conformant to use a 32-bit unit of storage for a UTF-16 
code unit in the APIs.
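
As a concrete example, U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the
BMP, so it must be encoded as the pair 0xD834 0xDD1E.  A sketch of the
arithmetic (the helper name is mine):

#include <cstdio>
#include <xercesc/util/XercesDefs.hpp>

// Encode a code point above the BMP as a UTF-16 surrogate pair.
// Each half must fit in a 16-bit code unit; using a 32-bit storage
// unit per code unit would not be conformant UTF-16.
void encodeSupplementary(unsigned long cp, XMLCh pair[2])
{
    const unsigned long v = cp - 0x10000UL;              // 20 significant bits
    pair[0] = static_cast<XMLCh>(0xD800 + (v >> 10));    // high surrogate
    pair[1] = static_cast<XMLCh>(0xDC00 + (v & 0x3FF));  // low surrogate
}

int main()
{
    XMLCh pair[2];
    encodeSupplementary(0x1D11EUL, pair);
    std::printf("U+1D11E -> 0x%04X 0x%04X\n",
                static_cast<unsigned>(pair[0]),
                static_cast<unsigned>(pair[1]));
    return 0;
}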

> 
>> In addition, users would assume they could call the wide character
>> string system functions and expect reasonable results.  That wouldn't
>> happen if the system and/or current locale didn't support UTF-16.
> 
> Well, you could use UTF-32 internally, but that puts us back to the 
> subject of 50% unused primary storage.  One option might be to have my C++ 
> implementation (GCC) explicitly support UTF-16, and then have Xerces compile 
> with a flag to use it.
> 

You can certainly do that, as long as you can limit the platforms you 
support to those where you can rely on the compiler and run-time library 
to support UTF-16.
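
One way to guard such a build is a compile-time assertion trick (the
typedef name is mine); it fails to compile unless the configured
character type really occupies 16 bits:

#include <climits>
#include <xercesc/util/XercesDefs.hpp>

// The array size is -1 (a compile error) unless XMLCh occupies
// exactly 16 bits on this platform.
typedef char XMLChMustBe16Bits[(sizeof(XMLCh) * CHAR_BIT == 16) ? 1 : -1];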

> C++ is, in many ways, a better language than Java.  UTF support is not one of 
> them.  Yes!  I'm frustrated!
> 

I agree.  It would be very helpful if the next C++ standard defined:

1. A unique 16-bit integral for UTF-16 code units.
2. Support in the library for std::basic_string instantiated with that type.
3. Some lexical construct at the source code level for character 
literals and character string literals that produce characters and 
strings encoded in UTF-16.
4. Run-time library support for arrays of this type, providing full 
support for Unicode.
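
To sketch what items 1 through 3 might look like at the source level
(the char16_t type and the u prefix here are hypothetical, not anything
in the current standard):

#include <string>

typedef std::basic_string<char16_t> u16string;     // item 2

void example()
{
    char16_t  c = u'x';             // item 1: a distinct 16-bit type
    u16string s = u"UTF-16 text";   // item 3: a UTF-16 string literal
}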

But I suspect that's just a dream.

Dave

