List:       xerces-c-dev
Subject:    Re: How do I use Xerces strings?
From:       David Bertoni <dbertoni () apache ! org>
Date:       2006-03-09 17:08:33
Message-ID: 44106111.1000501 () apache ! org

Steven T. Hatton wrote:
> On Thursday 09 March 2006 01:22, David Bertoni wrote:
> 
>> That would require that C++ define some integral character type that is
>> encoded in UTF-16.  It's unlikely that every compiler vendor would agree
>> to do that, although it would certainly make implementing software that
>> supports Unicode much easier.
> 
> After looking at things more closely, the Standard does - in its typical 
> lawyerly language - require that an implementation behave 'as if' it 
> supported UTF-16 for all the locales it supports.
> 

I don't see how you can get this from the standard.  There is only one 
mention of Unicode, and UTF-16 does not appear anywhere.  The only thing 
I see is a statement about ISO/IEC 10646 and the 
universal-character-name construct.

>> XMLCh is defined to hold UTF-16 code units, which is a much stricter
>> requirement than anything the C++ standard says about character sets.
> 
> The C++ Standard only specifies character sets.  It does not specify 
> encodings.
> 

That's exactly my point.  And that's why you can't assume that char is 
encoded in ASCII and wchar_t is encoded in UTF-16.  However, Xerces-C 
guarantees that XMLCh will contain UTF-16 code units.

>>> In order to implement the C++ extended character set, members
>>> of the C++ basic character set (ASCII character set) should be defined as
>>> wchar_t using their wide character literals.  That is, for example:
>>>
>>> typedef wchar_t XMLCh;
>>>
>>> const XMLCh chLatin_A               = L'A';
>>> const XMLCh chLatin_B               = L'B';
>>> const XMLCh chLatin_C               = L'C';
>>> const XMLCh chLatin_D               = L'D';
>>>
>>> Rather than:
>>>
>>> typedef unsigned short XMLCh;
>>>
>>> const XMLCh chLatin_A               = 0x41;
>>> const XMLCh chLatin_B               = 0x42;
>>> const XMLCh chLatin_C               = 0x43;
>>> const XMLCh chLatin_D               = 0x44;
>> You are making the assumption that the basic character set must be
>> encoded in ASCII, but the C++ standard makes no such requirement.
> 
> No.  That is exactly what I am not assuming.  The example I show above will 
> use whatever encoding my implementation uses for the characters assigned to 
> the XMLCh constants.  As long as my implementation supports the character set 
> specified in UTF-16 (actually UCS-2), Xerces should work using those 
> assignments.
>

Yes, but that's not very portable.  Perhaps you don't support platforms 
that do not meet this requirement, but Xerces-C does.  By the way, UCS-2 
support is not good enough for Xerces-C, because XML documents can 
contain Unicode characters outside the BMP, which are represented as 
surrogate pairs.

>>> There may be reasons the Xerces developers chose to implement UTF-16
>>> without conforming to the requirements for implementing the C++ extended
>>> character set.  I guess, technically speaking, the encoding of UTF-16 and
>>> the extended character set will not, in general, coincide.
>> I'm not sure I understand what you're saying.  Xerces-C encodes
>> character data in UTF-16, and to do that, it uses a 16-bit integral. It
>> cannot use wchar_t to hold UTF-16 code units, because there is no
>> guarantee that a particular C++ implementation will encode wchar_t in
>> UTF-16.  In fact, there is no requirement that wchar_t even be a 16-bit
>> integral.
> 
> It must be wide enough to encode all the UTF-16 characters of the extended 
> character sets required by the implementation's supported locales.  wchar_t 
> shall have the same size, signedness and alignment requirements as one of the 
> other integral data types.  Can you give an example of a C++ implementation 
> that doesn't use a 16 bit (or larger) data type for wchar_t?

Why would Xerces-C choose an integral type that's larger than 16 bits 
for its UTF-16 character integral?  If wchar_t is a 32-bit integral, 
then half of all storage allocated for a UTF-16 string would be wasted. 
Also, Unicode conformance requires that UTF-16 strings use 16-bit code 
units.

In addition, users would assume they could call the wide character 
string system functions and expect reasonable results.  That wouldn't 
happen if the system and/or current locale didn't support UTF-16.

Dave

---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org
