List: xerces-c-dev
Subject: Re: How do I use Xerces strings?
From: David Bertoni <dbertoni@apache.org>
Date: 2006-03-09 17:08:33
Message-ID: 44106111.1000501@apache.org
Steven T. Hatton wrote:
> On Thursday 09 March 2006 01:22, David Bertoni wrote:
>
>> That would require that C++ define some integral character type that is
>> encoded in UTF-16. It's unlikely that every compiler vendor would agree
>> to do that, although it would certainly make implementing software that
>> supports Unicode much easier.
>
> After looking at things more closely, the Standard does - in its typical
> lawyerly language - require that an implementation behave 'as if' it
> supported UTF-16 for all the locales it supports.
>
I don't see how you can get this from the standard. There is only one
mention of Unicode, and UTF-16 does not appear anywhere. The only thing
I see is a statement about ISO/IEC 10646 and the
universal-character-name construct.
>> XMLCh is defined to hold UTF-16 code units, which is a much stricter
>> requirement than anything the C++ standard says about character sets.
>
> The C++ Standard only specifies character sets. It does not specify
> encodings.
>
That's exactly my point. And that's why you can't assume that char is
encoded in ASCII and wchar_t is encoded in UTF-16. However, Xerces-C
guarantees that XMLCh will contain UTF-16 code units.
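A minimal sketch of that guarantee (illustrative only, not Xerces-C's actual header): a fixed-width standard integral carries UTF-16 code-unit values that are defined by Unicode, not by the platform's execution character set.

```cpp
#include <cstdint>

// Illustrative only: a 16-bit code-unit type independent of the
// platform's wchar_t. Xerces-C guarantees that XMLCh holds UTF-16
// code units; std::uint16_t is one way to get a fixed 16-bit integral.
typedef std::uint16_t XMLCh;

// Hypothetical helper: the values produced are Unicode code points,
// fixed by UTF-16, regardless of whether the platform's execution
// character set is ASCII, EBCDIC, or anything else.
inline XMLCh latinCapital(int index) {  // 0 -> 'A', 1 -> 'B', ...
    return static_cast<XMLCh>(0x41 + index);
}
```

On an ASCII platform `latinCapital(0)` happens to equal `(XMLCh)'A'`, but on a platform with a different execution character set only the fixed-value form remains correct.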
>>> In order to implement the C++ extended character set, members
>>> of the C++ basic character set (ASCII character set) should be defined as
>>> wchar_t using their wide character literals. That is, for example:
>>>
>>> typedef wchar_t XMLCh;
>>>
>>> const XMLCh chLatin_A = L'A';
>>> const XMLCh chLatin_B = L'B';
>>> const XMLCh chLatin_C = L'C';
>>> const XMLCh chLatin_D = L'D';
>>>
>>> Rather than:
>>>
>>> typedef unsigned short XMLCh;
>>>
>>> const XMLCh chLatin_A = 0x41;
>>> const XMLCh chLatin_B = 0x42;
>>> const XMLCh chLatin_C = 0x43;
>>> const XMLCh chLatin_D = 0x44;
>> You are making the assumption that the basic character set must be
>> encoded in ASCII, but the C++ standard makes no such requirement.
>
> No. That is exactly what I am not assuming. The example I show above will
> use whatever encoding my implementation uses for the characters assigned to
> the XMLCh constants. As long as my implementation supports the character set
> specified by UTF-16 (actually UCS-2), Xerces should work using those
> assignments.
>
Yes, but that's not very portable. Perhaps you don't support platforms
that do not meet this requirement, but Xerces-C does. By the way, UCS-2
support is not good enough for Xerces-C, because XML documents can
contain Unicode characters outside the BMP, which are represented as
surrogate pairs.
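The surrogate-pair representation mentioned above can be sketched with the standard UTF-16 encoding algorithm (a hypothetical helper for illustration, not a Xerces-C API):

```cpp
#include <cstdint>
#include <utility>

typedef std::uint16_t XMLCh;  // assumed 16-bit UTF-16 code unit

// Encode a code point above U+FFFF as a UTF-16 surrogate pair,
// following the algorithm in the Unicode standard: subtract 0x10000,
// then split the resulting 20-bit value across two code units.
std::pair<XMLCh, XMLCh> toSurrogatePair(std::uint32_t codePoint) {
    const std::uint32_t v = codePoint - 0x10000;                  // 20-bit value
    const XMLCh high = static_cast<XMLCh>(0xD800 | (v >> 10));    // top 10 bits
    const XMLCh low  = static_cast<XMLCh>(0xDC00 | (v & 0x3FF));  // low 10 bits
    return std::make_pair(high, low);
}
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) encodes as the pair D834 DD1E, which is exactly why a UCS-2-only implementation, limited to one code unit per character, cannot represent it.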
>>> There may be reasons the Xerces developers chose to implement UTF-16
>>> without conforming to the requirements for implementing the C++ extended
>>> character set. I guess, technically speaking, the encoding of UTF-16 and
>>> the extended character set will not, in general, coincide.
>> I'm not sure I understand what you're saying. Xerces-C encodes
>> character data in UTF-16, and to do that, it uses a 16-bit integral. It
>> cannot use wchar_t to hold UTF-16 code units, because there is no
>> guarantee that a particular C++ implementation will encode wchar_t in
>> UTF-16. In fact, there is no requirement that wchar_t even be a 16-bit
>> integral.
>
> It must be wide enough to encode all the UTF-16 characters of the extended
> character sets required by the implementation's supported locales. wchar_t
> shall have the same size, signedness, and alignment requirements as one of
> the other integral data types. Can you give an example of a C++
> implementation that doesn't use a 16-bit (or larger) data type for wchar_t?
Why would Xerces-C choose an integral type larger than 16 bits for its
UTF-16 code units? If wchar_t is a 32-bit integral, then half of all
storage allocated for a UTF-16 string would be wasted. Also, Unicode
conformance requires that UTF-16 strings use 16-bit code units.
In addition, users would assume they could call the wide character
string system functions and expect reasonable results. That wouldn't
happen if the system and/or current locale didn't support UTF-16.
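The storage point above can be made concrete with a quick calculation (a sketch; the function name is illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to store n UTF-16 code units in a given code-unit
// type. With a 32-bit unit, half of every code unit is padding,
// since UTF-16 code units are defined to be 16 bits wide.
template <typename Unit>
std::size_t utf16StorageBytes(std::size_t n) {
    return n * sizeof(Unit);
}
```

A 100-code-unit string needs 200 bytes with a 16-bit unit, but 400 bytes if a 32-bit wchar_t were pressed into service, with no gain in representable characters.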
Dave
---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org