List:       xerces-c-dev
Subject:    Re: How do I use Xerces strings?
From:       David Bertoni <dbertoni () apache ! org>
Date:       2006-03-09 6:22:08
Message-ID: 440FC990.7000608 () apache ! org

Steven T. Hatton wrote:
> On Wednesday 08 March 2006 02:18, Scott Cantor wrote:
> 
>>> IIRC, there /are/ different UTF encodings, even within UTF-16.
>>> There is something called UCS-4, and also something called UCS-2 (I
>>> believe). I do not know the difference between these and their related
>>> UTF-32 and UTF-16.
>> Nor I, but that's what I had in mind when I expressed caution.
> 
> To my mind, the failure to specify a UTF-16 string class is one of the worst 
> aspects of C++.

That would require that C++ define some integral character type that is 
encoded in UTF-16.  It's unlikely that every compiler vendor would agree 
to do that, although it would certainly make implementing software that 
supports Unicode much easier.

> After reading the applicable sections of ISO/IEC 14882:2003, 
> I have come to the conclusion that the Xerces XMLCh is not defined in such a 
> way as to conform to the definition of a C++ implementation's extended 
> character set.

XMLCh is defined to hold UTF-16 code units, which is a much stricter 
requirement than anything the C++ standard says about character sets.

> In order to implement the C++ extended character set, members 
> of the C++ basic character set (ASCII character set) should be defined as 
> wchar_t using their wide character literals.  That is, for example:
> 
> typedef wchar_t XMLCh;
> 
> const XMLCh chLatin_A               = L'A';
> const XMLCh chLatin_B               = L'B';
> const XMLCh chLatin_C               = L'C';
> const XMLCh chLatin_D               = L'D';
> 
> Rather than:
> 
> typedef unsigned short XMLCh;
> 
> const XMLCh chLatin_A               = 0x41;
> const XMLCh chLatin_B               = 0x42;
> const XMLCh chLatin_C               = 0x43;
> const XMLCh chLatin_D               = 0x44;
> 

You are making the assumption that the basic character set must be 
encoded in ASCII, but the C++ standard makes no such requirement.
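To illustrate the difference, here is a minimal sketch. The names CodeUnit, kLatinA, and the two helper functions are hypothetical, made up for this example, and are not the Xerces definitions. The numeric constant is fixed by Unicode, while the value of a character literal depends on the implementation's execution character set (on an EBCDIC platform, for instance, 'A' is 0xC1):

```cpp
// Hypothetical 16-bit code unit type, for illustration only.
typedef unsigned short CodeUnit;

// U+0041 LATIN CAPITAL LETTER A. This value is fixed by Unicode and
// is the same on every platform and compiler.
const CodeUnit kLatinA = 0x41;

// True only when the implementation's narrow execution character set
// happens to encode 'A' the way ASCII does. The C++ standard does not
// guarantee this.
bool narrowLiteralMatchesUnicode()
{
    return 'A' == 0x41;
}

// Likewise for the wide execution character set. Also not guaranteed.
bool wideLiteralMatchesUnicode()
{
    return L'A' == 0x41;
}
```

On most desktop compilers both functions will return true, but code that must produce UTF-16 everywhere cannot rely on that, which is the reason for spelling the constants out numerically.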

> There may be reasons the Xerces developers chose to implement UTF-16 without 
> conforming to the requirements for implementing the C++ extended character 
> set.  I guess, technically speaking, the encoding of UTF-16 and the extended 
> character set will not, in general, coincide.

I'm not sure I understand what you're saying.  Xerces-C encodes 
character data in UTF-16, and to do that, it uses a 16-bit integral 
type.  It cannot use wchar_t to hold UTF-16 code units, because there 
is no guarantee that a particular C++ implementation will encode 
wchar_t in UTF-16.  In fact, there is no requirement that wchar_t even 
be a 16-bit integral type.
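As a rough sketch of what that means in practice (assuming a C++11 compiler that provides std::uint16_t; Utf16Unit, encodeUtf16, and wcharBits are names invented for this example, not part of Xerces-C), a fixed 16-bit type lets you produce UTF-16 code units directly, whatever width wchar_t happens to have:

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a UTF-16 code unit type.
typedef std::uint16_t Utf16Unit;

static_assert(sizeof(Utf16Unit) * CHAR_BIT == 16,
              "a UTF-16 code unit must be exactly 16 bits");

// Encodes one Unicode code point as UTF-16.
// Returns the number of code units written (1 or 2), or 0 if the
// input is a surrogate or lies beyond U+10FFFF.
std::size_t encodeUtf16(std::uint32_t codePoint, Utf16Unit out[2])
{
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        return 0;                       // surrogates are not characters
    if (codePoint < 0x10000) {
        out[0] = static_cast<Utf16Unit>(codePoint);
        return 1;                       // one code unit
    }
    if (codePoint > 0x10FFFF)
        return 0;                       // outside the Unicode range
    codePoint -= 0x10000;
    out[0] = static_cast<Utf16Unit>(0xD800 + (codePoint >> 10));   // high surrogate
    out[1] = static_cast<Utf16Unit>(0xDC00 + (codePoint & 0x3FF)); // low surrogate
    return 2;                           // surrogate pair
}

// Width of wchar_t in bits on this implementation. The standard does
// not fix this value; 16 and 32 are both common.
std::size_t wcharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Calling wcharBits() on different compilers is a quick way to see why wchar_t is not a portable container for UTF-16.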

> That is, there is no requirement that the ASCII character set be
> encoded using ASCII values. In such a case, then the numerical value
> of chLatin_A would not be the same in all implementations.

Well, I would hope an ASCII character would be encoded in ASCII.  ;-) 
Perhaps what you really meant was that there is no requirement that the 
basic character set be encoded in ASCII.

Dave

---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org
