
List:       xerces-c-dev
Subject:    Re: How do I use Xerces strings?
From:       "Steven T. Hatton" <hattons () globalsymmetry ! com>
Date:       2006-03-09 8:03:45
Message-ID: 200603090303.45697.hattons () globalsymmetry ! com

On Thursday 09 March 2006 01:22, David Bertoni wrote:

> That would require that C++ define some integral character type that is
> encoded in UTF-16.  It's unlikely that every compiler vendor would agree
> to do that, although it would certainly make implementing software that
> supports Unicode much easier.

Having looked at things more closely, I find that the Standard does, in its 
typical lawyerly language, require that an implementation behave 'as if' it 
supported UTF-16 for all the locales it supports.

> XMLCh is defined to hold UTF-16 code units, which is a much stricter
> requirement than anything the C++ standard says about character sets.

The C++ Standard only specifies character sets.  It does not specify 
encodings.

> > In order to implement the C++ extended character set, members
> > of the C++ basic character set (ASCII character set) should be defined as
> > wchar_t using their wide character literals.  That is, for example:
> >
> > typedef wchar_t XMLCh;
> >
> > const XMLCh chLatin_A               = L'A';
> > const XMLCh chLatin_B               = L'B';
> > const XMLCh chLatin_C               = L'C';
> > const XMLCh chLatin_D               = L'D';
> >
> > Rather than:
> >
> > typedef unsigned short XMLCh;
> >
> > const XMLCh chLatin_A               = 0x41;
> > const XMLCh chLatin_B               = 0x42;
> > const XMLCh chLatin_C               = 0x43;
> > const XMLCh chLatin_D               = 0x44;
>
> You are making the assumption that the basic character set must be
> encoded in ASCII, but the C++ standard makes no such requirement.

No.  That is exactly what I am not assuming.  The example I show above will 
use whatever encoding my implementation uses for the characters assigned to 
the XMLCh constants.  As long as my implementation supports the character set 
specified in UTF-16 (actually UCS-2) Xerces should work using those 
assignments.
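
To make the difference between the two styles of definition concrete, here is 
a minimal sketch.  The names Utf16Unit, kUtf16Latin_A, kWideLatin_A and 
wideLiteralMatchesUtf16 are hypothetical and do not come from the Xerces-C 
headers.  The hex constant has the same value on every compiler, while the 
value of the wide literal is whatever the implementation's wide execution 
character set assigns to 'A'.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 16-bit XMLCh; not the Xerces-C definition.
typedef std::uint16_t Utf16Unit;

// Code point of LATIN CAPITAL LETTER A as fixed by Unicode (U+0041).
// This value does not depend on the compiler or platform.
const Utf16Unit kUtf16Latin_A = 0x41;

// Value assigned by the implementation's wide execution character set.
// The C++ Standard does not promise that this equals 0x41.
const wchar_t kWideLatin_A = L'A';

// Returns true only on implementations whose wide literal for 'A'
// happens to carry the Unicode code point value.
inline bool wideLiteralMatchesUtf16() {
    return static_cast<unsigned long>(kWideLatin_A) == kUtf16Latin_A;
}
```

Where wideLiteralMatchesUtf16() returns true, the two styles yield the same 
constant for 'A'; where it returns false, only the hex form is a UTF-16 code 
unit.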

> > There may be reasons the Xerces developers chose to implement UTF-16
> > without conforming to the requirements for implementing the C++ extended
> > character set.  I guess, technically speaking, the encoding of UTF-16 and
> > the extended character set will not, in general, coincide.
>
> I'm not sure I understand what you're saying.  Xerces-C encodes
> character data in UTF-16, and to do that, it uses a 16-bit integral. It
> cannot use wchar_t to hold UTF-16 code units, because there is no
> guarantee that a particular C++ implementation will encode wchar_t in
> UTF-16.  In fact, there is no requirement that wchar_t even be a 16-bit
> integral.

It must be wide enough to encode all the UTF-16 characters of the extended 
character sets required by the implementation's supported locales.  wchar_t 
shall have the same size, signedness and alignment requirements as one of the 
other integral data types.  Can you give an example of a C++ implementation 
that doesn't use a 16-bit (or larger) data type for wchar_t?
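
The width question can be checked on any given compiler with a sketch along 
these lines.  The helper names wcharBits and wcharHoldsUtf16Unit are 
hypothetical.  The code only reports what one implementation does; it says 
nothing about what the Standard requires.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Width of wchar_t in bits on this implementation.
inline std::size_t wcharBits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if every 16-bit code unit value (0x0000..0xFFFF) fits in wchar_t
// as a non-negative value on this implementation.
inline bool wcharHoldsUtf16Unit() {
    return std::numeric_limits<wchar_t>::max() >= 0xFFFF;
}
```

For reference, wchar_t is 16 bits with Microsoft's compilers on Windows and 
32 bits with GCC on typical Linux systems, so a wide enough type does not by 
itself tell you which encoding is stored in it.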

> Well, I would hope an ASCII character would be encoded in ASCII.  ;-)
> Perhaps what you really meant was that there is no requirement that the
> basic character set be encoded in ASCII.

The ASCII character set is the collection of alphabetical and punctuation 
symbols encoded by ASCII.

Steven  
