List: xerces-c-dev
Subject: Re: How do I use Xerces strings?
From: "Steven T. Hatton" <hattons () globalsymmetry ! com>
Date: 2006-03-09 8:03:45
Message-ID: 200603090303.45697.hattons () globalsymmetry ! com
On Thursday 09 March 2006 01:22, David Bertoni wrote:
> That would require that C++ define some integral character type that is
> encoded in UTF-16. It's unlikely that every compiler vendor would agree
> to do that, although it would certainly make implementing software that
> supports Unicode much easier.
After looking at things more closely, the Standard does - in its typical
lawyerly language - require that an implementation behave 'as if' it
supported UTF-16 for all the locales it supports.
> XMLCh is defined to hold UTF-16 code units, which is a much stricter
> requirement than anything the C++ standard says about character sets.
The C++ Standard only specifies character sets. It does not specify
encodings.
> > In order to implement the C++ extended character set, members
> > of the C++ basic character set (ASCII character set) should be defined as
> > wchar_t using their wide character literals. That is, for example:
> >
> > typedef wchar_t XMLCh;
> >
> > const XMLCh chLatin_A = L'A';
> > const XMLCh chLatin_B = L'B';
> > const XMLCh chLatin_C = L'C';
> > const XMLCh chLatin_D = L'D';
> >
> > Rather than:
> >
> > typedef unsigned short XMLCh;
> >
> > const XMLCh chLatin_A = 0x41;
> > const XMLCh chLatin_B = 0x42;
> > const XMLCh chLatin_C = 0x43;
> > const XMLCh chLatin_D = 0x44;
>
> You are making the assumption that the basic character set must be
> encoded in ASCII, but the C++ standard makes no such requirement.
No. That is exactly what I am not assuming. The example I show above will
use whatever encoding my implementation uses for the characters assigned to
the XMLCh constants. As long as my implementation supports the character set
specified in UTF-16 (actually UCS-2) Xerces should work using those
assignments.
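
To make the difference concrete, here is a minimal sketch that puts the two
styles of definition side by side. The names WideCh, Utf16Ch and
wideLiteralsMatchUtf16 are made up for illustration and are not Xerces-C
definitions. The only point is that the value of L'A' is chosen by the
implementation, while 0x41 is fixed by the source text:

```cpp
#include <cassert>

// Hypothetical names for illustration only; not the Xerces-C typedefs.
typedef wchar_t        WideCh;   // values come from the implementation's wide literals
typedef unsigned short Utf16Ch;  // values are written out as numeric code units

const WideCh  wideLatin_A  = L'A';   // whatever the implementation uses for 'A'
const Utf16Ch utf16Latin_A = 0x41;   // always 0x41, regardless of the implementation

// Returns true when the implementation's wide encoding of 'A'..'D'
// happens to coincide with the UTF-16 code units 0x41..0x44.
inline bool wideLiteralsMatchUtf16()
{
    return L'A' == 0x41 && L'B' == 0x42 && L'C' == 0x43 && L'D' == 0x44;
}
```

If wideLiteralsMatchUtf16() returns true on a given compiler, the two sets
of constants hold the same values there; if it returns false, they differ,
which is the case under discussion.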
> > There may be reasons the Xerces developers chose to implement UTF-16
> > without conforming to the requirements for implementing the C++ extended
> > character set. I guess, technically speaking, the encoding of UTF-16 and
> > the extended character set will not, in general, coincide.
>
> I'm not sure I understand what you're saying. Xerces-C encodes
> character data in UTF-16, and to do that, it uses a 16-bit integral. It
> cannot use wchar_t to hold UTF-16 code units, because there is no
> guarantee that a particular C++ implementation will encode wchar_t in
> UTF-16. In fact, there is no requirement that wchar_t even be a 16-bit
> integral.
It must be wide enough to encode all the UTF-16 characters of the extended
character sets required by the implementation's supported locales. wchar_t
shall have the same size, signedness and alignment requirements as one of the
other integral data types. Can you give an example of a C++ implementation
that doesn't use a 16-bit (or larger) data type for wchar_t?
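
For anyone who wants to check their own compiler, a small sketch along these
lines reports the width and signedness of wchar_t. The function names are
made up for this message; only sizeof, CHAR_BIT and std::numeric_limits come
from the standard library:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <limits>

// Width of wchar_t in bits on this implementation.
inline std::size_t wcharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Whether wchar_t is a signed type on this implementation.
inline bool wcharIsSigned()
{
    return std::numeric_limits<wchar_t>::is_signed;
}

// Whether every 16-bit code unit value (0x0000..0xFFFF) fits in wchar_t
// without becoming negative or being truncated.
inline bool wcharHoldsUtf16CodeUnit()
{
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max()) >= 0xFFFFul;
}
```

Printing the three results from main() on each target platform would answer
the question for that platform. It says nothing about which encoding the
implementation actually uses for wide characters.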
> Well, I would hope an ASCII character would be encoded in ASCII. ;-)
> Perhaps what you really meant was that there is no requirement that the
> basic character set be encoded in ASCII.
The ASCII character set is the collection of alphabetical and punctuation
symbols encoded by ASCII.
Steven