List:       xerces-c-dev
Subject:    Re: How do I use Xerces strings?
From:       "Steven T. Hatton" <hattons () globalsymmetry ! com>
Date:       2006-03-09 17:53:31
Message-ID: 200603091253.31463.hattons () globalsymmetry ! com

On Thursday 09 March 2006 12:08, David Bertoni wrote:

> I don't see how you can get this from the standard.  There is only one
> mention of Unicode, and UTF-16 does not appear anywhere.  The only thing
> I see is a statement about ISO/IEC 10646 and the
> universal-character-name construct.
<quote 
url="http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-5DFED1F0">
The UTF-16 encoding was chosen because of its widespread industry practice. 
Note that for both HTML and XML, the document character set (and therefore 
the notation of numeric character references) is based on UCS [ISO/IEC 
10646]. A single numeric character reference in a source document may 
therefore in some cases correspond to two 16-bit units in a DOMString (a high 
surrogate and a low surrogate).
</quote>
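
To make the "two 16-bit units" remark concrete, here is a minimal sketch of how
one code point outside the BMP becomes a high and a low surrogate.  CodeUnit and
encodeUtf16 are hypothetical names used only for illustration; they are not
part of Xerces-C, and the function assumes its input is a valid code point.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 16-bit code unit type such as XMLCh.
using CodeUnit = std::uint16_t;

// Encode one Unicode code point as UTF-16.
// Assumption: cp is valid (<= 0x10FFFF and not itself a surrogate).
std::vector<CodeUnit> encodeUtf16(std::uint32_t cp) {
    if (cp < 0x10000) {
        // Inside the BMP: one code unit holds the character directly.
        return { static_cast<CodeUnit>(cp) };
    }
    // Outside the BMP: split the 20-bit offset into two 10-bit halves.
    cp -= 0x10000;
    CodeUnit high = static_cast<CodeUnit>(0xD800 + (cp >> 10));
    CodeUnit low  = static_cast<CodeUnit>(0xDC00 + (cp & 0x3FF));
    return { high, low };
}
```

With this sketch, a numeric character reference such as &#x1D11E; would
occupy two code units (0xD834, 0xDD1E), while &#x41; would occupy one.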

<quote url="http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets">
[Definition: A character is an atomic unit of text as specified by ISO/IEC 
10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, 
and the legal characters of Unicode and ISO/IEC 10646. The versions of these 
standards cited in A.1 Normative References were current at the time this 
document was prepared. New characters may be added to these standards by 
amendments or new editions. Consequently, XML processors MUST accept any 
character in the range specified for Char.]
</quote>

It's not my fault! ;)

> >> XMLCh is defined to hold UTF-16 code units, which is a much stricter
> >> requirement than anything the C++ standard says about character sets.
> >
> > The C++ Standard only specifies character sets.  It does not specify
> > encodings.
>
> That's exactly my point.  And that's why you can't assume that char is
> encoded in ASCII and wchar_t is encoded in UTF-16.  However, Xerces-C
> guarantees that XMLCh will contain UTF-16 code units.

After further investigation and reflection I have come to the conclusion that 
you're damned if you do, and damned if you don't.  You could convert all your 
data to the implementation's character encoding when it's read in, and do the 
reverse when it is stored or transmitted.  With UTF-32, that might provide 
some performance advantages under some circumstances.  It would certainly make 
your data compatible with the facilities provided by the Standard Library.
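
As a rough sketch of the "convert when it's read in" idea, the function below
widens UTF-16 code units to UTF-32 code points.  It assumes a C++11 compiler
(for char16_t and char32_t); utf16ToUtf32 is a hypothetical helper, not a
Xerces-C call, and replacing unpaired surrogates with U+FFFD is my own choice
for the illustration.

```cpp
#include <cstddef>
#include <string>

// Convert UTF-16 code units to UTF-32 code points.
// Assumption: unpaired surrogates are replaced with U+FFFD.
std::u32string utf16ToUtf32(const std::u16string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        bool isHigh = (u >= 0xD800 && u <= 0xDBFF);
        bool isLow  = (u >= 0xDC00 && u <= 0xDFFF);
        if (isHigh && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: recombine into a single code point.
            char32_t lo = in[i + 1];
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
            ++i;  // skip the low surrogate that was just consumed
        } else if (isHigh || isLow) {
            out.push_back(0xFFFD);  // unpaired surrogate
        } else {
            out.push_back(u);       // BMP character, copied as-is
        }
    }
    return out;
}
```

The reverse direction, for storing or transmitting, would split each code
point above 0xFFFF back into a surrogate pair.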

I suspect most Xerces-derived applications will need some kind of codec for 
I/O.  I know I don't want UTF-16 data stored in files I am likely to want 
to edit, or otherwise manipulate outside of Xerces.  If everybody played 
nicely with UTF-16 that would be a different story.

> > No.  That is exactly what I am not assuming.  The example I show above
> > will use whatever encoding my implementation uses for the characters
> > assigned to the XMLCh constants.  As long as my implementation supports
> > the character set specified in UTF-16 (actually UCS-2) Xerces should work
> > using those assignments.
>
> Yes, but that's not very portable.  Perhaps you don't support platforms
> that do not meet this requirement, but Xerces-C does.  By the way, UCS-2
> support is not good enough for Xerces-C, because XML documents can
> contain Unicode characters outside the BMP, which are represented as
> surrogate pairs.

Yes, I see that now.  I believe a conforming C++ implementation is required to 
do the same (for the locales it supports.)

> Why would Xerces-C choose an integral type that's larger than 16 bits
> for its UTF-16 character integral?  If wchar_t is a 32-bit integral,
> then half of all storage allocated for a UTF-16 string would be wasted.

Agreed.
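
A small sketch of the storage arithmetic, assuming a C++11 compiler and the
common case where char16_t is 2 bytes and char32_t is 4 bytes.  Both function
names are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to hold the text as UTF-16 code units.
std::size_t utf16Bytes(const std::u32string& text) {
    std::size_t units = 0;
    for (char32_t cp : text) {
        // Code points outside the BMP need a surrogate pair (two units).
        units += (cp >= 0x10000) ? 2 : 1;
    }
    return units * sizeof(char16_t);
}

// Bytes needed to hold the same text with one 32-bit unit per code point.
std::size_t utf32Bytes(const std::u32string& text) {
    return text.size() * sizeof(char32_t);
}
```

For text that stays inside the BMP, the 32-bit form takes twice the space of
the UTF-16 form; the gap only closes for characters that need surrogate pairs.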

>     Also, Unicode conformance requires that UTF-16 strings use 16-bit
> code units.

Well, all that UTF-16 support actually requires is that it's UTF-16 going in, 
and UTF-16 coming out.

> In addition, users would assume they could call the wide character
> string system functions and expect reasonable results.  That wouldn't
> happen if the system and/or current locale didn't support UTF-16.

Well, you could use UTF-32 internally, but that puts us back to the 
subject of 50% unused primary storage.  One option might be to have my C++ 
implementation (GCC) explicitly support UTF-16, and then have Xerces compile 
with a flag to use it.

C++ is, in many ways, a better language than Java.  UTF support is not one of 
them.  Yes!  I'm frustrated!

Steven 
