List:       xerces-c-dev
Subject:    Re: How do I use Xerces strings?
From:       David Bertoni <dbertoni () apache ! org>
Date:       2006-03-09 19:14:31
Message-ID: 44107E97.80108 () apache ! org

Steven T. Hatton wrote:
> On Thursday 09 March 2006 12:16, David Bertoni wrote:
>> Steven T. Hatton wrote:
> 
>>> wchar_t is 32 bits on my system.  I believe that a 16 bit storage unit
>>> will under normal circumstances occupy a 32 bit memory location, but only
>>> use half of it.
>> Yes, and don't you think that's rather wasteful?  Would you use Xerces-C
>> to process large XML documents if you knew it was wasting half of its
>> character string storage just so it could use wchar_t on all platforms?
> 
> Actually, I did not state my intended meaning well, and I have now come to 
> understand that I was in error.  I was thinking in terms of individual units 
> of storage, i.e., individual characters as opposed to containers.  Containers 
> (at least sequential containers) are basically arrays under the hood, so they 
> do store data contiguously.  I believe an individual 16-bit XMLCh will occupy 
> 32-bits of storage, but that is probably a fairly rare animal, and therefore 
> not worth consideration. 
> 

I guess I don't understand what you mean by "I believe an individual 
16-bit XMLCh will occupy 32-bits of storage."  How can a 16-bit XMLCh 
ever occupy 32 bits of storage?
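
To make the question concrete, here is a minimal sketch of where extra storage can and cannot appear. Char16 is a hypothetical stand-in for XMLCh (assumed to be a 16-bit unsigned integer), and Padded and bytesForArrayOf are names invented for the illustration. Padding may follow a lone 16-bit member inside a struct, but array elements are always contiguous.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for XMLCh: assumed to be a 16-bit unsigned integer.
typedef std::uint16_t Char16;

// A lone 16-bit member next to a 32-bit member. The compiler may insert
// padding after 'ch' so that 'n' is suitably aligned.
struct Padded {
    Char16 ch;
    std::uint32_t n;
};

// Array elements are contiguous, so a buffer of 'count' code units needs
// exactly count * sizeof(Char16) bytes.
inline std::size_t bytesForArrayOf(std::size_t count) {
    return count * sizeof(Char16);
}

// These hold on any platform with 8-bit bytes that provides std::uint16_t.
static_assert(sizeof(Char16) == 2, "Char16 is two 8-bit bytes");
static_assert(sizeof(Char16[8]) == 8 * sizeof(Char16),
              "arrays carry no per-element padding");
```

Any padding in Padded belongs to the struct layout, not to the 16-bit object itself; sizeof(Char16) stays 2 either way.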

>>> Why does Xerces-C use a non-standard data type?
>> unsigned short is not a non-standard type.  You may think it's
>> "non-standard" for holding character data, but Xerces-C encodes
>> character data in UTF-16 code units, and that requires a 16-bit integral
>> type.
> 
> It is (AFAIK) not one of the datatypes supported by my Standard Library 
> implementation.  That is my point.  I cannot seamlessly use it with the 
> facilities provided by the C++ Standard Library.

I agree it's a big problem that you cannot use it with 
std::basic_string, but there's no reason why you can't use it with the 
other containers.  What other facilities do you want to use?
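
As a rough illustration of the container point, the sketch below stores 16-bit code units in a std::vector and runs a standard algorithm over them. Char16 again stands in for XMLCh, and toVector and countUnit are hypothetical helpers, not part of Xerces-C.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint16_t Char16;  // hypothetical stand-in for XMLCh

// Copy a zero-terminated string of 16-bit code units into a std::vector.
inline std::vector<Char16> toVector(const Char16* s) {
    assert(s != 0);
    const Char16* end = s;
    while (*end != 0) {
        ++end;
    }
    return std::vector<Char16>(s, end);
}

// Count occurrences of one code unit using a standard algorithm.
inline std::size_t countUnit(const std::vector<Char16>& v, Char16 unit) {
    return static_cast<std::size_t>(std::count(v.begin(), v.end(), unit));
}
```

Nothing here depends on the element type being a character type; the containers and algorithms only need it to be copyable and comparable.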

> 
>>> If my implementation doesn't support a particular locale, and
>>> therefore does not use a 16 bit or larger data type, then what are the
>>> chances that I would use Xerces-C to support such a character set?
>>
>> You've got it backwards -- Xerces-C only supports the current locale's
>> character set in a very limited fashion, by providing a way to transcode
>> UTF-16 strings to character strings in the current locale.  Otherwise,
>> it operates internally exclusively in UTF-16, and it is unaffected by
>> the current locale or how the system encodes char or wchar_t.
> 
> According to the standard, the C++ implementation must use a wchar_t large 
> enough to hold all the characters used by that locale. Combining that 
> requirement with the requirement that implementation needs to support the 
> character literals of the extended character set using the naming specified 
> by ISO/IEC 10646:2000, I conclude that the requirement is virtually identical 
> to the requirement that it support UTF.  But I won't go so far as to say 
> UTF-16.
> 

UTF-16 is an encoding of the 10646/Unicode character set, and you've 
stated previously that the C++ standard does not talk about encodings:

 > The C++ Standard only specifies character sets.  It does not specify
 > encodings.

There is no requirement that a character specified with a universal 
character name be encoded in any particular way -- it's just another way 
to name a character.
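
To show the difference between a character and its encoding, here is a small sketch of how one Unicode code point maps to UTF-16 code units. This is the standard UTF-16 surrogate-pair arithmetic, not code taken from Xerces-C; Char16 and appendUtf16 are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::uint16_t Char16;  // hypothetical stand-in for XMLCh

// Append the UTF-16 code units that encode one Unicode code point.
inline void appendUtf16(std::uint32_t cp, std::vector<Char16>& out) {
    // Code points stop at U+10FFFF, and surrogates are not characters.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        out.push_back(static_cast<Char16>(cp));
    } else {
        // Supplementary planes: a high and a low surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<Char16>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<Char16>(0xDC00 + (cp & 0x3FF)));
    }
}
```

The same code point would be stored differently in UTF-8 or UTF-32, which is why naming a character says nothing about how it is encoded.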

My version of the standard also has this to say:

"If the hexadecimal value for a universal character name is less than 
0x20 or in the range 0x7F-0x9F (inclusive), or if the universal 
character name designates a character in the basic source character set, 
then the program is ill-formed."

That restricts the usage of universal character names too severely for 
Xerces-C's purposes.
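
A sketch of the alternative: under the rule quoted above, something like \u0078 for the letter 'x' is ill-formed, so the characters of a string such as "xml" can instead be written as numeric code units. The constant names and the length16 helper below are hypothetical, chosen only for this illustration.

```cpp
#include <cstddef>

typedef unsigned short Char16;  // assumed to match XMLCh on this platform

// Hypothetical constants written as numeric UTF-16 code units rather than
// as universal character names.
const Char16 kLatin_x = 0x0078;
const Char16 kLatin_m = 0x006D;
const Char16 kLatin_l = 0x006C;
const Char16 kNull    = 0x0000;

// The string "xml" in UTF-16, independent of the source or execution
// character set of the compiler.
const Char16 kXmlString[] = { kLatin_x, kLatin_m, kLatin_l, kNull };

// Length in code units of a zero-terminated UTF-16 string.
inline std::size_t length16(const Char16* s) {
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}
```

Spelling the values numerically keeps the data the same on every compiler, whatever the local character set happens to be.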

Dave

