List:       xerces-c-dev
Subject:    Re: How do I use Xerces strings?
From:       David Bertoni <dbertoni () apache ! org>
Date:       2006-03-10 0:33:54
Message-ID: 4410C972.5010802 () apache ! org

Steven T. Hatton wrote:
> On Thursday 09 March 2006 14:14, David Bertoni wrote:
>> Steven T. Hatton wrote:
> 
>> I guess I don't understand what you mean by "I believe an individual
>> 16-bit XMLCh will occupy 32-bits of storage."  How can a 16-bit XMLCh
>> ever occupy 32 bits of storage?
> 
> What is the CPU going to stick in the other 16 bits of a 32 bit word when it 
> stores a single XMLCh?

We must be talking about two different things, because I'm talking about 
an array of 16-bit integrals, so no 32-bit units of storage are involved.
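
A minimal sketch of that point, assuming a platform that provides an exact-width 16-bit type. Unit16 is a hypothetical stand-in for XMLCh, not the Xerces-C typedef itself. A lone variable may be aligned or widened in a register by the compiler, but array elements are contiguous:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for XMLCh; assumes the platform provides an
// exact-width 16-bit unsigned type.
typedef std::uint16_t Unit16;

// Number of bytes occupied by an array of n 16-bit code units.
inline std::size_t bytesForUnits(std::size_t n)
{
    return n * sizeof(Unit16);
}

// A four-unit buffer.  Array elements are laid out back to back, so the
// array takes four times the size of one unit and nothing more.
static const Unit16 sample[4] = { 0x0041, 0x0042, 0x0043, 0x0000 };
```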

> 
>> I agree it's a big problem that you cannot use it with
>> std::basic_string, but there's no reason why you can't use it with the
>> other containers.  What other facilities do you want to use?
> 
> Well, I'm still learning the Standard Library, so I don't really know what I 
> can get out of std::basic_string.  I know it has a bunch of searching and 
> manipulation functions.  In all likelihood, I will end up using QString for 
> my UI.  I'm working on a C++ project management infrastructure, and felt 
> somewhat compromised by having to rely on Qt.  Not that I have anything 
> against Qt.  I think it's fantastic.  I just wanted to build the basics of 
> the program using Standard C++.
>  
>> UTF-16 is an encoding of the 10646/Unicode character set, and you've
>> stated previously that the C++ standard does not talk about encodings:
>>  > The C++ Standard only specifies character sets.  It does not specify
>>  > encodings.
>>
>> There is no requirement that a character specified with a universal
>> character name be encoded in any particular way -- it's just another way
>> to name a character.
> 
> There's an isomorphism in there somewhere which, in principle, could be 
> leveraged to bridge between the encodings.  I'm not saying it would be worth 
> doing.
> 
>> My version of the standard also has this to say:
>>
>> "If the hexadecimal value for a universal character name is less than
>> 0x20 or in the range 0x7F-0x9F (inclusive), or if the universal
>> character name designates a character in the basic source character set,
>> then the program is ill-formed."
>>
>> That restricts the usage of universal character names too severely for
>> Xerces-C's purposes.
> 
> I am under the impression that the stipulation you quoted only applies to 
> character literals. AFAIK Xerces-C doesn't support character literals of any 
> kind.  Correct?

Again, I guess we are talking about two different things.  At one point, 
you were trying to prove that you could use wide character literals and 
wide string literals, so I assumed you were trying to show that there is 
a way to specify those things such that they are encoded in a Unicode 
encoding.

> 
> What I really want to know is whether there is significant cost associated 
> with using UTF-16 with support for character sets outside of the BMP.  In 
> some operations, that would require the program to sniff every character to 
> detect if it is multi-unit.  From thinking through scenarios, it seems likely 
> that you could get away with ignoring that aspect of the encoding.
> 

First of all, UTF-16 only encodes the Unicode character set, so I'm not 
sure what you mean by "support for character sets outside of the BMP." 
Do you mean support for Unicode code points outside the BMP?

By definition, UTF-16 supports encoding characters outside the BMP, so 
you cannot purport to encode Unicode code points in UTF-16 and not 
support them.

If you're going to provide an index operation for a UTF-16 string where 
the semantics dictate that you are indexing Unicode code points rather 
than UTF-16 code units, then yes, you will be forced to examine the 
string for surrogate pairs.  The same would hold true for a length 
operation that counted the number of Unicode code points, or for a 
substring operation with those semantics or one that guarantees a 
surrogate pair will never be split.
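
To make the cost concrete, here is a rough sketch of a code-point length and a code-point index over an array of UTF-16 code units. Unit16, countCodePoints and codePointAt are hypothetical names for illustration, not Xerces-C APIs, and the treatment of an unpaired surrogate (counted as a single code point) is an assumption. Both functions have to walk the string from the start:

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint16_t Unit16;   // hypothetical stand-in for XMLCh

inline bool isHighSurrogate(Unit16 u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(Unit16 u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points in s[0..n).  A well-formed surrogate pair counts once;
// an unpaired surrogate is counted as one code point (an assumption).
inline std::size_t countCodePoints(const Unit16* s, std::size_t n)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (isHighSurrogate(s[i]) && i + 1 < n && isLowSurrogate(s[i + 1]))
            ++i;                // skip the low half of the pair
        ++count;
    }
    return count;
}

// Return the code point at code-point position `index`, or 0xFFFFFFFF if
// `index` is out of range.  Linear in the length of the string.
inline std::uint32_t codePointAt(const Unit16* s, std::size_t n,
                                 std::size_t index)
{
    std::size_t i = 0;
    for (std::size_t seen = 0; i < n; ++seen) {
        const bool pair = isHighSurrogate(s[i]) && i + 1 < n
                          && isLowSurrogate(s[i + 1]);
        if (seen == index) {
            if (!pair)
                return s[i];
            // Combine the two halves into a supplementary code point.
            return 0x10000u
                 + ((static_cast<std::uint32_t>(s[i]) - 0xD800u) << 10)
                 + (static_cast<std::uint32_t>(s[i + 1]) - 0xDC00u);
        }
        i += pair ? 2 : 1;
    }
    return 0xFFFFFFFFu;
}
```

Indexing by code unit, by contrast, is plain array access and needs none of this.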

However, lots of applications don't require such operations, so it 
doesn't matter.  If your application does, then you need to measure the 
overhead cost of using UTF-32 vs. the run-time cost of a multi-unit 
encoding like UTF-16.
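
As a starting point for the storage half of that comparison, a small sketch that computes how many bytes a sequence of code points needs in each encoding. The function names are hypothetical, and the input is assumed to contain only valid code points. It says nothing about run-time cost, which has to be profiled in the real application:

```cpp
#include <cstddef>
#include <cstdint>

// UTF-32: always one 32-bit unit (4 bytes) per code point.
inline std::size_t utf32Bytes(std::size_t codePointCount)
{
    return codePointCount * 4;
}

// UTF-16: one 16-bit unit for code points in the BMP, two units (a
// surrogate pair) for code points at or above U+10000.
inline std::size_t utf16Bytes(const std::uint32_t* cps, std::size_t n)
{
    std::size_t units = 0;
    for (std::size_t i = 0; i < n; ++i)
        units += (cps[i] >= 0x10000u) ? 2 : 1;
    return units * 2;
}
```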

Dave
