
List:       xerces-c-dev
Subject:    Re: How do I use Xerces strings?
From:       "Steven T. Hatton" <hattons () globalsymmetry ! com>
Date:       2006-03-10 2:05:46
Message-ID: 200603092105.46951.hattons@globalsymmetry.com

On Thursday 09 March 2006 19:33, David Bertoni wrote:
> Steven T. Hatton wrote:
> > On Thursday 09 March 2006 14:14, David Bertoni wrote:
> >> Steven T. Hatton wrote:
> >>
> >> I guess I don't understand what you mean by "I believe an individual
> >> 16-bit XMLCh will occupy 32-bits of storage."  How can a 16-bit XMLCh
> >> ever occupy 32 bits of storage?
> >
> > What is the CPU going to stick in the other 16 bits of a 32 bit word when
> > it stores a single XMLCh?
>
> We must be talking about two different things, because I'm talking about
> an array of 16-bit integrals, so no 32-bit units of storage are involved.

That is why I explicitly referred to individual XMLCh values as opposed to 
sequential containers.
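
Here's what I'm picturing, as a minimal sketch (standalone code; the XMLCh 
typedef is just a stand-in for the real Xerces-C one, and the exact padding 
is implementation-defined):

<code>
#include <iostream>

typedef unsigned short XMLCh;   // assumption: a 16-bit code unit

struct Node {
    XMLCh        ch;    // 2 bytes of payload...
    unsigned int id;    // ...typically followed by 2 bytes of padding before this
};

int main() {
    XMLCh buf[8];
    std::cout << sizeof(XMLCh) << '\n'   // 2: the type itself is 16 bits
              << sizeof(buf)   << '\n'   // 16: array elements pack tightly
              << sizeof(Node)  << '\n';  // typically 8, not 6, due to padding
    return 0;
}
</code>

An array of XMLCh wastes nothing, but a lone XMLCh mixed in with wider types 
gets rounded up to the alignment of its neighbors.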

> > I am under the impression that the stipulation you quoted only applies to
> > character literals. AFAIK Xerces-C doesn't support character literals of
> > any kind.  Correct?
>
> Again, I guess we are talking about two different things.  At one point,
> you were trying to prove that you could use wide character literals and
> wide string literals, so I assumed you were trying to show that there is
> a way to specify those things such that they are encoded in a Unicode
> encoding.

The ranges in question appear to be explicitly set aside for certain 
purposes, or intentionally left unspecified by the Unicode Standard.  In some 
cases these "characters" overlap with specific ASCII control characters, and 
can be expressed using the existing C++ character literal representations.  
Where the C++ Standard does not explicitly specify basic character set 
representations, even a fully UTF-compliant implementation would not be 
required to let you use those encodings.  IOW, you may need those values, but 
UTF does not give them to you.
  
<quote url="http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf">
15.1 Control Codes There are 65 code points set aside in the
Unicode Standard for compatibility with the C0 and C1 control
codes defined in the ISO/IEC 2022 framework. The ranges of these
code points are U+0000..U+001F, U+007F, and U+0080..U+009F, which
correspond to the 8-bit controls 00₁₆ to 1F₁₆ (C0 controls),
7F₁₆ (delete), and 80₁₆ to 9F₁₆ (C1 controls), respectively.  For
example, the 8-bit legacy control code character tabulation (or
tab) is the byte value 09₁₆; the Unicode Standard encodes the
corresponding control code at U+0009.  The Unicode Standard
provides for the intact interchange of these code points, neither
adding to nor subtracting from their semantics. The semantics of
the control codes are generally determined by the application
with which they are used. However, in the absence of specific
application uses, they may be interpreted according to the
control function semantics specified in ISO/IEC 6429.

In general, the use of control codes constitutes a higher-level
protocol and is beyond the scope of the Unicode Standard. For
example, the use of ISO/IEC 6429:1992 control sequences for
controlling bidirectional formatting would be a legitimate
higher-level protocol layered on top of the plain text of the
Unicode Standard. Higher-level protocols are not specified by the
Unicode Standard; their existence cannot be assumed without a
separate agreement between the parties interchanging such data.
</quote>
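
To make the overlap concrete, here's a minimal sketch (my own example, 
assuming an execution character set where these escapes have their usual 
ASCII/Unicode values, which the C++ Standard does not strictly require):

<code>
#include <iostream>

int main() {
    wchar_t tab  = L'\t';    // U+0009: C++ provides a named escape
    wchar_t bell = L'\a';    // U+0007: likewise
    wchar_t del  = L'\x7f';  // U+007F: no named escape, numeric only
    wchar_t nel  = L'\x85';  // U+0085 (a C1 control): numeric only

    std::wcout << (tab == 0x09) << ' '
               << (nel == 0x85) << std::endl;  // 1 1 on ASCII-based systems
    return 0;
}
</code>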


> First of all, UTF-16 only encodes the Unicode character set, so I'm not
> sure what you mean by "support for character sets outside of the BMP."
> Do you mean support for Unicode code points outside the BMP?

I mean characters whose encoding requires more than one 16-bit unit of storage.

> By definition, UTF-16 supports encoding characters outside the BMP, so
> you cannot purport to encode Unicode code points in UTF-16 and not
> support them.

If I only supported the BMP, yet claimed to support UTF-16, I would not be 
the first person to do so.  But my question is not about whether it is 
non-compliant to fail to support multi-unit character encodings.  My question 
is about the CPU cycles required to provide that support.
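
Just to pin down what that support looks like, here is a minimal sketch of a 
UTF-16 decoding loop (my own code, not the Xerces-C API; error handling for 
unpaired surrogates is omitted):

<code>
#include <vector>

typedef unsigned short XMLCh;      // assumption: 16-bit code units
typedef unsigned long  CodePoint;  // wide enough for U+10FFFF

std::vector<CodePoint> decodeUTF16(const XMLCh* s, unsigned len)
{
    std::vector<CodePoint> out;
    for (unsigned i = 0; i < len; ++i) {
        CodePoint u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < len) {
            // High surrogate: fold in the following low surrogate.
            CodePoint lo = s[++i];
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
        } else {
            out.push_back(u);   // BMP character: one unit, one code point
        }
    }
    return out;
}
</code>

Every unit pays for the surrogate-range test; UTF-32 would skip the branch 
entirely, at twice the storage.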

> However, lots of applications don't require such operations, so it
> doesn't matter.  If your application does, then you need to measure the
> overhead cost of using UTF-32 vs. the run-time cost of a multi-unit
> encoding like UTF-16.

That is basically my question: is there much real cost in using UTF-16 as 
opposed to UTF-32?  The impression I'm getting is that UTF-16 may well be the 
better choice for the vast majority of applications.
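
A back-of-the-envelope comparison, assuming a document of one million 
characters of which 1% fall outside the BMP (numbers invented purely for 
illustration):

<code>
#include <iostream>

int main() {
    const unsigned long total  = 1000000;      // code points in the document
    const unsigned long nonBmp = total / 100;  // assumption: 1% need surrogates

    unsigned long utf16 = (total - nonBmp) * 2 + nonBmp * 4;  // 2,020,000 bytes
    unsigned long utf32 = total * 4;                          // 4,000,000 bytes

    std::cout << "UTF-16: " << utf16 << " bytes\n"
              << "UTF-32: " << utf32 << " bytes\n";
    return 0;
}
</code>

Nearly half the memory, and the surrogate branch is taken only 1% of the 
time, so branch prediction should hide most of its cost.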

Steven

---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org
