'Re: using unicode in khtml'


List:       kfm-devel
Subject:    Re: using unicode in khtml
From:       Waldo Bastian <bastian () ens ! ascom ! ch>
Date:       1999-05-18 8:38:32

Lars Knoll wrote:
> From the HTML-4.0 doc:
> <quote>
> Although the STYLE and SCRIPT elements use CDATA for their data model, for
> these elements, CDATA must be handled differently by user agents. Markup
> and entities must be treated as raw text and passed to the application as
> is. The first occurrence of the character sequence "</" (end-tag open
> delimiter) is treated as terminating the end of the element's content. In
> valid documents, this would be the end tag for the element.
> </quote>
> 
> IMO, this means, we shouldn't parse entities in these elements, and (for
> reason of consistency), this should then also happen with attributes
> defining a small script (like onmouseover, etc...).

Inside elements, yes. Inside attributes, no.

HTML4.0/3.2.2 Attributes:
"By default, SGML requires that all attribute values be delimited 
using either double quotation marks (ASCII decimal 34) or single 
quotation marks (ASCII decimal 39). Single quote marks can be 
included within the attribute value when the value is delimited 
by double quote marks, and vice versa. Authors may also use numeric 
character references to represent double quotes (&#34;) and single 
quotes (&#39;). For double quotes authors can also use the character
entity reference &quot;. "

And HTML4.0/B.3.2 Attribute values:

"When script or style data is the value of an attribute (either 
style or the intrinsic event attributes), authors should escape 
occurrences of the delimiting single or double quotation mark 
within the value according to the script or style language 
convention. Authors should also escape occurrences of "&" if 
the "&" is not meant to be the beginning of a character reference. 

     '"' should be written as "&quot;" or "&#34;" 
     '&' should be written as "&amp;" or "&#38;" 

Thus, for example, one could write:

   <INPUT name="num" value="0"
   onchange="if (compare(this.value, &quot;help&quot;)) {gethelp()}">
"

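A minimal sketch of what this implies for attribute values, written in
plain standard C++ rather than against the khtml tokenizer. The function
name decodeAttributeValue is hypothetical, and only the references named
in the quoted text (&quot;, &#34;, &#39;, &amp;, &#38;) are handled;
everything else is copied through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode the few character references that the quoted
// HTML 4.0 text mentions for attribute values. Not the khtml implementation.
std::string decodeAttributeValue(const std::string& raw)
{
    std::string out;
    out.reserve(raw.size());
    std::size_t i = 0;
    while (i < raw.size()) {
        if (raw[i] != '&') {            // ordinary character: copy it
            out += raw[i++];
            continue;
        }
        const std::size_t semi = raw.find(';', i);
        if (semi == std::string::npos) { // no terminator anywhere: copy the rest
            out.append(raw, i, std::string::npos);
            break;
        }
        const std::string ref = raw.substr(i + 1, semi - i - 1);
        if (ref == "quot" || ref == "#34") {
            out += '"';
        } else if (ref == "amp" || ref == "#38") {
            out += '&';
        } else if (ref == "#39") {
            out += '\'';
        } else {                         // unknown reference: keep the '&' literally
            out += '&';
            ++i;
            continue;
        }
        i = semi + 1;                    // skip past the ';'
    }
    assert(out.size() <= raw.size());    // decoding never lengthens the value
    return out;
}
```

Applied to the onchange value from the example above, this would hand the
script engine the text with real double quotes around help.
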
> The short answer is, that you save a QChar. The higher byte tells you it's
> an attribute, the lower one which one.

Ok.
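
A rough sketch of that encoding, using char16_t as a stand-in for QChar.
The marker byte 0xE0 is an assumption picked to fall inside the
0xE000-0xEFFF range discussed further down; the names are hypothetical
and not taken from the khtml sources.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 0xE0 in the high byte marks "this code unit is an attribute ID".
constexpr std::uint16_t kAttrMarker = 0xE0;

// Pack a one-byte attribute ID into a single 16-bit code unit.
constexpr char16_t makeAttrToken(std::uint8_t attrId)
{
    return static_cast<char16_t>((kAttrMarker << 8) | attrId);
}

// The high byte tells us whether this is an attribute token at all.
constexpr bool isAttrToken(char16_t c)
{
    return (c >> 8) == kAttrMarker;
}

// The low byte tells us which attribute it is.
inline std::uint8_t attrIdOf(char16_t c)
{
    assert(isAttrToken(c));
    return static_cast<std::uint8_t>(c & 0xFF);
}
```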

> Do we really still need null terminated strings? I thought of passing
> tokens as QConstString's (which is basically not different from a pointer
> to a QChar and the length of the string, QConstString avoids copying the
> QChar array...). Advantage is, that we can use all const member of
> QString, which will make handling much easier.

But we should take care to only create a QConstString object for the
single token we are processing. 
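
To illustrate the pointer-plus-length idea without depending on Qt, here
is a small sketch in standard C++. TokenView and makeToken are
hypothetical names; the struct only mimics what Lars describes and is not
QConstString itself. The view borrows the buffer's storage, so it is only
valid while that buffer is alive and unmodified.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical non-owning view of one token: a pointer into the tokenizer's
// buffer plus a length. Nothing is copied until toOwned() is called.
struct TokenView {
    const char16_t* data;
    std::size_t length;

    // Make a real copy for anything that must outlive the buffer.
    std::u16string toOwned() const { return std::u16string(data, length); }
};

// Build a view of buffer[start, start + len) without copying characters.
inline TokenView makeToken(const std::u16string& buffer,
                           std::size_t start, std::size_t len)
{
    assert(start <= buffer.size() && len <= buffer.size() - start);
    return TokenView{buffer.data() + start, len};
}
```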

> > If we want to use 0xE000-0xEfff range for internal use, we also
> > have to make sure we won't get confused when these characters
> > appear in the HTML itself.
> 
> You are right, we should make a check, but the chance of them appearing is
> small, because using characters in that range is not portable.

It's just that we shouldn't crash on it.
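
One possible way to make that check, sketched in standard C++ with
char16_t standing in for QChar. Replacing the offending characters with
U+FFFD is an assumption made for the example; dropping them would work as
well. The function name sanitizeInput is hypothetical.

```cpp
#include <cassert>
#include <string>

// Range reserved for internal tokens, as discussed above.
constexpr char16_t kInternalFirst = 0xE000;
constexpr char16_t kInternalLast  = 0xEFFF;
// Assumption: substitute U+FFFD (REPLACEMENT CHARACTER) for stray input.
constexpr char16_t kReplacement   = 0xFFFD;

constexpr bool isInternal(char16_t c)
{
    return c >= kInternalFirst && c <= kInternalLast;
}

// Run over the decoded document text before tokenizing, so that characters
// from the page can never be mistaken for internal tokens.
std::u16string sanitizeInput(std::u16string text)
{
    assert(!isInternal(kReplacement)); // the substitute must be outside the range
    for (char16_t& c : text) {
        if (isInternal(c))
            c = kReplacement;
    }
    return text;
}
```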

> > (\r never appears in the HTML since we convert it to a \n,
> > and &#10 is passed as an entity and converted in the parser,
> > not in the tokenizer.)
> 
> That will change. I am now converting entities to QChar's in the
> tokenizer (IMO that's the place where it should be).

Yes.

Cheers,
Waldo
