
List:       kfm-devel
Subject:    Re: using unicode in khtml
From:       Lars Knoll <Lars.Knoll () mpi-hd ! mpg ! de>
Date:       1999-05-17 17:15:34

On Mon, 17 May 1999, Waldo Bastian wrote:

> Lars Knoll wrote:
> > > Or is <HEAD></HEAD> optional? I don't think so.
> > 
> > It is. And so is <body>...
> 
> That would make it something like:
> 
> while(tag_is_header_tag) process_header_tag;
> 
> With header tags being HEAD/TITLE/META.... anything else? STYLE?
Here's the complete list:
	SCRIPT STYLE META LINK OBJECT TITLE BASE
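
In code, the loop you sketch above would look roughly like this (the names
are made up, nothing of this is in khtml yet):

    #include <strings.h>

    // The tags that may legally appear before the document body starts.
    static const char* const headerTags[] = {
        "SCRIPT", "STYLE", "META", "LINK", "OBJECT", "TITLE", "BASE", 0
    };

    static bool isHeaderTag(const char* tag)
    {
        for (int i = 0; headerTags[i]; ++i)
            if (strcasecmp(tag, headerTags[i]) == 0)
                return true;
        return false;
    }

    // while (isHeaderTag(nextTag())) processHeaderTag();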

> > Several things. First of all, I'll add a call to the khtmlwidget, to
> > set a specific charset (in case it is known from the header, or the
> > user requests a specific one). Second possibility is the <meta> tag.
> > This assumes that the encoding does at least leave ascii unchanged.
> 
> This is a valid assumption according to some recommendation.
> 
> > This is
> > the case for most of them (except for utf16, for which I added a small
> > hook (then the document should start with 0xfeff or 0xfffe, depending on
> > byte order)). As long as the document charset is unknown, and we are still
> > in the header, no output is returned to the tokenizer. As soon as we get
> > to know it (by parsing a meta tag or finding utf16), the whole document
> > (from the first byte on) as received so far is converted to Unicode using
> > QTextCodecs, and returned to the tokenizer.
> 
> Is your class looking for META-tags itself? 
Yes. 

> I thought of something like:
> 
> 1. Guess charset (with a check for utf16) (charset-converter)
> 2. Convert to unicode using guessed charset (charset-converter)
> 3. Run tokenizer until it has parsed the header (tokenizer)
> 4. Determine definitive charset. Check if guess was ok? If yes, go to 7.
> 5. Rewind input stream & reset tokenizer.
> 6. Convert to unicode using definitive charset. (charset-converter)
> 7. Run tokenizer until finished. (tokenizer)

What I do at the moment is not so different. I first do a small check for
utf16; if that fails, I go through the header looking for <meta> tags,
until I either find
    1. a charset specification
    2. the first non-header tag

If I encounter either of the two, the charset is fixed, the whole input
(including the header) is converted to unicode and then passed to the
tokenizer. That way the tokenizer parses the input just once, and the
class determining the charset only does a quick scan to find the encoding.
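
In rough (non-Qt) pseudo code the scan looks about like this; the meta
handling is reduced to a trivial strstr and the names are made up:

    #include <cctype>
    #include <cstring>
    #include <string>

    // Returns true once the charset can be fixed; 'charset' then names the
    // encoding the whole buffer (from the first byte on) gets converted
    // with, e.g. through QTextCodec::codecForName(charset.c_str()).
    bool guessCharset(const std::string& buf, std::string& charset)
    {
        // 1. utf16 documents announce themselves with a byte order mark.
        if (buf.size() >= 2) {
            unsigned char b0 = buf[0], b1 = buf[1];
            if ((b0 == 0xfe && b1 == 0xff) || (b0 == 0xff && b1 == 0xfe)) {
                charset = "utf16";
                return true;
            }
        }
        // 2. Scan the header for a charset declaration in a <meta> tag.
        //    (The real scanner parses the tag; strstr is only a stand-in.)
        if (const char* p = std::strstr(buf.c_str(), "charset=")) {
            p += 8;
            const char* e = p;
            while (*e && *e != '"' && *e != '\'' && *e != '>' && *e != ';'
                   && !std::isspace((unsigned char)*e))
                ++e;
            charset.assign(p, e - p);
            return true;
        }
        // 3. First non-header tag reached without a declaration: give up
        //    and fall back to a default. (<body> is only a stand-in; the
        //    real check is "anything but the header tags listed above".)
        if (std::strstr(buf.c_str(), "<body") || std::strstr(buf.c_str(), "<BODY")) {
            charset = "iso8859-1";
            return true;
        }
        return false;   // still inside the header, keep buffering
    }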

> Perhaps step 5 is not needed at all, if the header doesn't contain any
> charset specific stuff.. However, I think the contents of the TITLE
> tag are quite dependent on the charset.

The title tag is quite dependent on it, and text in scripts may be as well.

> > > According to the spec entities should always be parsed.
> > 
> > Yes and no... As far as I remember, some attribute values do not allow
> > them (if you take the specs literally...)
> 
> I always take specs very literally. I only came across some examples
> stating to take care with using '&' inside javascript. (When supplied
> in an attribute) For the rest this is defined by SGML which states
> that entities should be expanded inside attributes.
> 
> If this is defined somewhere else otherwise please let me know.

From the HTML-4.0 doc:
<quote>
Although the STYLE and SCRIPT elements use CDATA for their data model, for
these elements, CDATA must be handled differently by user agents. Markup
and entities must be treated as raw text and passed to the application as
is. The first occurrence of the character sequence "</" (end-tag open
delimiter) is treated as terminating the end of the element's content. In
valid documents, this would be the end tag for the element.
</quote>

IMO this means we shouldn't parse entities in these elements, and (for
consistency) the same should then apply to attributes containing a small
script (like onmouseover, etc...).
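
For the tokenizer that rule boils down to something like this (only a
sketch):

    #include <cstddef>
    #include <string>

    // Inside SCRIPT or STYLE everything is passed on verbatim: no entity
    // expansion, no tag parsing. The first "</" terminates the content.
    std::string readRawContent(const std::string& input, std::size_t pos,
                               std::size_t& resumeAt)
    {
        std::size_t close = input.find("</", pos);
        if (close == std::string::npos)
            close = input.size();      // unterminated: take the rest
        resumeAt = close;              // the tokenizer continues at "</"
        return input.substr(pos, close - pos);
    }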

> > Unfortunately not a lot of people making web-pages are aware of this.

This might be a problem...

> 
> > About. I think it would be good to move the control chars to the unicode
> > private use area. Things would look about like (everything QChars...)
> > 
> > "Hello " or "\0xE000Hello "        -- perhaps add a hint for text
> > "\0xE021\0xE435+4\0xE423helvetica" -- FONT start tag, 0xE000-0xE0FF for
> >                                       tags, 0xE400-0xE4FF for attributes,
> > "World" or "\0xE100World"          -- again text
> > "\0xE022       "                   -- FONT end tag
> > 
> > and perhaps use something like \0xEFFF as an end of token character.
> 
> What would be the benefit of using the private use area for 
> this? The meaning of the character, in this context, is defined 
> by its position, not by its value. I don't see why we should use
> 0xE400-0xE4FF for attributes instead of for example 0x00-0xff 
> (although 0x00 might be a bad idea).

The short answer is that you save a QChar: the high byte tells you it's
an attribute, the low byte which one.
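
To make the saving concrete, something like the following; the ranges just
follow the example above and are by no means final:

    // One 16-bit (QChar sized) value carries both the token class in the
    // high byte and the tag/attribute id in the low byte.
    enum {
        TagBase    = 0xE000,   // 0xE000-0xE0FF: tags
        AttrBase   = 0xE400,   // 0xE400-0xE4FF: attributes
        EndOfToken = 0xEFFF
    };

    inline unsigned short attrToken(unsigned char id)  { return (unsigned short)(AttrBase | id); }
    inline bool          isAttrToken(unsigned short c) { return (c & 0xFF00) == AttrBase; }
    inline unsigned char tokenId(unsigned short c)     { return (unsigned char)(c & 0x00FF); }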
> 
> *some thinking later*
> 
> In the example you gave, you combine the "end-of-attribute-value" 
> delimiter with the "attribute-token". The (small) advantage of 
> having a separate "end-of-attribute-value" (the \r) is (was) 
> that it could be replaced by \0 and that you then have a null-
> terminated string. 

Do we really still need null-terminated strings? I thought of passing
tokens as QConstStrings (which is basically nothing more than a pointer
to a QChar array plus the length of the string; QConstString avoids
copying the QChar array...). The advantage is that we can use all the
const members of QString, which will make handling much easier.
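
Usage would be about like this (signature from memory, so check qstring.h
before relying on it):

    #include <qstring.h>   // QConstString lives here, if memory serves

    // Wrap a slice of the token buffer without copying the QChar data
    // and use the const QString interface on it.
    void handleToken(QChar* tokenStart, uint tokenLength)
    {
        QConstString token(tokenStart, tokenLength);
        const QString& s = token.string();  // no deep copy of the QChars
        int eq = s.find('=');               // any const QString member works
        // ...
        (void)eq;
    }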

> 
> If we want to use 0xE000-0xEfff range for internal use, we also
> have to make sure we won't get confused when these characters
> appear in the HTML itself.

You are right, we should add a check, but the chance of them appearing is
small, since using characters from that range is not portable anyway.
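
A cheap check could look like this (sketch only):

    // Replace any character from the internal range that shows up in the
    // document itself, so it can't be mistaken for a token marker.
    inline unsigned short sanitize(unsigned short c)
    {
        return (c >= 0xE000 && c <= 0xEFFF) ? 0xFFFD /* replacement char */ : c;
    }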

> (\r never appears in the HTML since we convert it to a \n, 
> and &#10 is passed as an entity and converted in the parser, 
> not in the tokenizer.)

That will change. I am now converting entities to QChars in the
tokenizer (IMO that's the place where it should happen).
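
For numeric entities the conversion is straightforward, named ones only
need a lookup table; roughly:

    #include <cstdlib>
    #include <cstring>

    // Turn an entity reference (the text between '&' and ';') into a
    // single 16-bit unicode value, the way the tokenizer would store it
    // in a QChar. Returns 0 for entities it doesn't know.
    unsigned short entityToUnicode(const char* name)
    {
        if (name[0] == '#') {                      // numeric: &#10; or &#x20AC;
            const char* digits = name + 1;
            int base = 10;
            if (*digits == 'x' || *digits == 'X') { ++digits; base = 16; }
            return (unsigned short)std::strtol(digits, 0, base);
        }
        if (std::strcmp(name, "amp")  == 0) return '&';   // a real table would
        if (std::strcmp(name, "lt")   == 0) return '<';   // cover the complete
        if (std::strcmp(name, "gt")   == 0) return '>';   // HTML 4.0 entity list
        if (std::strcmp(name, "quot") == 0) return '"';
        return 0;
    }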

Cheers,
Lars
