List:       kfm-devel
Subject:    Re: using unicode in khtml
From:       Waldo Bastian <bastian () ens ! ascom ! ch>
Date:       1999-05-17 14:57:01

Lars Knoll wrote:
> > Or is <HEAD></HEAD> optional? I don't think so.
> 
> It is. And so is <body>...

That would make it something like:

while(tag_is_header_tag) process_header_tag;

With header tags being HEAD/TITLE/META... anything else? STYLE?
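
A minimal C++ sketch of that loop. It assumes the tokenizer has already produced a list of upper-cased tag names; isHeaderTag and skipHeader are hypothetical names, and the set of header tags is only the tentative list above, not a settled one.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical membership test. Whether STYLE (or anything else) belongs
// in this set is exactly the open question, so it is left out here.
static bool isHeaderTag(const std::string& tag) {
    static const std::set<std::string> headerTags = {
        "HEAD", "/HEAD", "TITLE", "/TITLE", "META"};
    return headerTags.count(tag) > 0;
}

// Walks over the leading header tags and returns the index of the first
// tag that is not a header tag, i.e. where the body implicitly begins.
std::size_t skipHeader(const std::vector<std::string>& tags) {
    std::size_t i = 0;
    while (i < tags.size() && isHeaderTag(tags[i])) {
        ++i;  // a real parser would process the header tag here
    }
    return i;
}
```

Because the loop simply stops at the first non-header tag, a document with no <HEAD> at all yields index 0 and needs no special case.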

> Several things. First of all, I'll add a call to the khtmlwidget, to set
> a specific charset (in case it is known from the header, or the user
> requests a specific one). Second possibility, is the <meta> tag. This
> assumes, that the encoding does at least leave ascii unchanged. 

This is a valid assumption according to some recommendation.

> This is
> the case for most of them (except for utf16, for which I added a small
> hook (then the document should start with 0xfeff or 0xfffe, depending on
> byte order)). As long as the document charset is unknown, and we are still
> in the header, no output is returned to the tokenizer. As soon as we get
> to know it (by parsing a meta tag or finding utf16), the whole document
> (from the first byte on) as received so far is converted to Unicode using
> QTextCodecs, and returned to the tokenizer.
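
The utf16 check described above amounts to looking at the first two bytes. A small C++ sketch, where Utf16Order and detectUtf16 are hypothetical names and not part of khtml:

```cpp
#include <cassert>
#include <cstddef>

enum class Utf16Order { None, BigEndian, LittleEndian };

// Looks for the byte order mark U+FEFF at the start of the data.
// Bytes FE FF mean big-endian; bytes FF FE are the same character in
// little-endian order (it reads as 0xFFFE if the order is assumed wrongly).
Utf16Order detectUtf16(const unsigned char* data, std::size_t len) {
    if (len < 2) return Utf16Order::None;  // not enough data to decide yet
    if (data[0] == 0xFE && data[1] == 0xFF) return Utf16Order::BigEndian;
    if (data[0] == 0xFF && data[1] == 0xFE) return Utf16Order::LittleEndian;
    return Utf16Order::None;
}
```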

Is your class looking for META-tags itself? 
I thought of something like:

1. Guess charset (with a check for utf16) (charset-converter)
2. Convert to unicode using guessed charset (charset-converter)
3. Run tokenizer until it has parsed the header (tokenizer)
4. Determine definitive charset. Check if guess was ok? If yes, go to 7.
5. Rewind input stream & reset tokenizer.
6. Convert to unicode using definitive charset. (charset-converter)
7. Run tokenizer until finished. (tokenizer)
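
The steps above could be sketched in C++ roughly as follows. Everything here is a hypothetical stand-in: Pipeline, convert, charsetFromHeader and decodeDocument are made-up names for the charset-converter and the header pass of the tokenizer, and the sketch relies on the assumption that the guessed charset leaves ASCII unchanged, so the header can be read before the real charset is known.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical hooks for the two components involved.
struct Pipeline {
    // charset-converter: raw bytes + charset name -> decoded text
    std::function<std::string(const std::string&, const std::string&)> convert;
    // tokenizer header pass: returns the declared charset, or "" if none
    std::function<std::string(const std::string&)> charsetFromHeader;
};

struct Result {
    std::string text;     // decoded document handed to step 7
    std::string charset;  // charset finally used
    bool reconverted;     // true if steps 5 and 6 were needed
};

Result decodeDocument(const std::string& raw, const std::string& guess,
                      const Pipeline& p) {
    std::string text = p.convert(raw, guess);          // steps 1 and 2
    std::string declared = p.charsetFromHeader(text);  // steps 3 and 4
    if (declared.empty() || declared == guess) {
        return {text, guess, false};                   // guess was ok
    }
    // Steps 5 and 6: start over from the raw bytes with the real charset.
    return {p.convert(raw, declared), declared, true};
}
```

Keeping the raw bytes around is what makes the rewind in step 5 possible; the cost is buffering the input until the header has been parsed.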

Perhaps step 5 is not needed at all if the header doesn't contain any
charset-specific content. However, I think the contents of the TITLE
tag depend heavily on the charset.

> > According to the spec entities should always be parsed.
> 
> Yes and no... As far as I remember, some attribute values do not allow
> them (if you take the specs literally...)

I always take specs very literally. I have only come across some examples
warning about using '&' inside JavaScript (when supplied in an
attribute). Otherwise this is defined by SGML, which states
that entities should be expanded inside attributes.

If this is defined differently somewhere else, please let me know.

Unfortunately, not many people who make web pages are aware of this.

> About. I think it would be good to move the control chars to the unicode
> private use area. Things would look about like (everything QChars...)
> 
> "Hello " or "\0xE000Hello "        -- perhaps add a hint for text
> "\0xE021\0xE435+4\0xE423helvetica" -- FONT start tag, 0xE000-0xE0FF for
>                                       tags, 0xE400-0xE4FF for attributes,
> "World" or "\0xE100World"          -- again text
> "\0xE022       "                   -- FONT end tag
> 
> and perhaps use something like \0xEFFF as an end of token character.

What would be the benefit of using the private use area for 
this? The meaning of the character, in this context, is defined 
by its position, not by its value. I don't see why we should use
0xE400-0xE4FF for attributes instead of for example 0x00-0xff 
(although 0x00 might be a bad idea).

*some thinking later*

In the example you gave, you combine the "end-of-attribute-value"
delimiter with the "attribute-token". The (small) advantage of
having a separate "end-of-attribute-value" (the \r) is (was)
that it could be replaced by \0, which gives you a
null-terminated string.

If we want to use the 0xE000-0xEFFF range for internal use, we also
have to make sure we don't get confused when these characters
appear in the HTML itself.
(\r never appears in the HTML since we convert it to a \n, 
and &#10 is passed as an entity and converted in the parser, 
not in the tokenizer.)
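
One possible way to guard against that, sketched in C++ with std::u16string standing in for a QChar string. The names isInternalMarker and sanitizeInput are hypothetical, and replacing offending characters with U+FFFD is only an assumed policy; remapping them elsewhere would work just as well.

```cpp
#include <cassert>
#include <string>

// Range proposed for internal tokenizer markers.
constexpr char16_t kInternalFirst = 0xE000;
constexpr char16_t kInternalLast  = 0xEFFF;

bool isInternalMarker(char16_t c) {
    return c >= kInternalFirst && c <= kInternalLast;
}

// Run on decoded document text before the tokenizer inserts its own
// markers: any character already in the reserved range is replaced by
// U+FFFD so it cannot later be mistaken for a tag or attribute token.
std::u16string sanitizeInput(const std::u16string& in) {
    std::u16string out = in;
    for (char16_t& c : out) {
        if (isInternalMarker(c)) c = 0xFFFD;
    }
    return out;
}
```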

Cheers,
Waldo
-- 
KDE, Making The Future of Computing Available Today       
http://www.kde.org
