
List:       kfm-devel
Subject:    Re: using unicode in khtml
From:       Lars Knoll <Lars.Knoll () mpi-hd ! mpg ! de>
Date:       1999-05-17 13:58:23

On Mon, 17 May 1999, Waldo Bastian wrote:

> Lars Knoll wrote:
>  
> > Not so easy, since there is no clear separation between head and body
> > (according to the HTML doc, the body starts with the first tag not
> > belonging to the header).
> 
> The header is everything between <HEAD> and </HEAD>, isn't it? If we
> encounter something else, we have reached the body and can stop.
> 
> Or is <HEAD></HEAD> optional? I don't think so.

It is. And so is <body>...

> > The <body> tag can be completely missing.
> > I decided now to make a new (and small) class dealing with the input
> > stream khtml gets. The class will be called from the tokenizer, and do the
> > transformation to unicode. This seems to work already.
> 
> How does it find out which charset should be used? 

Several things. First of all, I'll add a call to the khtmlwidget to set
a specific charset (in case it is known from the header, or the user
requests a specific one). The second possibility is the <meta> tag. This
assumes that the encoding at least leaves ASCII unchanged, which is the
case for most of them (except for UTF-16, for which I added a small
hook; such a document should start with 0xFEFF or 0xFFFE, depending on
byte order). As long as the document charset is unknown and we are still
in the header, no output is returned to the tokenizer. As soon as we
know it (by parsing a meta tag or detecting UTF-16), the whole document
as received so far (from the first byte on) is converted to Unicode
using QTextCodecs and returned to the tokenizer.

If I don't find anything specifying the charset before the first tag of
the body, I assume Latin-1.
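To illustrate the buffering idea, here is a minimal sketch in plain C++
(hypothetical names and structure, not the actual khtml code; the real
class would hand conversion off to QTextCodec):

```cpp
#include <cctype>
#include <string>

// Sketch of the buffering decoder described above: raw bytes are
// collected until the charset is known, then the whole buffer is
// decoded from the first byte on.
class InputDecoder {
public:
    // Feed raw bytes; returns decoded output once the charset is known.
    std::string feed(const std::string &bytes) {
        buffer += bytes;
        if (charset.empty())
            detectCharset();
        if (charset.empty())
            return "";              // still unknown: hold everything back
        std::string out = buffer;   // real code would convert via QTextCodec
        buffer.clear();
        return out;
    }
    std::string charsetName() const { return charset; }

private:
    void detectCharset() {
        // UTF-16 hook: a byte-order mark (0xFEFF / 0xFFFE) at the start.
        if (buffer.size() >= 2) {
            unsigned char a = buffer[0], b = buffer[1];
            if ((a == 0xFE && b == 0xFF) || (a == 0xFF && b == 0xFE)) {
                charset = "utf-16";
                return;
            }
        }
        // <meta ...charset=...> scan; this works because the encodings
        // we care about leave the ASCII range unchanged.
        std::string::size_type pos = buffer.find("charset=");
        if (pos != std::string::npos) {
            pos += 8;
            std::string::size_type end = pos;
            while (end < buffer.size() &&
                   (isalnum((unsigned char)buffer[end]) || buffer[end] == '-'))
                ++end;
            if (end > pos)
                charset = buffer.substr(pos, end - pos);
        }
        // The Latin-1 fallback would trigger at the first body tag.
    }

    std::string buffer;
    std::string charset;
};
```

Note how nothing is emitted while the charset is unknown, so the
tokenizer never sees bytes decoded with the wrong codec.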

> > > I assumed that within the body, the charsets wouldn't change.
> > 
> > Just looked up the specs once again. You are right. The charset attribute
> > always refers to a linked resource and only appears in three tags
> > (<a>, <link> and <script>), so we do not have to care about changing
> > charsets in the middle of the document (which would have been a real pain
> > in the ass...)
> 
> Pfew :)
> 
> > > If we can convert the input stream to unicode before the tokenizer,
> > > the tokenizer and the html-parser only need minor modifications:
> > > - All "char *" needs to be converted to "QChar *". I wouldn't use
> > >   QStrings because of memory allocation overhead.
> > 
> > Tried it. But there are many (char *) specific things used (both in the
> > tokenizer and in the parser) which need to be changed. I'm working on
> > it, and I hope to get it working soon.
> 
> There are some "special" characters used to indicate the start of a TAG.
> But that should translate transparently from char * to QChar, /me thinks.

Yes, that doesn't cause problems. In fact, there's no really big
problem; it's just a lot of small things, like testing whether the
current char equals '\0' to find the end of the string, using strcmp and
related functions, and so on. Nothing really complicated, it just takes
a while.
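To show the kind of small change involved, here is a sketch with a
plain 16-bit `UChar` standing in for Qt's QChar (hypothetical helper
names): the '\0' end-of-string test carries over directly, but the
byte-oriented C library functions have to be rewritten.

```cpp
// Stand-in for QChar: a 16-bit Unicode character (illustrative sketch,
// not khtml's actual code).
typedef unsigned short UChar;

// strlen/strcmp-style helpers for 16-bit strings.
static unsigned ustrlen(const UChar *s) {
    unsigned n = 0;
    while (s[n] != 0)   // same idea as testing a char against '\0'
        ++n;
    return n;
}

static int ustrcmp(const UChar *a, const UChar *b) {
    while (*a != 0 && *a == *b) { ++a; ++b; }
    return (int)*a - (int)*b;
}
```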

> > > > Another problem is how to change the tokenizer, parser and the HtmlObjects
> > > > to use QChars/QStrings. After a first glance it seems to me that it would
> > > > perhaps be best to rewrite the tokenizer; the parser would need some
> > > > bigger changes, and the HTMLObjects could get away with some minor
> > > > modifications.
> > >
> > > The tokenizer is quite new. Changing 'char' to 'QChar' should be enough.
> > 
> > Working on that at the moment. But in the long run (after I have it
> > working with the current tokenizer and parser...), I was thinking of a
> > somewhat different design for the tokenizer. We are already using a
> > hash table for the tags, and I thought of extending that to the
> > attributes too. This should make things a bit faster, and has the big
> > advantage that you can return additional information from the hash
> > table. One example is whether we should parse entities in this
> > attribute or not; another is information about what to expect in the
> > attribute value. Like that, one could perhaps already do some
> > preprocessing of the attribute's value there.
> 
> According to the spec entities should always be parsed.

Yes and no... As far as I remember, some attribute values do not allow
them (if you take the specs literally...).
> 
> The tokenizer already does some preprocessing on the attributes:
> - it unquotes them
> - it marks the end of each attribute (with a \r I believe)
> 
> Replacing the attribute-name with an attribute-index (using a hash
> table) would be a minor change.
> 
> > Unicode has the big advantage of having a rather large area for private
> > use, and I wanted to use this area for encoding all the tags and
> > attributes, each of them with a single unicode character. The code in the
> > parser would IMO profit a lot from such an approach (you could do the
> > parsing of the attributes with a switch statement instead of lots of
> > string comparisons).
> 
> We currently use some ASCII < 32 for this. Like '\r'. The next character
> is then the index-value of the tag, followed by the attribute/value pairs.
> 
> Out of my head it translates something like this:
> 
> "Hello <FONT size='+4' face="helvetica">World</FONT>"
> 
> to:
> 
> "Hello \000\r\021size=+4\rface=helvetica\000World\000\r\221\000"
> 
> Which is returned as the following tokens:
> 
> "Hello "	-- Text
> "\r\021size=+4\rface=helvetica" -- FONT start tag
> "World" 	-- Text
> "\r\221\000"			-- FONT end tag

About right. I think it would be good to move the control characters to
the Unicode private use area. Things would then look about like this
(everything QChars...):

"Hello " or "\uE000Hello "         -- perhaps add a hint for text
"\uE021\uE435+4\uE423helvetica"    -- FONT start tag; 0xE000-0xE0FF for
                                      tags, 0xE400-0xE4FF for attributes
"World" or "\uE100World"           -- again text
"\uE022"                           -- FONT end tag

and perhaps use something like \uEFFF as an end-of-token character.
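A minimal sketch of how such a PUA-encoded start tag could then be
parsed with a switch statement instead of string comparisons. All code
point assignments and names here are illustrative only, not an agreed
encoding:

```cpp
#include <cassert>
#include <vector>

typedef unsigned short UChar;   // stand-in for QChar

// Illustrative PUA layout: tag codes in 0xE000.., attribute codes in
// 0xE400.., a single end-of-token character at 0xEFFF.
enum { TagBase = 0xE000, AttrBase = 0xE400, EndOfToken = 0xEFFF };
enum { TAG_FONT = 0x21 };                     // <FONT> becomes 0xE021
enum { ATTR_SIZE = 0x35, ATTR_FACE = 0x23 }; // made-up attribute IDs

struct FontAttrs { int size; std::vector<UChar> face; };

// Parse one FONT start-tag token: first QChar is the tag code, then
// attribute-code + value pairs, terminated by EndOfToken.
static FontAttrs parseFontToken(const UChar *tok) {
    FontAttrs f = { 0, std::vector<UChar>() };
    assert(*tok == TagBase + TAG_FONT);
    ++tok;
    while (*tok != EndOfToken) {
        UChar attr = *tok++;
        switch (attr) {             // replaces strcmp("size", ...) etc.
        case AttrBase + ATTR_SIZE: {
            int sign = 1;
            if (*tok == '+') ++tok;
            else if (*tok == '-') { sign = -1; ++tok; }
            while (*tok >= '0' && *tok <= '9')
                f.size = f.size * 10 + sign * (*tok++ - '0');
            break;
        }
        case AttrBase + ATTR_FACE:
            while (*tok < AttrBase)         // value chars are below the
                f.face.push_back(*tok++);   // attribute/end-token range
            break;
        default:                            // unknown attribute: skip value
            while (*tok < AttrBase)
                ++tok;
        }
    }
    return f;
}
```

Since each tag and attribute is a single QChar, the hash-table lookup
happens once in the tokenizer and the parser only ever compares single
characters.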

Cheers,

Lars
