List:       kfm-devel
Subject:    Re: using unicode in khtml
From:       Waldo Bastian <bastian () ens ! ascom ! ch>
Date:       1999-05-17 8:18:25

Lars Knoll wrote:
> 
> Hi all,
> 
> as you perhaps know, khtml has so far used 8-bit chars for basic
> text encoding. I think we should move that to unicode (QChar/QString), and
> I have already started working on it a bit.
> 
> Looking at the HTML4.0 specs and rfc2070 (Internationalization of HTML), I
> came to the conclusion that it would be best to handle everything in the
> khtml lib using QChar/QStrings, and to provide a new class (in front of the
> tokenizer) that gets information about the document encoding from the http
> header (meaning from kioslave) if it is specified, and tries to determine
> it if not. This class will also need to be able to switch the encoding in
> the middle of the document, in case it encounters a tag with the "charset"
> attribute. Perhaps the parsing of entities could also be done in there
> already.

Switching in the middle of the document is very painful with the current
design, since the tokenizer and the parser are not very tightly coupled.

My idea was to let the tokenizer run in two batches:
* The first run decodes the <head>, which contains all META tags.
* The second run decodes the body.

I assumed that within the body, the charsets wouldn't change.
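
As a rough illustration of the first run, here is a minimal sketch that looks
for a charset declaration in the head only and ignores anything in the body.
It is not khtml code: sniffCharset() is a hypothetical helper, it uses plain
std::string instead of Qt types, and it assumes the head is written in an
ASCII-compatible encoding so that the META tag can be found before the real
charset is known.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of khtml): returns the lower-cased charset
// named in the document head, or an empty string if the head names none.
std::string sniffCharset(const std::string& rawBytes) {
    // Lower-case a copy so the search is case-insensitive.
    std::string lower(rawBytes);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // First run: only the head is inspected. If there is no </head>,
    // find() returns npos and substr() keeps the whole input.
    const std::string head = lower.substr(0, lower.find("</head>"));

    const std::string key = "charset=";
    std::string::size_type pos = head.find(key);
    if (pos == std::string::npos)
        return "";
    pos += key.size();

    // Skip an optional opening quote.
    if (pos < head.size() && (head[pos] == '"' || head[pos] == '\''))
        ++pos;

    // Collect the characters that can appear in a charset name.
    std::string::size_type end = pos;
    while (end < head.size() &&
           (std::isalnum(static_cast<unsigned char>(head[end])) ||
            head[end] == '-' || head[end] == '_' ||
            head[end] == '.' || head[end] == ':'))
        ++end;
    return head.substr(pos, end - pos);
}
```

The second run would then decode the body with whatever name this returns,
falling back to a default when the result is empty.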

If we can convert the input stream to unicode before the tokenizer,
the tokenizer and the html-parser only need minor modifications:
- All "char *" need to be converted to "QChar *". I wouldn't use QStrings
because of the memory allocation overhead.
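
To make the "decode first, then tokenize over 16-bit pointers" idea concrete,
here is a minimal sketch. It is not the khtml tokenizer: UChar (a char16_t)
stands in for QChar, decodeLatin1() and tokenize() are hypothetical names,
and only ISO-8859-1 input is handled, since each of its bytes maps directly
to the same unicode code point. Tokens are pointer ranges into the decoded
buffer, so no string is allocated per token.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for QChar in this sketch: one 16-bit code unit.
using UChar = char16_t;

// Decoder in front of the tokenizer. Assumption: the input is ISO-8859-1,
// where every byte value equals its unicode code point.
std::u16string decodeLatin1(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)
        out.push_back(static_cast<UChar>(b));
    return out;
}

// A token is a [begin, end) range into the decoded buffer.
struct Token {
    bool isTag;
    const UChar* begin;
    const UChar* end;
};

// Same pointer-walking loop one would write over "char *",
// only with the element type changed.
std::vector<Token> tokenize(const UChar* src, const UChar* end) {
    assert(src <= end);
    std::vector<Token> tokens;
    while (src < end) {
        const UChar* start = src;
        if (*src == u'<') {
            while (src < end && *src != u'>')
                ++src;
            if (src < end)
                ++src;  // consume the closing '>'
            tokens.push_back({true, start, src});
        } else {
            while (src < end && *src != u'<')
                ++src;
            tokens.push_back({false, start, src});
        }
    }
    return tokens;
}
```

The tokens are only valid while the decoded buffer is alive, which is the
price for avoiding a copy per token.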

> Like that, the tokenizer and parser would get pure unicode strings, and we
> would not have to deal with most charset related problems in the rest of
> khtml any more.
> 
> Basically, there will be two charset related problems left that will have
> to be solved in the parser/display part of khtml:
> bidirectional text, and characters that cannot be displayed in the font
> html uses.

That will be a major pain in the ass. I wouldn't touch that before the
rest of these unicode changes are working.

> Another problem is how to change the tokenizer, parser and the HtmlObjects
> to use QChars/QStrings. At first glance it seems to me that it would
> perhaps be best to do a rewrite of the tokenizer; the parser would
> need some bigger changes, and the HTMLObjects could get away with some
> minor modifications.

The tokenizer is quite new. Changing 'char' to 'QChar' should be enough.

> Since I don't want to make khtml unusable for you (it'll take a while to
> make all this work), I was thinking about putting all these changes into a
> separate branch of the CVS, and only merging them back when they are
> usable again.
> 
> What do you think?

That would be a good idea.

Cheers,
Waldo