List:       kfm-devel
Subject:    Re: using unicode in khtml
From:       Lars Knoll <Lars.Knoll () mpi-hd ! mpg ! de>
Date:       1999-05-17 9:14:56

On Mon, 17 May 1999, Waldo Bastian wrote:

> Lars Knoll wrote:
> > as you perhaps know, khtml does up to now use 8bit chars for basic
> > text encoding. I think we should move that to unicode (QChar/QString), and
> > already started working on it a bit.
> > 
> > Looking at the HTML4.0 specs and rfc2070 (Internationalization of HTML), I
> > came to the conclusion, that it would be best, to handle everything in the
> > khtml lib using QChar/QStrings, and providing some new class (before the
> > tokenizer), which would need to get some information from the http header
> > (meaning from kioslave) about the document encoding (if specified), and
> > try to determine it if not. This class will also need to be able to
> > switch the encoding in the middle of the document, in case it encounters a
> > tag with the "charset" attribute. Perhaps parsing of entities could also
> > be already done in there.
> 
> Switching in the middle of the document is very painful with the
> current design, since the tokenizer and the parser are not very tightly
> coupled.
> 
> My idea was to let the tokenizer run in two batches: 
> * The first run decodes the <header>, this contains all META-tags
> * The second run decodes the body.

Not so easy, since there is no clear separation between head and body
(according to the html doc, the body starts with the first tag not
belonging to the header). The <body> tag can be completely missing.
I decided now to make a new (and small) class dealing with the input
stream khtml gets. The class will be called from the tokenizer, and do the
transformation to unicode. This seems to work already.
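
A rough sketch of what such a class might look like. This is not the
actual khtml code: the name InputDecoder is made up, it only handles
ISO-8859-1 (an assumption to keep it short), and it uses plain C++
char16_t instead of QChar so that it stands alone:

```cpp
#include <cstddef>
#include <string>

// Hypothetical decoder that sits in front of the tokenizer.
// Assumption: the input is ISO-8859-1. A real class would pick the codec
// from the HTTP header or a META tag, and guess if neither is present.
class InputDecoder {
public:
    // Convert one chunk of raw bytes to UTF-16 code units.
    // ISO-8859-1 maps byte value N directly to code point U+00NN.
    std::u16string decode(const char *data, std::size_t len) const {
        std::u16string out;
        out.reserve(len);
        for (std::size_t i = 0; i < len; ++i) {
            // Go through unsigned char so bytes >= 0x80 are not sign-extended.
            out.push_back(static_cast<char16_t>(
                static_cast<unsigned char>(data[i])));
        }
        return out;
    }
};
```

The tokenizer would call decode() on every chunk it receives and from
then on only see unicode.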

> I assumed that within the body, the charsets wouldn't change.

Just looked up the specs once again. You are right. The charset attribute
does always refer to a linked resource and does only appear in three tags
(<a>, <link> and <script>), so we do not have to care about changing
charsets in the middle of the document (which would have been a real pain
in the ass...)

> If we can convert the input stream to unicode before the tokenizer,
> the tokenizer and the html-parser only need minor modifications:
> - All "char *" needs to be converted to "QChar *". I wouldn't use QStrings
> because of memory allocation overhead.

Tried it. But there are many (char *) specific things used (both in the
tokenizer and in the parser), which need to be changed. I'm working on
it, and I hope to get it working soon.

> > Like that, the tokenizer and parser would get pure unicode strings, and we
> > would not have to deal with most charset related problems in the rest of
> > khtml any more.
> > 
> > Basically, there will be two charset related problems left, that will have
> > to be solved in the parser/display part of khtml:
> > Bidirectional text, and undisplayable characters in the font html uses.
> 
> That will be a major pain in the ass. I wouldn't touch that before the
> rest of these unicode changes are working.

Sure. I just wanted to mention them. First of all, everything has to work
with QChar's, then we can see further.

> 
> > Another problem is how to change the tokenizer, parser and the HtmlObjects
> > to use QChars/QStrings. After a first glance it seems to me, that it would
> > perhaps be the best to do a rewrite of the tokenizer, the parser would
> > need some bigger changes, and the HTMLObjects could get away with some
> > minor modifications.
> 
> The tokenizer is quite new. Changing 'char' to 'QChar' should be enough.

Working on that at the moment. But in the long run (after I have it
working with the current tokenizer and parser...), I was thinking of
perhaps a bit different design of the tokenizer. We are already using a
hash table for the tags, and I thought of extending that to the attributes
too. This should make things a bit faster, and has the big advantage
that you can return additional information from the hash table.
One example is whether we should parse entities in this attribute or not,
another one is information about what you expect in the attribute value.
Like that, one could perhaps already do some preprocessing of the
attribute's value there.
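
To make the idea a bit more concrete, here is a minimal sketch of such a
table. All names (AttrInfo, ValueKind, lookupAttribute) and the values in
the table are made up for illustration, and std::unordered_map stands in
for whatever hash table khtml ends up using:

```cpp
#include <string>
#include <unordered_map>

// What kind of value the parser should expect (illustrative only).
enum class ValueKind { Text, Url, Number };

// Extra information returned together with the attribute id.
struct AttrInfo {
    int id;              // numeric id of the attribute
    bool parseEntities;  // should entities in the value be decoded?
    ValueKind kind;      // what the value is expected to contain
};

// Look up an attribute name. Returns nullptr for unknown attributes.
// The entries below are placeholders, not taken from the HTML spec.
const AttrInfo *lookupAttribute(const std::string &name) {
    static const std::unordered_map<std::string, AttrInfo> table = {
        {"title", {1, true, ValueKind::Text}},
        {"href", {2, false, ValueKind::Url}},
        {"width", {3, false, ValueKind::Number}},
    };
    auto it = table.find(name);
    return it == table.end() ? nullptr : &it->second;
}
```

With one lookup the tokenizer gets the id and also knows how to treat the
value, so it can preprocess it before handing it to the parser.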

Unicode has the big advantage of having a rather large area for private
use, and I wanted to use this area, for encoding all the tags and
attributes, each of them with a single unicode character. The code in the
parser would IMO profit a lot from such an approach (you could do the
parsing of the attributes with a switch statement, instead of lots of
string comparisons).
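
Roughly like this (a sketch only: the constant names and the code points
assigned to them are hypothetical, and char16_t stands in for QChar). The
private use area of the basic plane is U+E000 to U+F8FF:

```cpp
#include <string>

// Hypothetical ids, one private use code point per attribute.
constexpr char16_t ATTR_HREF = 0xE000;
constexpr char16_t ATTR_WIDTH = 0xE001;
constexpr char16_t ATTR_ALT = 0xE002;

// True if c lies in the private use area U+E000..U+F8FF.
bool isPrivateUse(char16_t c) {
    return c >= 0xE000 && c <= 0xF8FF;
}

// The parser can dispatch on the single character with a switch
// instead of comparing attribute names as strings.
const char *describeAttribute(char16_t code) {
    switch (code) {
    case ATTR_HREF:
        return "href";
    case ATTR_WIDTH:
        return "width";
    case ATTR_ALT:
        return "alt";
    default:
        return "unknown";
    }
}
```

The tokenizer would replace each known name by its code point, and the
parser would never have to look at the name again.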

> 
> > Since I don't want to make khtml unusable for you (it'll take a while to
> > make all this work), I was thinking about putting all these changes into a
> > separate branch of the CVS, and only merge them back, when they are
> > usable again.
> > 
> > What do you think?
> 
> That would be a good idea.

OK, so I'll put it into a branch (something like "PORTING_TO_QCHAR").
Whoever is interested can use it, the others can stay with the main line.
As soon as we agree it's stable enough we can merge it back into the main
branch.

Cheers,
Lars
