'Re: kdelibs/khtml (fwd)'


List:       kde-i18n
Subject:    Re: kdelibs/khtml (fwd)
From:       Lars Knoll <knoll () mpi-hd ! mpg ! de>
Date:       1999-04-06 12:19:47


----------  Forwarded Message  ----------
Subject: Re: kdelibs/khtml
Date: Tue, 06 Apr 1999 10:57:02 +0200
From: Waldo Bastian <bastian@ens.ascom.ch>


Lars Knoll wrote:
[...]
> Something else (just some thoughts): I thought a bit about some principal design
> questions after moving to qt-2.0, which we should perhaps work on in the long
> term. Qt handles everything in Unicode internally now, and I think we will at
> some point have to switch khtml to use it too. Now I was thinking yesterday
> evening a bit about how one could do that in the best way, and I thought that
> it might be best to do the conversion from the document encoding we get from
> the server to Unicode as soon as possible (meaning perhaps that the kioslave
> should support it, and konqueror should already feed Unicode into khtml). This
> would be needed, in case the server gives us HTML encoded in Unicode (not used
> for now, but this might come very soon...). Putting it into the kioslave is
> easy in case the encoding is specified in the HTTP header, and Qt has some very
> nice methods to do the actual conversion. The problem is what to do with HTML
> pages where the charset is specified in a meta tag in the document itself.

Currently we don't use QString for HTML code within khtml for reasons of
efficiency. From a design point of view it would be nice to always use
Unicode HTML inside khtml. The tokenizer is a good candidate to do the
conversion. The drawback is that it takes up twice as much memory for the
text on the average page. Luckily, most pages today don't contain that much
text, so in the end this should not be too bad.

With respect to meta tags: I would like to split up the HTML parsing,
first doing the HEAD part and then the BODY part. This is probably required
for CSS anyway, since we will need to have the stylesheet available when we
start parsing the document. It also makes it easier to handle META-tag
redirection, charsets and things like that.
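
A minimal sketch of the "HEAD first" idea, in plain C++ rather than actual
khtml code: scan only the part of the document before </head> and pull the
charset out of a META tag, so it is known before the body is tokenized. The
names toLowerAscii and charsetFromHead are hypothetical, and a real parser
would need to cope with comments, scripts and malformed tags.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not khtml code): ASCII lower-case copy, used for
// case-insensitive searching of tag and attribute names.
static std::string toLowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Phase 1 of a two-phase parse: inspect only the HEAD part and return the
// charset named in a META tag (lower-cased), or "" if there is none.
std::string charsetFromHead(const std::string &html) {
    const std::string lower = toLowerAscii(html);

    // The HEAD part ends at </head>, or at <body> if </head> is missing.
    std::size_t headEnd = lower.find("</head");
    if (headEnd == std::string::npos) headEnd = lower.find("<body");
    if (headEnd == std::string::npos) headEnd = lower.size();

    std::size_t pos = 0;
    while ((pos = lower.find("<meta", pos)) != std::string::npos &&
           pos < headEnd) {
        const std::size_t tagEnd = lower.find('>', pos);
        if (tagEnd == std::string::npos || tagEnd > headEnd) break;

        std::size_t cs = lower.find("charset=", pos);
        if (cs != std::string::npos && cs < tagEnd) {
            cs += 8;  // length of "charset="
            if (cs < tagEnd && (lower[cs] == '"' || lower[cs] == '\'')) ++cs;
            std::size_t end = cs;
            while (end < tagEnd &&
                   (std::isalnum(static_cast<unsigned char>(lower[end])) ||
                    lower[end] == '-' || lower[end] == '_')) {
                ++end;
            }
            return lower.substr(cs, end - cs);
        }
        pos = tagEnd + 1;  // no charset in this META tag, try the next one
    }
    return "";
}
```

The body part would then be tokenized in a second phase, with the charset
(and later the stylesheet) already known.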

The easiest migration path would be to move from "char *" to something like
"w_char *" inside khtml for HTML text. Once we have that, we can change the
tokenizer to convert all HTML to Unicode. Having the tokenizer actually
handle HTML documents with multi-byte encodings would be a bit more
difficult. I think the easiest would be to duplicate the tokenizer and have
separate single-byte and multi-byte tokenizers.
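
As a rough illustration of the single-byte case, assuming the input is
ISO-8859-1 (whose byte values coincide with the first 256 Unicode code
points), the conversion to a 16-bit character type is a plain widening.
WChar16 and widenLatin1 are hypothetical names, not existing khtml or Qt
types; other single-byte charsets would need a lookup table instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 16-bit character type standing in for the "w_char" idea.
typedef std::uint16_t WChar16;

// Widen a NUL-terminated ISO-8859-1 string to 16-bit Unicode code units.
// Each byte value maps to the Unicode code point with the same number.
std::vector<WChar16> widenLatin1(const char *src) {
    assert(src != nullptr);
    const std::size_t len = std::strlen(src);
    std::vector<WChar16> out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        out.push_back(static_cast<WChar16>(static_cast<unsigned char>(src[i])));
    }
    return out;
}
```

The result holds one 16-bit unit per input byte, which is where the
doubling of memory for the text comes from.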

Further, we would need some logic to find out the encoding. We can't really
trust the HTTP headers for this, but this should not pose any major problems.
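
One possible shape for that logic, as a sketch only: look at the data
itself first (a Unicode byte order mark), then at a charset declared in the
document, and only then at the HTTP header. The priority order, the
function names and the iso-8859-1 fallback are all assumptions made for
illustration, not a settled design.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: recognize a Unicode byte order mark at the start of
// the raw data. Returns "" if there is none.
std::string encodingFromBom(const std::string &data) {
    if (data.size() >= 3 && data.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return "utf-8";
    if (data.size() >= 2 && data.compare(0, 2, "\xFE\xFF") == 0)
        return "utf-16be";
    if (data.size() >= 2 && data.compare(0, 2, "\xFF\xFE") == 0)
        return "utf-16le";
    return "";
}

// Hypothetical decision function. Assumed priority: BOM, then the charset
// declared in the document (META tag), then the HTTP header, then a default.
std::string chooseEncoding(const std::string &data,
                           const std::string &metaCharset,
                           const std::string &httpCharset,
                           const std::string &fallback = "iso-8859-1") {
    const std::string bom = encodingFromBom(data);
    if (!bom.empty()) return bom;
    if (!metaCharset.empty()) return metaCharset;
    if (!httpCharset.empty()) return httpCharset;
    return fallback;
}
```

Putting the HTTP header last reflects the point that it can't really be
trusted; whether the META tag should always win over it is open.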

Please forward this to kfm-devel. I still have to resubscribe myself.

Cheers,
Waldo
