'using KHTML without display'


List:       kfm-devel
Subject:    using KHTML without display
From:       "Nom Declavier" <achats () blarg ! net>
Date:       2005-05-02 1:31:27
Message-ID: 003d01c54eb6$a9197740$0200a8c0 () Dell030306

I'd like to use KHTML to parse HTML/CSS/Javascript, and to deduce the
sizes and positions of Web page constituents when the page is rendered
by a KHTML-based browser. But I want to do this without actually
rendering to any screen, and without invoking more browser functionality
than I need. My planned application has no graphical user interface. It
brings about no display. It's all about trees whose nodes may be
annotated with size and position information. I expect to call getRect()
frequently.
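
To make the target concrete, here is a minimal sketch of the kind of tree I have in mind. It assumes nothing from KHTML: Rect, LayoutNode, and countAnnotated are hypothetical names made up for illustration, and Rect only stands in for whatever rectangle getRect() would hand back.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the rectangle returned by getRect().
struct Rect {
    int x, y, width, height;
    Rect() : x(0), y(0), width(0), height(0) {}
    Rect(int x_, int y_, int w_, int h_) : x(x_), y(y_), width(w_), height(h_) {}
    // A rectangle with no area carries no useful geometry.
    bool isEmpty() const { return width <= 0 || height <= 0; }
};

// Hypothetical tree node: a name, optional geometry, and owned children.
struct LayoutNode {
    std::string name;
    Rect rect;
    bool hasRect;
    std::vector<std::unique_ptr<LayoutNode> > children;

    explicit LayoutNode(std::string n) : name(std::move(n)), hasRect(false) {}

    // Append a child and return a reference to it. The reference stays
    // valid when the vector grows, because the node lives on the heap.
    LayoutNode& addChild(std::string childName) {
        children.push_back(
            std::unique_ptr<LayoutNode>(new LayoutNode(std::move(childName))));
        return *children.back();
    }

    // Record size and position for this node.
    void annotate(const Rect& r) { rect = r; hasRect = true; }
};

// Count the nodes in the subtree that carry non-empty geometry.
inline int countAnnotated(const LayoutNode& node) {
    int n = (node.hasRect && !node.rect.isEmpty()) ? 1 : 0;
    for (std::size_t i = 0; i < node.children.size(); ++i)
        n += countAnnotated(*node.children[i]);
    return n;
}
```

In the real application, the annotate() calls would be fed from getRect() on the corresponding DOM nodes, which is exactly the part that doesn't work for me yet.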

So what I really need is DOM::Document and so on. I'll use KApplication,
KHTMLPart, KHTMLView, and so on, only as I need them to invoke the
functionality of DOM, CSS, and KJS classes.

I'm aware of two ways to get a DOM::HTMLDocument from an HTML file, without
getting into windows and widgets.

Technique 1 looks like this:

DOM::HTMLDocument doc;
doc.setAsync(false);
doc.load(url);

Technique 1 has two serious problems. First, when getRect() is called on
nodes, it produces no useful information. Second, when the program
exits, the destructors crash, whether they are invoked automatically or
I invoke them myself.

Technique 2 looks like this, where inputHTMLQString is a QString read
from the HTML file (how it is read doesn't matter):

KHTMLPart * pPart = new KHTMLPart();
pPart->begin();
pPart->write(inputHTMLQString);
pPart->end();
DOM::Document doc = pPart->document();

Technique 2 has two serious problems. First, when getRect() is called on
nodes, it produces no useful information, so there's nothing to choose
between Technique 1 and Technique 2 here. Second, although Technique 2
leads to graceful destruction, it brings along by default a very fussy
version of the HTML parser, which wreaks havoc with scripts, among other
constituents. I can get around the fussy parser, but the way I've done
it so far isn't pretty.

If I want to have all of the following:

calls to getRect() produce useful results
tolerant parser
effective destruction

and I want to have them, to the extent possible, without windows
and widgets, I'm guessing the best technique isn't either of the ones
I've tried. Best aside, what's a good technique?




