
List:       pykde
Subject:    [PyQt] Get all web info including info generated by AJAX using QtWebKit
From:       flyer <flyer103 () gmail ! com>
Date:       2012-10-16 3:11:43
Message-ID: CAKLbBG9q42Jr0gDDrc4RVzBunS4ZQF3Oeq0f9pyf9BipDt_B6Q () mail ! gmail ! com

I wrote a Python script that uses QtWebKit to fetch the full content of a web
page, including content generated by AJAX requests. I run it on a CentOS
server with the following setup:


    $ Xvfb :100 -screen 0 9000x9000x24 &
    $ export DISPLAY=:100


The code below works, but it only captures *one screen* of the web page: the
amount of content I get depends on the screen resolution, so I only ever get
part of the page.

I have also tried *selenium*, and with it I can get the whole page if I set a
large screen resolution using *Xvfb*.

Please give me some tips on how to solve this problem. Pointers to any manual
for *QtWebKit* would also be appreciated, because I can't find much material
about it.

Also, the code below cannot exit automatically after the work is done, and I
can't find where the bug is. Every time, I have to use the *kill* command to
terminate the script.

Thanks anyway.

The following is my code:

#!/usr/bin/env python
#coding: utf-8

import sys

from PyQt4.QtCore import QUrl, SIGNAL, QSize
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage, QWebView


class WebPage(QWebPage):

    def javaScriptConsoleMessage(self, message, lineNumber, sourceID):
        sys.stderr.write('Javascritp error at line number %d\n' % (lineNumber))
        sys.stderr.write('%s\n' % (message, ))
        sys.stderr.write('Source ID: %s\n' % (sourceID, ))


class Crawler(QApplication):

    def __init__(self, url):
        super(Crawler, self).__init__(sys.argv)

        self.url = url
        self.web_view = QWebView()
        self.web_page = WebPage()
        self.web_view.setPage(self.web_page)
        self.web_frame = self.web_page.currentFrame()

        self.qsize = QSize()
        self.qsize.setHeight(9000)
        self.qsize.setWidth(9000)

        # self.settings.setAttribute(QWebSettings.AutoLoadImages, False)
        # self.setttings.setAttribute(QWebSettings.PluginsEnabled, False)

        # self.setMaximumSize(10000, 10000)

        # self.web_page.setViewportSize(self.web_page.mainFrame().contentsSize())

        print 'Before connecting'
        self.connect(self.web_view, SIGNAL('loadFinished(bool)'), self.loadFinished)
        print 'After connecting'

        print 'Before loading'
        self.web_frame.load(QUrl(self.url))
        print 'After loading'

    def loadFinished(self, ok):
        print 'In callback, before writing'
        with open('jd.txt', 'ab+') as fp:
            fp.write(self.web_page.currentFrame().toHtml().toUtf8())
        print 'In callback, after writing'


if __name__ == '__main__':
    url = 'http://www.360buy.com/product/729487.html'
    crawler = Crawler(url)
    sys.exit(crawler.exec_())
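For reference, here is a rough sketch of one possible way to address both
issues. It is not a verified fix. It assumes (1) the script never exits
because nothing in the script above stops the Qt event loop after the page
has been written, and (2) the page only loads more content for the area inside
the viewport, so enlarging the viewport to the document size and waiting a
little may help. The function name save_full_page, the settle_ms delay and its
default of 3000 ms are made up for illustration, and the sketch assumes
Python 3, where toHtml() returns a str.

```python
import sys


def save_full_page(url, out_path, settle_ms=3000, argv=None):
    """Load url, enlarge the viewport to the whole page, save the HTML, quit."""
    # Imported lazily so this module can be imported on machines without PyQt4.
    from PyQt4.QtCore import QTimer, QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    app = QApplication(argv if argv is not None else sys.argv)
    page = QWebPage()
    frame = page.mainFrame()

    def dump_and_quit():
        # Assumption: Python 3, so toHtml() is a str and can be written as text.
        with open(out_path, 'w', encoding='utf-8') as fp:
            fp.write(frame.toHtml())
        # Stop the event loop; without this call exec_() never returns.
        app.quit()

    def on_load_finished(ok):
        if not ok:
            sys.stderr.write('Load failed: %s\n' % url)
            app.exit(1)
            return
        # Make the viewport as large as the laid-out document, so the page is
        # no longer limited to one screen.
        page.setViewportSize(frame.contentsSize())
        # Give scripts that react to the larger viewport some time to finish.
        # The delay is a guess, not a guarantee that all AJAX calls are done.
        QTimer.singleShot(settle_ms, dump_and_quit)

    page.loadFinished.connect(on_load_finished)
    frame.load(QUrl(url))
    return app.exec_()
```

With Xvfb running and DISPLAY set as described earlier, it could be called as
sys.exit(save_full_page('http://www.360buy.com/product/729487.html', 'jd.txt')).
A fixed delay is the simplest choice here. Pages that keep loading data for
longer would need a larger settle_ms or a different completion check.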
-- 
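Unmoved by honor or disgrace, I idly watch the flowers bloom and fall in the courtyard; indifferent to staying or leaving, I drift with the clouds that roll and unroll at the edge of the sky.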




_______________________________________________
PyQt mailing list    PyQt@riverbankcomputing.com
http://www.riverbankcomputing.com/mailman/listinfo/pyqt
