List: pykde
Subject: [PyQt] Get all web info including info generated by AJAX using QtWebKit
From: flyer <flyer103 () gmail ! com>
Date: 2012-10-16 3:11:43
Message-ID: CAKLbBG9q42Jr0gDDrc4RVzBunS4ZQF3Oeq0f9pyf9BipDt_B6Q () mail ! gmail ! com
I wrote a Python script that uses QtWebKit to fetch the full content of a
page, including content generated by AJAX requests. I run the code below on
a CentOS server with the following setup:
$ Xvfb :100 -screen 0 9000x9000x24 &
$ export DISPLAY=:100
The code works, but it only captures *one screen* of the web page: the
amount of content I get depends on the screen resolution, so I only get
part of the page.

With *selenium* I can get the whole page if I set a large screen
resolution using *Xvfb*.

Please give me some tips on how to solve this problem. Pointers to any
manual for *QtWebKit* would also be appreciated, because I can't find much
material about it.

Also, the code below cannot exit automatically after finishing its work,
and I can't find the bug. Every time, I have to use the *kill* command to
terminate the script.

Thanks anyway.
The following is my code:
#!/usr/bin/env python
# coding: utf-8

import sys

from PyQt4.QtCore import QUrl, SIGNAL, QSize
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage, QWebView


class WebPage(QWebPage):

    def javaScriptConsoleMessage(self, message, lineNumber, sourceID):
        sys.stderr.write('Javascript error at line number %d\n' % (lineNumber))
        sys.stderr.write('%s\n' % (message, ))
        sys.stderr.write('Source ID: %s\n' % (sourceID, ))


class Crawler(QApplication):

    def __init__(self, url):
        super(Crawler, self).__init__(sys.argv)

        self.url = url
        self.web_view = QWebView()
        self.web_page = WebPage()
        self.web_view.setPage(self.web_page)
        self.web_frame = self.web_page.currentFrame()

        self.qsize = QSize()
        self.qsize.setHeight(9000)
        self.qsize.setWidth(9000)

        # self.settings.setAttribute(QWebSettings.AutoLoadImages, False)
        # self.settings.setAttribute(QWebSettings.PluginsEnabled, False)

        # self.setMaximumSize(10000, 10000)

        # self.web_page.setViewportSize(self.web_page.mainFrame().contentsSize())

        print 'Before connecting'
        self.connect(self.web_view, SIGNAL('loadFinished(bool)'),
                     self.loadFinished)
        print 'After connecting'

        print 'Before loading'
        self.web_frame.load(QUrl(self.url))
        print 'After loading'

    def loadFinished(self, ok):
        print 'In callback, before writing'
        with open('jd.txt', 'ab+') as fp:
            fp.write(self.web_page.currentFrame().toHtml().toUtf8())
        print 'In callback, after writing'


if __name__ == '__main__':
    url = 'http://www.360buy.com/product/729487.html'
    crawler = Crawler(url)
    sys.exit(crawler.exec_())

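For reference, here is a minimal sketch of one direction that might address
both symptoms. It has not been verified against this particular page. The
script never exits because nothing ever calls quit(), so exec_() keeps the
event loop running; the sketch quits explicitly after writing the file. It
also grows the viewport to the document size once loading finishes, which
may help if the page only fills in what is visible. The names main,
to_utf8_bytes, out_path and settle_ms are made up for illustration, and the
fixed settle_ms delay is only an assumption about how long the AJAX
requests need.

```python
import sys


def to_utf8_bytes(html):
    """Return page HTML as UTF-8 bytes for either PyQt4 string API."""
    if hasattr(html, 'toUtf8'):
        # QString (PyQt4 API v1, the default under Python 2)
        return html.toUtf8().data()
    # Plain unicode/str (PyQt4 API v2)
    return html.encode('utf-8')


def main(url, out_path='jd.txt', settle_ms=3000):
    # Imported here so the helper above stays usable without PyQt4.
    from PyQt4.QtCore import QUrl, QTimer
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    app = QApplication(sys.argv)
    page = QWebPage()
    frame = page.mainFrame()

    def dump_and_quit():
        with open(out_path, 'wb') as fp:
            fp.write(to_utf8_bytes(frame.toHtml()))
        # Leave exec_(); without this the event loop never returns.
        app.quit()

    def on_load_finished(ok):
        # Grow the viewport to the whole document instead of one screen.
        page.setViewportSize(frame.contentsSize())
        # Assumption: a fixed delay is long enough for pending AJAX calls.
        QTimer.singleShot(settle_ms, dump_and_quit)

    page.loadFinished.connect(on_load_finished)
    frame.load(QUrl(url))
    return app.exec_()
```

It would be started with sys.exit(main('http://www.360buy.com/product/729487.html')).
Unlike the original, it overwrites jd.txt rather than appending to it.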
--
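Unmoved by honor or disgrace, idly watch the flowers bloom and fall in the courtyard; indifferent to staying or leaving, drift with the clouds as they curl and unfurl at the horizon.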
_______________________________________________
PyQt mailing list PyQt@riverbankcomputing.com
http://www.riverbankcomputing.com/mailman/listinfo/pyqt