List: wwwsearch-general
Subject: Re: [wwwsearch-general] Question on OS web spider like mechanize
From: John J Lee <jjl () pobox ! com>
Date: 2010-09-04 20:17:44
Message-ID: alpine.DEB.2.00.1009042013520.4057 () alice ! cable ! virginmedia ! net
On Thu, 2 Sep 2010, Chris Nizzardini wrote:
> I will be creating an application to perform some specialized web crawling
> for a client and would like to use one of the various open source spiders.
> My primary language is PHP, but I have some limited experience with Python.
> My question is which would be faster (operation wise) and easiest to
> extend/modify: a python tool (such as mechanize) or a php tool (such as
> phpdig). Please NO ZEALOTRY in the python versus php fashion. I don't care
> which language you believe is better. I am only looking for the right tool
> for the right job in this case. Any respectful insight on a solution is
> appreciated.
When you say faster operation-wise, I assume you mean throughput rather
than latency? What order of magnitude of pages / bytes is involved, and
how often will you download them?
mechanize has had little attention paid to its performance. It hasn't
been a big issue for me: IIRC the largest scrape I have done used an
earlier form of the code to fetch about 1E7 records. If you're prepared
to open many simultaneous connections to a server, it's likely to be
fairly easy to scale up throughput using multiple processes (but be
careful not to overload servers).
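As a rough sketch of that approach: the throttling helper below enforces a
per-host minimum delay while a process pool does the fetching. Everything here
(function names, the delay constant, the example URLs) is illustrative, not part
of mechanize; in real use the injected `fetch` callable would wrap a mechanize
Browser.

```python
import time
from multiprocessing import Pool
from urllib.parse import urlsplit

MIN_DELAY = 1.0  # illustrative: seconds between requests to the same host
_last_hit = {}   # host -> time of last request *from this process*

def polite_fetch(url, fetch=None):
    """Fetch url, sleeping first if we hit the same host too recently.

    `fetch` is injected so the throttling logic works without the network;
    in real use it might be e.g. lambda u: mechanize.Browser().open(u).read().
    """
    host = urlsplit(url).hostname
    wait = _last_hit.get(host, 0) + MIN_DELAY - time.monotonic()
    if wait > 0:
        time.sleep(wait)
    _last_hit[host] = time.monotonic()
    return fetch(url) if fetch else None

if __name__ == "__main__":
    # Two worker processes sharing a batch of (hypothetical) URLs; each
    # process throttles its own requests independently.
    urls = ["http://example.com/page%d" % i for i in range(4)]
    with Pool(2) as pool:
        pool.map(polite_fetch, urls)
```

Note the per-host state is per-process here; a stricter crawler would share the
throttle (e.g. via a manager process) so the combined request rate stays bounded.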
One obvious issue if you use a single process to download many web pages
using mechanize is that you probably want to use a different history
object (e.g. a null implementation), so that page data is not kept around
forever. Another is that no caching is done -- though obviously a caching
proxy can be used for that.
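A null history might look like the sketch below -- assuming mechanize's
Browser(history=...) hook and the add/back/clear/close method names of its
default History class (check your mechanize version's source before relying
on this):

```python
class NullHistory:
    """History implementation that discards pages instead of keeping them."""

    def add(self, request, response):
        pass  # drop the page rather than storing it forever

    def back(self, n, _response):
        # We keep nothing, so going back is impossible.
        raise RuntimeError("no history kept")

    def clear(self):
        pass

    def close(self):
        pass

# Usage (requires mechanize installed):
#   import mechanize
#   br = mechanize.Browser(history=NullHistory())
```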
My general suggestion would be to decide what performance you need, pick a
shortlist of tools based on other requirements, then do a quick test to
see if performance is adequate.
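For the quick test, something as simple as a timing harness that takes each
candidate tool's fetch callable is usually enough -- this is a rough adequacy
check, not a benchmark, and the function name is just illustrative:

```python
import time

def measure_throughput(fetch, urls):
    """Time fetching `urls` with `fetch` and return pages per second."""
    start = time.perf_counter()
    for url in urls:
        fetch(url)
    elapsed = time.perf_counter() - start
    return len(urls) / elapsed if elapsed > 0 else float("inf")

# e.g. measure_throughput(mechanize.Browser().open, sample_urls)
# compared against the same call through phpdig's fetcher.
```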
In what ways do you anticipate having to modify the tool you pick?
If JavaScript is involved in a non-trivial way, you might want to use
HtmlUnit (a Java library that can be used from Jython).
I don't know anything about phpdig, except that I just googled it and see
it's a search tool. mechanize does not provide any search functionality.
FWIW, if you are implementing search features, I was impressed with the
low barrier to entry with Solr -- really extremely easy to get a quick
full-text search going and integrated with data retrieval and the rest of
your system (no Java knowledge required).
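To give a flavour of how low that barrier is: once Solr is running, indexing a
crawled page is just an HTTP POST of JSON to its update handler. The core name
("crawl") and the `text_t` dynamic field below are assumptions about your Solr
setup, not part of mechanize or Solr defaults you can rely on unchecked:

```python
import json
from urllib.request import Request, urlopen

# Assumed local Solr core named "crawl"; adjust to your install.
SOLR_URL = "http://localhost:8983/solr/crawl/update?commit=true"

def build_update(docs):
    """Serialise crawled pages into a Solr JSON update body.

    Each doc is {"url": ..., "text": ...}; we map url -> id and
    text -> a text_t field (a conventional *_t dynamic text field).
    """
    return json.dumps([{"id": d["url"], "text_t": d["text"]} for d in docs])

def index_pages(docs, solr_url=SOLR_URL):
    """POST the pages to Solr (network call; needs a running Solr)."""
    req = Request(solr_url, data=build_update(docs).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    return urlopen(req)
```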
HTH
John