List:       wwwsearch-general
Subject:    Re: [wwwsearch-general] Question on OS web spider like mechanize
From:       John J Lee <jjl@pobox.com>
Date:       2010-09-04 20:17:44
Message-ID: alpine.DEB.2.00.1009042013520.4057@alice.cable.virginmedia.net

On Thu, 2 Sep 2010, Chris Nizzardini wrote:

> I will be creating an application to perform some specialized web crawling
> for a client and would like to use one of the various open source spiders.
> My primary language is PHP, but I have some limited experience with Python.
> My question is which would be faster (operation wise) and easiest to
> extend/modify: a python tool (such as mechanize) or a php tool (such as
> phpdig).  Please NO ZEALOTRY in the python versus php fashion.  I don't care
> which language you believe is better.  I am only looking for the right tool
> for the right job in this case.  Any respectful insight on a solution is
> appreciated.

When you say faster operation-wise, I assume you mean throughput rather 
than latency?  Roughly what order of magnitude of pages / bytes is 
involved, and how often will you download pages?

mechanize has had little attention paid to its performance.  It hasn't 
been a big issue for me: IIRC the largest scrape I have done used an 
earlier form of the code to fetch about 1E7 records.  If you're prepared 
to open many connections to a server simultaneously, it's likely to be 
fairly easy to scale up throughput using multiple processes (but be 
careful not to overload servers).
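
FWIW, a rough sketch of that multi-process approach (untested, with a 
made-up URL list and an arbitrary worker count; add per-host rate 
limiting if politeness matters):

    import multiprocessing

    import mechanize

    def fetch(url):
        # a fresh Browser per fetch keeps the sketch simple; reuse one
        # per worker if connection overhead turns out to matter
        br = mechanize.Browser()
        br.set_handle_robots(True)
        try:
            return url, br.open(url).read()
        except Exception:
            return url, None

    if __name__ == "__main__":
        urls = ["http://www.example.com/page%d" % i for i in range(100)]
        pool = multiprocessing.Pool(processes=8)
        for url, data in pool.imap_unordered(fetch, urls):
            pass  # hand data off to whatever does the parsing / storage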

One obvious issue if you use a single process to download many web pages 
with mechanize is that, by default, page data is kept in the browser 
history forever, so you probably want to use a different history object 
(e.g. a null implementation).  Another is that no caching is done -- 
though obviously a caching proxy can be used for that.
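
By "null implementation" I mean something roughly like the following 
(untested sketch -- check which History methods the Browser actually 
calls before relying on it):

    import mechanize

    class NoHistory(object):
        # discard pages instead of keeping them around forever
        def add(self, request, response):
            pass
        def clear(self):
            pass
        def close(self):
            pass
        def back(self, n, _response):
            raise mechanize.BrowserStateError("history disabled")

    br = mechanize.Browser(history=NoHistory())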

My general suggestion would be to decide what performance you need, pick a 
shortlist of tools based on other requirements, then do a quick test to 
see if performance is adequate.
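
As a strawman for the quick test, something like this against a small, 
representative sample of URLs gives you pages/s and KB/s (the URL list 
here is a placeholder):

    import time

    import mechanize

    urls = ["http://www.example.com/"] * 20  # substitute real sample URLs
    br = mechanize.Browser()
    start = time.time()
    nbytes = sum(len(br.open(url).read()) for url in urls)
    elapsed = time.time() - start
    print "%.2f pages/s, %.1f KB/s" % (len(urls) / elapsed,
                                       nbytes / 1024.0 / elapsed)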

In what ways do you anticipate having to modify the tool you pick?

If JavaScript is involved in a non-trivial way, you might want to use 
HtmlUnit (a Java library that can be used from Jython).
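
Under Jython that looks roughly like this (sketch only, with the HtmlUnit 
jars on the classpath; class and method names are HtmlUnit's, from 
memory):

    from com.gargoylesoftware.htmlunit import WebClient

    client = WebClient()
    try:
        # getPage runs the page's JavaScript before returning
        page = client.getPage("http://www.example.com/")
        print page.asXml()  # the DOM after scripts have run
    finally:
        client.closeAllWindows()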

I don't know anything about phpdig, except that I just googled it and see 
it's a search tool.  mechanize does not provide any search functionality. 
FWIW, if you are implementing search features, I was impressed with the 
low barrier to entry of Solr -- it's extremely easy to get quick 
full-text search going and integrate it with data retrieval and the rest 
of your system (no Java knowledge required).
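
To give a flavour: with a stock Solr install listening on localhost you 
can index and query over plain HTTP (sketch only; the update handler 
path and field names depend on your Solr version and schema):

    import json
    import urllib2

    # index a crawled page
    doc = json.dumps([{"id": "http://www.example.com/",
                       "text": "page contents here"}])
    req = urllib2.Request("http://localhost:8983/solr/update/json?commit=true",
                          doc, {"Content-Type": "application/json"})
    urllib2.urlopen(req).read()

    # full-text query, JSON results back
    print urllib2.urlopen(
        "http://localhost:8983/solr/select?q=text:contents&wt=json").read()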

HTH


John

