List:       fedora-list
Subject:    Re: offtopic questions??
From:       Cameron Simpson <cs@cskk.id.au>
Date:       2018-09-23 0:12:25
Message-ID: 20180923001225.GA73087@cskk.homeip.net

On 22Sep2018 19:10, bruce <badouglas@gmail.com> wrote:
>My questions would probably be how to speed up something, or how to
>possibly redo/re-architect part of the crawl process.

Well, it is technically not Fedora-specific, but this place seems to be the most 
active shell-related list I'm on, so "how do I improve this web-crawling shell 
script?" might be OK. Disclaimer: I'm not a list admin.

>As an example, I have a situation where I use cheap cloud vms
>(digitalocean) to perform the fetches. The fetches are basic "curl"
>with the required attributes. The curl also includes a "Cookie" for
>the curl/fetch for the target server. When running from the given ip
>address of the vm the target "blocks" the fetch. (I guess someone else
>could have tried to fetch a bunch earlier -- who knows). So, I use an
>anonymous proxy-server ip to then generate the fetch. This process
>works, but it's slow. So, the process runs a number of these in
>parallel at the same time on the cheap droplet. While this speeds
>things up, still "slow"... I've also tested running curl with multi
>"http" urls in the same curl.

Multiple URLs given to a single wget or curl invocation are fetched in series, 
not in parallel.
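
For example, one curl run with several URLs (the proxy address and cookie 
value here are placeholders) downloads them one after another:

  curl -s --proxy http://proxy.example:3128 --cookie 'session=VALUE' \
    -O https://example.com/page1 -O https://example.com/page2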

You can get quite aggressive with parallelism in the shell, but it isn't the 
best tool for fine-grained control of lots of subprocesses, because you can't 
trivially "wait for a _single_ one of my subprocesses to complete". That makes 
the obvious pattern, "read URLs and dispatch background curls up to some limit, 
then wait for one to complete before kicking off the next", awkward. (You can 
wait for a specific pid, but that's no help when you don't know which fetch 
will complete first.)
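
That said, if your shell is bash 4.3 or later, its "wait -n" builtin waits for 
any one background job to finish, which makes that throttled dispatch loop 
workable. A minimal sketch, with urls.txt and the bare curl options as 
placeholders:

  # Requires bash 4.3+ for "wait -n".
  maxbg=16    # concurrency limit
  nbg=0
  while read -r url
  do
    curl -s -O "$url" &
    nbg=$(( nbg + 1 ))
    if [ "$nbg" -ge "$maxbg" ]
    then
      wait -n               # block until any single fetch exits
      nbg=$(( nbg - 1 ))
    fi
  done <urls.txt
  wait                      # collect the stragglers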

You can also do things like firing each fetch off in its own subshell which 
writes a line to a file or pipe on completion, then track completions by 
monitoring that log.
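
A minimal sketch of that idea using a named pipe as the completion log 
(filenames are illustrative; holding a FIFO open read/write like this is well 
defined on Linux, less portable elsewhere):

  mkfifo done.fifo
  exec 4<>done.fifo     # keep the FIFO open so writers never block on open
  maxbg=16
  nbg=0
  while read -r url
  do
    if [ "$nbg" -ge "$maxbg" ]
    then
      read -r finished <&4        # one completion line frees one slot
      nbg=$(( nbg - 1 ))
    fi
    ( curl -s -O "$url"; echo "$url" >&4 ) &
    nbg=$(( nbg + 1 ))
  done <urls.txt
  wait                            # let the final fetches finish
  exec 4<&-
  rm done.fifo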

A burstier alternative looks like this:

  ( # subshell to ensure no unexpected children
    maxbg=128   # pick a number
    nbg=0
    while read -r url <&3
    do
      curl ... "$url" &   # "..." being whatever options you need
      nbg=$(( nbg + 1 ))
      # At the limit, wait for the whole burst, then start a fresh one.
      [ "$nbg" -lt "$maxbg" ] || { wait; nbg=0; }
    done 3<urls.txt
    wait
  )

which fires off bursts of 128 curls, waits for the whole burst to complete, 
then runs the next burst, and so on.

If you want finer control, maybe move to Python using the requests library to 
do fetches and threads for the parallelism. But then you need to learn Python 
(highly recommended anyway, but a further hurdle to your initial task).

Cheers,
Cameron Simpson <cs@cskk.id.au>
_______________________________________________
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
