List:       fedora-list
Subject:    Re: downloading a complete web page without using a browser...
From:       "Stephen J. Turnbull" <stephen () xemacs ! org>
Date:       2021-07-06 5:31:29
Message-ID: 24803.60081.862405.900339 () turnbull ! sk ! tsukuba ! ac ! jp

Samuel Sieb writes:
 > On 2021-07-03 8:02 p.m., dwoody5654@gmail.com wrote:
 > > the url I am trying to download does not have an extension ie. no
 > > '.htm' such as:
 > > https://my.acbl.org/club-results/details/338288

The extension doesn't matter to any of the utilities mentioned as far
as I know.  I'm pretty sure they get the MIME type from the HTTP
Content-Type header.
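
A quick way to check that (untested against this particular server, so
take it as a sketch) is to ask for the headers directly:

    # HEAD request; print the Content-Type the server reports for the page
    curl -sI https://my.acbl.org/club-results/details/338288 | grep -i '^content-type:'

Some servers answer HEAD differently from GET, so treat the result as
a hint rather than proof.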

 > > wget does not download the correct web page.
 > 
 > I tried it and it worked, sort of.  The problem is that you want to 
 > download everything to view it offline, but the site my.acbl.org has a 
 > robots.txt that says "no robots allowed".  So wget respects that and 
 > will not download any required files from that site other than the 
 > initial page.  curl probably has the same issue.

1.  The page does not have content represented in HTML AFAICT: it's a
    blob which is parsed and formatted by a battery of (java)scripts,
    some of which are resources on the Internet, and some are inline.
    In other words, the HTML in that file is used as a container format
    to transport the scripts to the browser.
    Neither wget nor curl supports JavaScript at all as far as I know.

2.  96% of the page is in two blobs; AFAICT there were no IMG or other
    elements that specify requirements by URL.  If so, that would
    explain why only the top page was downloaded.

3.  curl does not document how it handles robots.txt.  Since, as far
    as I can tell, curl has no recursive or get-requisites option, it
    probably doesn't consult robots.txt at all.
    wget documents that wget -r (recursive downloads) respects
    robots.txt.  It does not document that wget -p (get page
    requisites, too) respects robots.txt, but a quick test suggests
    that it does.  I think this is a bug: any interactive program that
    supports non-text media will download the required resources when
    it fetches the HTML file.  (If someone agrees and wants to do
    something about it, this is a wget bug, not a Fedora bug.)
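
If the robots.txt handling is all that's in the way, wget can be told
to ignore it from the command line.  Untested against this site, so a
sketch only:

    # -e passes a .wgetrc command; robots=off makes wget ignore robots.txt
    # -p fetches page requisites, -k rewrites links for offline viewing
    wget -e robots=off -p -k https://my.acbl.org/club-results/details/338288

That only helps with requisites that are actually referenced by URL in
the HTML, though; it does nothing about the script-generated content
described in point 1.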

I don't have an alternative fetch tool to suggest, unfortunately.  I
think that you need to use a graphical browser somehow, or write a
script in your favorite P-language.
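
One low-effort variant of the "graphical browser" route (assuming a
Chrome or Chromium binary is installed; the exact command name varies
by distro) is to let a headless browser run the scripts and dump the
rendered DOM:

    # run headless, execute the page's scripts, write the resulting DOM to a file
    chromium --headless --dump-dom \
        'https://my.acbl.org/club-results/details/338288' > details-338288.html

That gets you the post-script HTML, but not a self-contained offline
copy; for that you'd still be writing a script around one of the
browser-automation tools.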

Steve
_______________________________________________
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
