List: fedora-list
Subject: Re: downloading a complete web page without using a browser...
From: D&R <dwoody5654 () gmail ! com>
Date: 2021-07-06 15:05:42
Message-ID: 20210706100542.45ffe6f0 () star1 ! home ! com
On Tue, 6 Jul 2021 14:31:29 +0900
stephen@xemacs.org wrote:
> Samuel Sieb writes:
> > On 2021-07-03 8:02 p.m., dwoody5654@gmail.com wrote:
> > > the url I am trying to download does not have an extension ie. no
> > > '.htm' such as:
> > > https://my.acbl.org/club-results/details/338288
>
> The extension doesn't matter to any of the utilities mentioned as far
> as I know. I'm pretty sure they get the MIME type from the HTTP
> Content-Type header.
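That should be easy to confirm by looking at the response headers. A small sketch, where content_type_of is just a helper name I made up; it reads HTTP response headers on stdin and prints the Content-Type value. Some servers answer a HEAD request differently from a GET, so treat the result as a hint only.

```shell
# Sketch only: content_type_of is a made-up helper name.
# Reads HTTP response headers on stdin and prints the first Content-Type value.
content_type_of() {
    # Strip the CR of each CRLF line ending, then match the header name
    # case-insensitively and drop everything up to and including "name: ".
    tr -d '\r' |
        awk 'tolower($0) ~ /^content-type:/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}
```

Usage would be something like: curl -sI https://my.acbl.org/club-results/details/338288 | content_type_of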
>
> > > wget does not download the correct web page.
> >
> > I tried it and it worked, sort of. The problem is that you want to
> > download everything to view it offline, but the site my.acbl.org has a
> > robots.txt that says "no robots allowed". So wget respects that and
> > will not download any required files from that site other than the
> > initial page. curl probably has the same issue.
>
> 1. The page does not have content represented in HTML AFAICT: it's a
> blob which is parsed and formatted by a battery of (java)scripts,
> some of which are resources on the Internet, and some are inline.
> In other words, the HTML in that file is used as a container format
> to transport the scripts to the browser.
> Neither wget nor curl support Javascript at all as far as I know.
>
> 2. 96% of the page is in two blobs; AFAICT there were no IMG or other
> elements that specify requirements by URL. If so, that would
> explain why only the top page was downloaded.
>
> 3. curl does not document how it handles robots.txt. Since as far as
> I can tell curl has no recursive or get-requirements option, it
> probably doesn't handle it at all.
> wget documents that wget -r (recursive downloads) respects
> robots.txt. It does not document that wget -p (get page
> requisites, too) respects robots.txt, but a quick test suggests
> that it does. I think this is a bug: any interactive program that
> supports non-text media will download required resources with the
> access to the HTML file. (If someone agrees and wants to do
> something about it, this is a wget bug, not a Fedora bug.)
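For the robots.txt part, wget can be told to ignore it for a single run with "-e robots=off". Below is a sketch of how I understand that would look; fetch_with_requisites and the default output directory are names I made up. It would only help with files that the HTML refers to by URL, so it does not address point 1 (content built by JavaScript).

```shell
# Sketch only: fetch_with_requisites and ./saved-page are made-up names.
fetch_with_requisites() {
    url=$1
    outdir=${2:-./saved-page}
    # -e robots=off : apply the wgetrc setting "robots = off" for this run
    # -p            : also fetch page requisites (CSS, images, scripts)
    # -H            : allow requisites hosted on other sites
    # -k            : convert links so the saved copy works offline
    # -E            : add .html to HTML files saved without that suffix
    # -P            : directory to save everything under
    wget -e robots=off -p -H -k -E -P "$outdir" "$url"
}
```

Called as: fetch_with_requisites https://my.acbl.org/club-results/details/338288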
>
> I don't have an alternative fetch tool to suggest, unfortunately. I
> think that you need to use a graphical browser somehow, or write a
> script in your favorite P-language.
>
> Steve
Thanks for the info.
I have been using a script called save-page-as.sh that runs Firefox. I have
set Firefox's save dialog to use 'Web Page, complete'. The save-page-as.sh
script sends Ctrl+S to Firefox and saves the page. It works perfectly when
run from the command line. I have also tried to trigger save-page-as.sh by
sending an email to my computer (using procmailrc), but in that case Firefox
does not start for some reason. From what I found by searching, Firefox can
be run from a cron script by exporting DISPLAY, and I would think running a
script from an email is similar to running it from cron. No luck, however.
env shows DISPLAY=:0.0. I have tried several variations:
export DISPLAY=:0
export DISPLAY=:0.0
export DISPLAY=:0.1
with no luck.
Perhaps there is another setting that needs to be included as well.
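My guess is that DISPLAY alone is not enough: an X client normally also needs the authority cookie (XAUTHORITY), and a desktop session usually has XDG_RUNTIME_DIR and DBUS_SESSION_BUS_ADDRESS set as well. Here is a sketch of a wrapper I could try, assuming the script runs as the same user who is logged in to the desktop. The names session_env and run_in_session, the log path, and the fallback values are all my own assumptions; the real values should be copied from env in a terminal where the script works.

```shell
# Sketch only: session_env and run_in_session are made-up helper names.
# Assumes the caller is the same user who owns the running desktop session.
session_env() {
    uid=$(id -u)
    # Keep any value that is already set; otherwise use a guessed default.
    export DISPLAY="${DISPLAY:-:0}"
    # Assumption: the X authority cookie is in ~/.Xauthority. On many
    # desktops it lives elsewhere, so copy the real value from a terminal.
    export XAUTHORITY="${XAUTHORITY:-$HOME/.Xauthority}"
    export XDG_RUNTIME_DIR="${XDG_RUNTIME_DIR:-/run/user/$uid}"
    export DBUS_SESSION_BUS_ADDRESS="${DBUS_SESSION_BUS_ADDRESS:-unix:path=$XDG_RUNTIME_DIR/bus}"
}

run_in_session() {
    session_env
    # Record what the child will see, for comparison with env in a terminal.
    env | grep -E '^(DISPLAY|XAUTHORITY|XDG_RUNTIME_DIR|DBUS_SESSION_BUS_ADDRESS)=' \
        >>"${LOGFILE:-/tmp/save-page-env.log}"
    "$@"
}
```

The idea would be to call run_in_session followed by the same save-page-as.sh command line that already works in a terminal, then compare the log with the terminal's env output.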
Any thoughts?
David
> _______________________________________________
> users mailing list -- users@lists.fedoraproject.org
> To unsubscribe send an email to users-leave@lists.fedoraproject.org
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure