
List:       wget
Subject:    wget downloads files even if -R specifies otherwise
From:       "Edward J. Sabol" <sabol () alderaan ! gsfc ! nasa ! gov>
Date:       1998-10-28 22:47:54

Hi,

Though I've been using wget since 1.4.3, I just joined the wget mailing list.
I think wget is one of the best things since sliced bread. Thanks, Hrvoje, et
al., for such an amazingly useful program!

I've noticed that, if you use wget -R to reject certain filenames, wget
downloads these files anyway and then deletes them afterwards. I use a number
of command-line options, but the most relevant are probably "-r -l1
--timestamping -A html -R file1.html,file2.html,file3.html". Here's an exact
sample command that displays this behavior:

wget -r -A html -R PersonalInfo.html --no-parent -l1 \
http://lheawww.gsfc.nasa.gov/~sabol/

Ideally, I'd like to avoid downloading these rejected files altogether. Is
there a way to make wget reject them before downloading them? Am I doing
something wrong or is wget just not being as efficient as it could be?
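
To make concrete what I mean by rejecting a file before downloading it, here
is a minimal Python sketch. It is not wget code: the function names are made
up for illustration, and I am assuming that an -A/-R entry without wildcard
characters is compared as a filename suffix while an entry with wildcards is
compared as a shell-style pattern.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlsplit


def matches_any(name, patterns):
    """Return True if name matches any entry of an accept/reject list.

    Assumption: entries containing *, ?, [ or ] are shell-style patterns;
    all other entries are plain filename suffixes.
    """
    for pattern in patterns:
        if any(ch in pattern for ch in "*?[]"):
            if fnmatchcase(name, pattern):
                return True
        elif name.endswith(pattern):
            return True
    return False


def should_fetch(url, accept=(), reject=()):
    """Hypothetical pre-download filter: decide from the URL alone."""
    name = basename(urlsplit(url).path)
    if not name:
        # Directory-style URL (e.g. .../~sabol/): no filename to test.
        return True
    if accept and not matches_any(name, accept):
        return False
    if reject and matches_any(name, reject):
        return False
    return True


if __name__ == "__main__":
    base = "http://lheawww.gsfc.nasa.gov/~sabol/"
    for page in ("PersonalInfo.html", "Favorites.html"):
        print(page, should_fetch(base + page, ["html"], ["PersonalInfo.html"]))
```

With the lists from my sample command, a check like this would skip
PersonalInfo.html without ever opening a connection for it, while still
fetching Favorites.html.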

This is with version 1.5.3.

Thanks,
Ed

P.S. Here's the output of the above command with '-d' flag (sorry, it's
rather long):

DEBUG output created by Wget 1.5.3 on irix5.3.

parseurl ("http://lheawww.gsfc.nasa.gov/~sabol/") -> host lheawww.gsfc.nasa.gov -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
Checking for lheawww.gsfc.nasa.gov.
This is the first time I hear about host lheawww.gsfc.nasa.gov by that name.
--17:44:19--  http://lheawww.gsfc.nasa.gov:80/%7Esabol/
           => `lheawww.gsfc.nasa.gov/%7Esabol/index.html'
Connecting to lheawww.gsfc.nasa.gov:80... Created fd 4.
connected!
---request begin---
GET /%7Esabol/ HTTP/1.0
User-Agent: Wget/1.5.3
Host: lheawww.gsfc.nasa.gov:80
Accept: */*

---request end---
HTTP request sent, awaiting response... HTTP/1.0 200 OK
Server: Netscape-Enterprise/2.01
Date: Wed, 28 Oct 1998 22:44:19 GMT
Content-type: text/html


Length: unspecified [text/html]

    0K -> .

Closing fd 4
17:44:24 (367.28 B/s) - `lheawww.gsfc.nasa.gov/%7Esabol/index.html' saved [1843]

parseurl ("http://lheawww.gsfc.nasa.gov/~sabol/") -> host lheawww.gsfc.nasa.gov -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
Loaded HTML file lheawww.gsfc.nasa.gov/%7Esabol/index.html (size 1843).
Resetting a parser state.
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: /~sabol/PersonalInfo.html; constr: http://lheawww.gsfc.nasa.gov:80/~sabol/PersonalInfo.html
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: /~sabol/Favorites.html; constr: http://lheawww.gsfc.nasa.gov:80/~sabol/Favorites.html
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: http://heasarc.gsfc.nasa.gov/docs/bios/sabol.html; constr: http://heasarc.gsfc.nasa.gov/docs/bios/sabol.html
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: http://www.stx.com/; constr: http://www.stx.com/
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: http://www.nasa.gov/; constr: http://www.nasa.gov/
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: http://www.gsfc.nasa.gov/; constr: http://www.gsfc.nasa.gov/
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: http://heasarc.gsfc.nasa.gov/; constr: http://heasarc.gsfc.nasa.gov/
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: http://www.pittsburgh.net/; constr: http://www.pittsburgh.net/
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: http://www.cmu.edu/; constr: http://www.cmu.edu/
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: http://www.phys.cmu.edu/; constr: http://www.phys.cmu.edu/
file lheawww.gsfc.nasa.gov/%7Esabol/index.html; this_url http://lheawww.gsfc.nasa.gov:80/%7Esabol/; base (null)
link: /~sabol/; constr: http://lheawww.gsfc.nasa.gov:80/~sabol/
HTML parser ends here (state destroyed).
parseurl ("http://lheawww.gsfc.nasa.gov:80/~sabol/PersonalInfo.html") -> host lheawww.gsfc.nasa.gov -> port 80 -> opath ~sabol/PersonalInfo.html -> dir ~sabol -> file PersonalInfo.html -> ndir ~sabol
Checking for lheawww.gsfc.nasa.gov.
lheawww.gsfc.nasa.gov was already used, by that name.
Comparing hosts lheawww.gsfc.nasa.gov and lheawww.gsfc.nasa.gov...
They are quite alike.
I've decided to load it -> parseurl ("http://lheawww.gsfc.nasa.gov:80/%7Esabol/PersonalInfo.html") -> host lheawww.gsfc.nasa.gov -> port 80 -> opath %7Esabol/PersonalInfo.html -> dir ~sabol -> file PersonalInfo.html -> ndir ~sabol
Checking for lheawww.gsfc.nasa.gov.
lheawww.gsfc.nasa.gov was already used, by that name.
--17:44:24--  http://lheawww.gsfc.nasa.gov:80/%7Esabol/PersonalInfo.html
           => `lheawww.gsfc.nasa.gov/%7Esabol/PersonalInfo.html'
Connecting to lheawww.gsfc.nasa.gov:80... Created fd 4.
connected!
---request begin---
GET /%7Esabol/PersonalInfo.html HTTP/1.0
User-Agent: Wget/1.5.3
Host: lheawww.gsfc.nasa.gov:80
Accept: */*
Referer: http://lheawww.gsfc.nasa.gov:80/%7Esabol/

---request end---
HTTP request sent, awaiting response... HTTP/1.0 200 OK
Server: Netscape-Enterprise/2.01
Date: Wed, 28 Oct 1998 22:44:24 GMT
Content-type: text/html


Length: unspecified [text/html]

    0K -> ..

Closing fd 4
17:44:29 (541.81 B/s) - `lheawww.gsfc.nasa.gov/%7Esabol/PersonalInfo.html' saved [2786]

Recursion depth 2 exceeded max. depth 1.
Removing lheawww.gsfc.nasa.gov/%7Esabol/PersonalInfo.html since it should be rejected.
http://lheawww.gsfc.nasa.gov:80/%7Esabol/PersonalInfo.html already in list, so we don't load.
parseurl ("http://lheawww.gsfc.nasa.gov:80/~sabol/Favorites.html") -> host lheawww.gsfc.nasa.gov -> port 80 -> opath ~sabol/Favorites.html -> dir ~sabol -> file Favorites.html -> ndir ~sabol
Checking for lheawww.gsfc.nasa.gov.
lheawww.gsfc.nasa.gov was already used, by that name.
Comparing hosts lheawww.gsfc.nasa.gov and lheawww.gsfc.nasa.gov...
They are quite alike.
I've decided to load it -> parseurl ("http://lheawww.gsfc.nasa.gov:80/%7Esabol/Favorites.html") -> host lheawww.gsfc.nasa.gov -> port 80 -> opath %7Esabol/Favorites.html -> dir ~sabol -> file Favorites.html -> ndir ~sabol
Checking for lheawww.gsfc.nasa.gov.
lheawww.gsfc.nasa.gov was already used, by that name.
--17:44:29--  http://lheawww.gsfc.nasa.gov:80/%7Esabol/Favorites.html
           => `lheawww.gsfc.nasa.gov/%7Esabol/Favorites.html'
Connecting to lheawww.gsfc.nasa.gov:80... Created fd 4.
connected!
---request begin---
GET /%7Esabol/Favorites.html HTTP/1.0
User-Agent: Wget/1.5.3
Host: lheawww.gsfc.nasa.gov:80
Accept: */*
Referer: http://lheawww.gsfc.nasa.gov:80/%7Esabol/

---request end---
HTTP request sent, awaiting response... HTTP/1.0 200 OK
Server: Netscape-Enterprise/2.01
Date: Wed, 28 Oct 1998 22:44:29 GMT
Content-type: text/html


Length: unspecified [text/html]

    0K -> ..........

Closing fd 4
17:44:34 (1.99 KB/s) - `lheawww.gsfc.nasa.gov/%7Esabol/Favorites.html' saved [10565]

Recursion depth 2 exceeded max. depth 1.
http://lheawww.gsfc.nasa.gov:80/%7Esabol/Favorites.html already in list, so we don't load.
parseurl ("http://heasarc.gsfc.nasa.gov/docs/bios/sabol.html") -> host heasarc.gsfc.nasa.gov -> opath docs/bios/sabol.html -> dir docs/bios -> file sabol.html -> ndir docs/bios
parseurl ("http://lheawww.gsfc.nasa.gov/~sabol/") -> host lheawww.gsfc.nasa.gov -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
Trying to escape parental guidance with no_parent on.
http://heasarc.gsfc.nasa.gov:80/docs/bios/sabol.html already in list, so we don't load.
parseurl ("http://www.stx.com/") -> host www.stx.com -> opath  -> dir  -> file  -> ndir
parseurl ("http://lheawww.gsfc.nasa.gov/~sabol/") -> host lheawww.gsfc.nasa.gov -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
Trying to escape parental guidance with no_parent on.
http://www.stx.com:80/ already in list, so we don't load.
parseurl ("http://www.nasa.gov/") -> host www.nasa.gov -> opath  -> dir  -> file  -> ndir
parseurl ("http://lheawww.gsfc.nasa.gov/~sabol/") -> host lheawww.gsfc.nasa.gov -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
Trying to escape parental guidance with no_parent on.
http://www.nasa.gov:80/ already in list, so we don't load.
parseurl ("http://www.gsfc.nasa.gov/") -> host www.gsfc.nasa.gov -> opath  -> dir  -> file  -> ndir
parseurl ("http://lheawww.gsfc.nasa.gov/~sabol/") -> host lheawww.gsfc.nasa.gov -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
Trying to escape parental guidance with no_parent on.
http://www.gsfc.nasa.gov:80/ already in list, so we don't load.
parseurl ("http://heasarc.gsfc.nasa.gov/") -> host heasarc.gsfc.nasa.gov -> opath  -> dir  -> file  -> ndir
parseurl ("http://lheawww.gsfc.nasa.gov/~sabol/") -> host lheawww.gsfc.nasa.gov -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
Trying to escape parental guidance with no_parent on.
http://heasarc.gsfc.nasa.gov:80/ already in list, so we don't load.
parseurl ("http://www.pittsburgh.net/") -> host www.pittsburgh.net -> opath  -> dir  -> file  -> ndir
parseurl ("http://lheawww.gsfc.nasa.gov/~sabol/") -> host lheawww.gsfc.nasa.gov -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
Trying to escape parental guidance with no_parent on.
http://www.pittsburgh.net:80/ already in list, so we don't load.
parseurl ("http://www.cmu.edu/") -> host www.cmu.edu -> opath  -> dir  -> file  -> ndir
parseurl ("http://lheawww.gsfc.nasa.gov/~sabol/") -> host lheawww.gsfc.nasa.gov -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
Trying to escape parental guidance with no_parent on.
http://www.cmu.edu:80/ already in list, so we don't load.
parseurl ("http://www.phys.cmu.edu/") -> host www.phys.cmu.edu -> opath  -> dir  -> file  -> ndir
parseurl ("http://lheawww.gsfc.nasa.gov/~sabol/") -> host lheawww.gsfc.nasa.gov -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
Trying to escape parental guidance with no_parent on.
http://www.phys.cmu.edu:80/ already in list, so we don't load.
parseurl ("http://lheawww.gsfc.nasa.gov:80/~sabol/") -> host lheawww.gsfc.nasa.gov -> port 80 -> opath ~sabol/ -> dir ~sabol -> file  -> ndir ~sabol
http://lheawww.gsfc.nasa.gov:80/%7Esabol/ already in list, so we don't load.

FINISHED --17:44:34--
Downloaded: 15,194 bytes in 3 files

