
List:       wget
Subject:    Future of Wget
From:       Micah Cowan <micah@cowan.name>
Date:       2008-01-27 11:19:28
Message-ID: 479C68C0.3010000@cowan.name

So, I've been spending some time thinking about how I want to break
things down in the next several versions of Wget. Here's a summary of
how I think it'll break down; following it is a more in-depth discussion.

1.12
----
  Support for parsing links from CSS.
  Polishing and bug-fixing of Content-Disposition support.
  RFC-compliant HTTP authentication.
  Support for IRIs, or at least basic support for IDNs.
  GNUTLS.

1.13
----
  Many, many small bugfixes and improvements.
  HTTP/1.1 support, especially Transfer-Encoding.
  A streaming HTML parser.
  Support for basic external program filters.
    Filtering HTML before processing
    Filtering Content-Encoding (e.g., gzip)
    Filtering names of locally-saved files.
    Registered filter handlers for non-HTML content types, to allow
traversal of links in (e.g.) PDFs or XML, etc.
  The "metadata database"

2.0
----
  Support for dynamically-loaded modules:
    Modular protocol handling.
    Modular transport-layer handling (SOCKS, SCTP, connection via the
HTTP CONNECT method, TLS, IP over Carrier Pigeon, ...).
    Modular filestore support (download to tarball, .mht, ...).
  Revamped commandline and .wgetrc:
    Support for host- and path-specific configuration settings.
    Regex support.

2.1
----
  Support for multiple simultaneous connections?

1.12
----
The really big deal here, to me, is CSS. I want to have CSS support in
Wget ASAP. It's an essential part of the Web, and users definitely
suffer from the lack of support for it.
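
To give a rough idea of the kind of scanning involved, here's a toy
sketch of pulling url() references out of a CSS buffer; the names and
the callback interface are purely illustrative, not what will actually
land in Wget:

/* Toy sketch only: scan a buffer of CSS for url(...) tokens and hand
   each referenced URL to a callback.  Quoted and unquoted forms are
   handled; CSS escapes and @import with a bare string are left out. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

typedef void (*css_url_callback) (const char *url, size_t len);

static void
scan_css_urls (const char *buf, size_t size, css_url_callback cb)
{
  size_t i = 0;
  while (i + 4 < size)
    {
      if (tolower ((unsigned char) buf[i]) == 'u'
          && tolower ((unsigned char) buf[i + 1]) == 'r'
          && tolower ((unsigned char) buf[i + 2]) == 'l'
          && buf[i + 3] == '(')
        {
          size_t start = i + 4, end, len;
          char quote = 0;
          while (start < size && isspace ((unsigned char) buf[start]))
            start++;
          if (start < size && (buf[start] == '"' || buf[start] == '\''))
            quote = buf[start++];
          end = start;
          while (end < size && buf[end] != ')'
                 && (!quote || buf[end] != quote))
            end++;
          len = end - start;
          while (len > 0 && isspace ((unsigned char) buf[start + len - 1]))
            len--;
          if (len > 0)
            cb (buf + start, len);
          i = end;
        }
      else
        i++;
    }
}

static void
print_url (const char *url, size_t len)
{
  printf ("found: %.*s\n", (int) len, url);
}

int
main (void)
{
  const char css[] = "body { background: url( 'img/bg.png' ); }\n"
                     "@import url(print.css);";
  scan_css_urls (css, sizeof css - 1, print_url);
  return 0;
}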

I'd love to just get that done and get it out; but there are a number
of other things that, for me, are too important to hold off on. For one, the
Content-Disposition thing needs some cleaning up, polishing, and a bit
of bugfixing. And, the HTTP authentication code gained some improvement
in 1.11, but still isn't RFC-compliant, and I'd really like it brought
up to speed.

Internationalization is quickly becoming more and more of an issue. I
had been planning on putting that off for a couple revisions, but I
don't think I can do that any longer. At the very least, we need to
handle IDNs, as it looks like ICANN will be introducing IDN ccTLDs in
the latter half of 2008:
http://ccnso.icann.org/workinggroups/idn-time-table-19dec07.htm

While implementing IDN support wouldn't be terribly difficult, it would
be much more effective as part of more complete support for IRIs in
general: in particular, transcoding arguments entered in the user's
locale to IRIs, and transcoding back to the user's character encoding
for filenames. I don't know if all this can be accomplished for 1.12,
but getting _something_ accomplished toward this end would be a very
good idea.
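
For the IDN half of this, GNU libidn seems the natural candidate
(though that's an assumption, not a decision). The conversion itself is
small; roughly this, assuming the host name has already been transcoded
to UTF-8:

/* Sketch, assuming GNU libidn: convert a UTF-8 host name to the
   ASCII-Compatible Encoding ("xn--...") actually sent on the wire.
   Build with something like: gcc idn-sketch.c -lidn */
#include <stdio.h>
#include <stdlib.h>
#include <idna.h>

int
main (int argc, char **argv)
{
  const char *host = argc > 1 ? argv[1]
                              : "www.räksmörgås.example";  /* UTF-8 literal */
  char *ace = NULL;
  int rc = idna_to_ascii_8z (host, &ace, 0);

  if (rc != IDNA_SUCCESS)
    {
      fprintf (stderr, "IDNA conversion failed: %s\n", idna_strerror (rc));
      return 1;
    }
  printf ("%s -> %s\n", host, ace);
  free (ace);
  return 0;
}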

Support for GNUTLS would also be great to have in 1.12, if it's not too
much effort. As I understand it, there are just a few kinks left to be
worked out; if so, it'd be worth trying to get it in for that release.
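
For what it's worth, the client side of a GnuTLS connection doesn't
amount to much code. Roughly something like the following, with
certificate verification and error handling left out, and
connect_to_host() standing in for our existing connection code:

/* Rough sketch of the GnuTLS calls an HTTPS connection needs; this is
   not actual Wget code.  Certificate verification and error handling
   are omitted, and connect_to_host() is a placeholder. */
#include <gnutls/gnutls.h>

extern int connect_to_host (const char *host, int port);   /* placeholder */

int
ssl_connect_sketch (const char *host)
{
  gnutls_certificate_credentials_t creds;
  gnutls_session_t session;
  int fd = connect_to_host (host, 443);

  gnutls_global_init ();
  gnutls_certificate_allocate_credentials (&creds);

  gnutls_init (&session, GNUTLS_CLIENT);
  gnutls_set_default_priority (session);
  gnutls_credentials_set (session, GNUTLS_CRD_CERTIFICATE, creds);
  gnutls_transport_set_ptr (session, (gnutls_transport_ptr_t) (long) fd);

  if (gnutls_handshake (session) < 0)
    return -1;

  /* From here on, gnutls_record_send()/gnutls_record_recv() replace
     plain write()/read() on the socket. */
  return 0;
}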

1.13
----

In my haste to get these all-important features out the door, a lot of
other, lesser but worthy improvements will have gone by the wayside.
1.13 will be when I finally have a little breathing room to tackle some
of these. I've gone through all the bug reports very carefully, and
everything that I could see myself postponing has been shifted from 1.12
to 1.13 in Savannah. This includes several "pet" issues I'd really like
to see go in for 1.12--and many of them really wouldn't take too
long--but together they add up to quite a bit, and considering that,
even without them, 1.12 already has a pile of work to be done, it's
best to postpone them.

Among the more interesting and significant things I'd like to do in
1.13, however, is the ability to have Wget call out to external
programs for further processing. A good example is the recent
--transform-html proposal, for post-processing downloaded HTML files
(though for that and some other use cases, it might make more sense to
feed the already-parsed links to the filter, rather than the entire
HTML file). The same idea could be adapted for processing
Content-Encoded resources (say, compressed via gzip). Registering
handlers for non-HTML content types, to spit out more URLs, could also
be useful, as would filters to translate URLs into paths suitable for
placing downloaded files.
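
To make the shape of that concrete, here's roughly what piping
downloaded data through an external command might look like
internally; the function and its interface are purely illustrative:

/* Sketch only: run DATA through the shell command CMD (e.g. a
   --transform-html script) and copy the filter's output to OUT.
   A separate writer process feeds the child so neither side can
   deadlock on a full pipe.  Most error handling is trimmed. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static int
filter_through (const char *cmd, const char *data, size_t len, FILE *out)
{
  int to_child[2], from_child[2];
  pid_t filter_pid, writer_pid;
  char buf[4096];
  ssize_t n;

  if (pipe (to_child) < 0 || pipe (from_child) < 0)
    return -1;

  filter_pid = fork ();
  if (filter_pid == 0)
    {                           /* the filter command itself */
      dup2 (to_child[0], STDIN_FILENO);
      dup2 (from_child[1], STDOUT_FILENO);
      close (to_child[0]); close (to_child[1]);
      close (from_child[0]); close (from_child[1]);
      execl ("/bin/sh", "sh", "-c", cmd, (char *) NULL);
      _exit (127);
    }

  writer_pid = fork ();
  if (writer_pid == 0)
    {                           /* feeds the data to the filter */
      close (to_child[0]); close (from_child[0]); close (from_child[1]);
      write (to_child[1], data, len);   /* short writes ignored here */
      _exit (0);
    }

  close (to_child[0]); close (to_child[1]); close (from_child[1]);
  while ((n = read (from_child[0], buf, sizeof buf)) > 0)
    fwrite (buf, 1, n, out);
  close (from_child[0]);

  waitpid (writer_pid, NULL, 0);
  waitpid (filter_pid, NULL, 0);
  return 0;
}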

Reworking the HTML parsing code to be able to handle streamed content
instead of slurping everything in memory would then be advantageous, as
it would avoid the use of temporary files, and allow Wget to process
data as it comes, rather than after everything's been received. This
would make particular sense in feeding URLs to filter programs.
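
The essential change is that the parser's state needs to live in a
structure that survives between reads, so it can pick up where it left
off when the next chunk arrives. Something along these lines, though
vastly simplified (it only recognizes href="..."), gives the general
idea:

/* Toy sketch of a streaming (push-style) link extractor: the state
   lives in a small struct, so data can be fed chunk by chunk as it
   arrives from the network, and a link is still found when its tag
   straddles two chunks.  The names and interface are illustrative
   only, not Wget's. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define PATTERN "href=\""

struct stream_parser
{
  size_t matched;               /* how much of PATTERN we've seen */
  int in_url;                   /* currently inside the quoted URL? */
  char url[1024];
  size_t url_len;
};

static void
feed (struct stream_parser *p, const char *chunk, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    {
      char c = chunk[i];
      if (p->in_url)
        {
          if (c == '"')
            {
              printf ("link: %.*s\n", (int) p->url_len, p->url);
              p->in_url = 0;
              p->url_len = 0;
            }
          else if (p->url_len < sizeof p->url)
            p->url[p->url_len++] = c;
        }
      else if (tolower ((unsigned char) c) == PATTERN[p->matched])
        {
          if (PATTERN[++p->matched] == '\0')
            {
              p->in_url = 1;
              p->matched = 0;
            }
        }
      else
        p->matched = (tolower ((unsigned char) c) == PATTERN[0]) ? 1 : 0;
    }
}

int
main (void)
{
  struct stream_parser p = { 0, 0, "", 0 };
  /* The tag is split across two "network" chunks on purpose. */
  const char *chunk1 = "<p><a hr";
  const char *chunk2 = "ef=\"http://example.com/\">x</a></p>";
  feed (&p, chunk1, strlen (chunk1));
  feed (&p, chunk2, strlen (chunk2));
  return 0;
}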

HTTP/1.1 support, especially support for the "chunked"
Transfer-Encoding, would augment Wget's existing support for persistent
connections. Web sites are relying more and more on dynamic content
generation, and when a dynamically generated page means that the remote
server can't know in advance what the Content-Length should be, its only
option to indicate to Wget that a file has completed transfer is to
close the connection. In addition to being inefficient, a connection
closed to indicate the end of a resource is indistinguishable from a
connection that was simply lost before the transfer completed.
The "chunked" transfer encoding enables servers to indicate the end of
dynamically-generated content, so that the connection may be kept open.
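
Decoding chunked bodies isn't much work, either: the data arrives as a
series of chunks, each prefixed with its length in hex, and a
zero-length chunk marks the end. A rough sketch, with trailers and
chunk extensions ignored and a stdio stream standing in for the
connection:

/* Sketch of decoding the HTTP/1.1 "chunked" Transfer-Encoding: each
   chunk is a hexadecimal length on its own line followed by that many
   bytes of data; a zero-length chunk marks the end of the body, so the
   server never has to close the connection to signal completion. */
#include <stdio.h>
#include <stdlib.h>

static int
decode_chunked (FILE *in, FILE *out)
{
  char line[128];

  for (;;)
    {
      unsigned long size;

      if (!fgets (line, sizeof line, in))
        return -1;                        /* connection lost mid-body */
      size = strtoul (line, NULL, 16);    /* chunk-size (extensions ignored) */
      if (size == 0)
        break;                            /* last-chunk: body is complete */

      while (size > 0)
        {
          char buf[4096];
          size_t want = size < sizeof buf ? size : sizeof buf;
          size_t got = fread (buf, 1, want, in);
          if (got == 0)
            return -1;
          fwrite (buf, 1, got, out);
          size -= got;
        }
      fgets (line, sizeof line, in);      /* CRLF after the chunk data */
    }
  /* Any trailer headers up to the blank line would be read here. */
  return 0;
}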

With filtered filenames--and already for features like
Content-Disposition--it will be extremely useful to have a file that
maps local filenames to the URLs from which they came. Instead of
having to perform a GET on a URL, discover that it has a
Content-Disposition header, look for the local file corresponding to
that header, check the timestamps and/or accept/reject lists, and then
terminate the connection (a nasty habit to get into with persistent
connections), Wget would be able to just look up the URI in the local
database, find out what local file it maps to, and perform the
appropriate checks based on that information.

Also, with IRI support, it will be important for Wget to obtain
information about a resource's character encoding, so that it knows how
to interpret the IRIs found within it. This kind of information, along
with the URI <-> filename mapping, will be stored in what I've been
calling the "metadata database."

2.0 and beyond
----

I'm not going to talk much about the rest here, except to say that
multiple-connection support is not likely to get any of my attention
for a long time to come. There are simply too many other important
matters to address. As far as Wget's current functionality goes,
multiple connections won't drastically improve most use cases, since we'd
limit the number of connections to a single host anyway: the biggest
gains would be in multi-host recursive fetches, and extensions like the
MetaLink functionality. For these areas, of course, it would bring great
benefit, but the amount of work involved in achieving it simply isn't
feasible for the time being.

All the other features I've mentioned can be adapted on top of the
current code base without an inordinate amount of trouble. However,
supporting multiple connections would require a complete reworking of
pretty much every aspect of Wget's code. It's better to hold off on that
so we can focus on more pressing needs.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/