Download several files: part 2

01.14.10

In an earlier post we used wget to download a single image file, and then used it to get all of the ‘gif’ and ‘jpg’ files with a single command.  Multi-download commands of this type are helpful when you know the URL and exact directory where the image files live.  Let’s now take it a step further, and get lazy too.  Lazy?? Yes, lazy.  Since we’re looking to use Linux for time-saving shortcuts, the less work we have to do to accomplish our task, the better.

As previously mentioned, I like podcasts.   Podcasts are (usually) available as an RSS feed in the form of a web URL.  Programs such as iTunes, Amarok, or Rhythmbox use feed URLs to get info about the available audio files; you can download them manually or set up preferences that do this for you.

We’re going to look at this from a “get me all the files—now” approach using the Linux command line.

To perform a multi-file “unattended” download…

  1. Make sure that “lynx” (a terminal-based web browser) is installed.  To check, type “which lynx” at the prompt.  If the shell responds with nothing but the next prompt, then it’s not installed.  To install lynx on a Debian-based OS such as Ubuntu (or similar), type “sudo aptitude install lynx” at the prompt.  If you’re using a Red Hat-based system, type “sudo yum install lynx” to accomplish the same.   Once lynx is installed, the shell will return its executable path (it might appear as /usr/local/bin/lynx) when you type “which lynx” at your prompt.
  2. Make sure you have wget installed.  In the terminal, type “which wget” and see what the shell returns.  If it’s not on your system, install it.  Steps one and two only have to be done once, if at all.  wget is usually there out of the box, but lynx probably isn’t included at install time.
  3. Have a URL (or RSS feed URL) where the desired files exist.
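The checks in steps one and two can be sketched as a small shell loop (just a sketch, not a substitute for your package manager; “command -v” is a portable stand-in for “which”):

```shell
# Check for the two tools we need; `command -v` prints the
# executable's path if the tool is installed, nothing otherwise.
for tool in lynx wget; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found at $(command -v "$tool")"
  else
    echo "$tool is NOT installed"
  fi
done
```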

Here’s our practical example.  Let’s download all the mp3 files at Steven J. Cohen’s “Doctor Who” RSS Feed. You should view this link in a web browser to make sure that the page/feed is still there.

Time for a “trial run” (this next command will not download anything; it just lists the mp3 files at the feed URL).

lynx -dump http://www.stevenjaycohen.com/audio/drwho/feed | egrep -o "http:.*mp3"

lynx -dump [URL] returns a numbered list of web links from a given web page (for the complete HTML source, use lynx -source [URL]). Since we only want the links (and not the numbering), we filter this list using the UNIX pipe character “|” and the search tool egrep -o [pattern].  Our pattern “http:.*mp3” captures any link that starts with http and ends with mp3 (note that “.” matches any single character and “*” means repeated any number of times). A word of caution: it’s ALWAYS a good idea to do a trial run so that you can see what you are about to request for download and whether your command builds the list properly.  This is a very important preliminary step.
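If the live feed ever disappears, you can still see how the filter behaves by running it against a made-up sample of lynx -dump output (the file name and URLs below are invented for illustration):

```shell
# A made-up sample of what `lynx -dump` emits: a numbered list of links.
printf '%s\n' \
  '  1. http://example.com/audio/episode01.mp3' \
  '  2. http://example.com/audio/episode02.mp3' \
  '  3. http://example.com/pages/about.html' > links.txt

# Same filter as the trial run: -o prints only the matching part of
# each line, so the numbering is dropped and only the mp3 URLs remain.
egrep -o "http:.*mp3" links.txt
```

The html link on the third line doesn’t match the pattern, so only the two mp3 URLs come through.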

Now, let’s do this for real.  The following command downloads files into the current directory of the shell.  So if you execute the command from “/home/myUserName/music” then the files get saved into “music”.

lynx -dump http://www.stevenjaycohen.com/audio/drwho/feed | egrep -o "http:.*mp3" | xargs -n1 wget

And that’s it.  The shell shows progress of each file as it downloads.  When it’s done with the first file, it downloads the next one, and so on.  It runs unattended, allowing you to do other things with your time.

To perform the “unattended” download of all the files in the list, we needed another pipe and another command, “xargs”.  Why xargs?  Sometimes the shell runs into a “too many arguments” problem when acting on a long list.  xargs is your friend should this happen.

xargs [options] [command].  The option and the command work together as follows: the “-n1” option tells xargs to run “wget” once per URL from the list produced by the “lynx -dump” part of the command.  Like most things in the shell, there’s usually more than one way to do it.
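You can see the effect of -n1 without downloading anything by substituting echo for wget (the URLs here are placeholders):

```shell
# `echo` stands in for wget: with -n1, each line of input becomes one
# separate invocation of the command, so each URL would be fetched on
# its own.
printf 'http://example.com/a.mp3\nhttp://example.com/b.mp3\n' \
  | xargs -n1 echo wget
```

This prints one “wget [URL]” command line per URL, which is exactly what xargs executes in the real version.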


2 Responses to “Download several files: part 2”

  1. Kirsten says:

    Hi,
    Thanks for your tutorial, it was a big help.
    Is there a way to use this method on websites that use an https connection?

  2. adamlinuxhelp says:

    Hi Kirsten, thanks for your comment. As far as I can tell, wget should work with https sites, but you may want to check how wget was compiled on your system; it needs to have been built with SSL support if it’s not there already.

    “To support encrypted HTTP (HTTPS) downloads, Wget must be compiled with an external SSL library, currently OpenSSL. If Wget is compiled without SSL support, none of these options are available.”
    http://www.gnu.org/software/wget/manual/html_node/HTTPS-_0028SSL_002fTLS_0029-Options.html

    Hope this helps,
    Adam
