Posts Tagged ‘wget’

Favorite CLI Linux Apps: Lynx

02.19.10

Posted by adamlinuxhelp  |  1 Comment »

Lynx is a text-only web browser that runs in the shell.

Lynx is a useful tool for those times when you want to extract only the web links from a web page.  Install lynx using Synaptic (or another package manager).

To dump a given web page as plain text, followed by a numbered list of its hyperlinks (google.com in this example), issue the command

lynx -dump http://www.google.com

It can also behave in a similar way to wget when you want to view the HTML source code of a web page.  The command to view the HTML source code is

lynx -source http://www.example.com

Click the following link to view a post where we collected links to mp3 files to build an unattended download list for the wget command.  Another feature of Lynx is that it allows you to view your pages as a web crawler/robot such as googlebot might see them.

Download several files: part 2

01.14.10

Posted by adamlinuxhelp  |  2 Comments »

In an earlier post we used wget to download a single image file, and then used it to get all of the ‘gif’ and ‘jpg’ files with a single command.  Multi-download commands of this type are helpful when you know the URL and exact directory where the image files exist.  Let’s now take it a step further, and get lazy too.  Lazy?? Yes, lazy.  Since we’re looking to use Linux for time-saving shortcuts, the less work we have to do to accomplish our task, the better.

As previously mentioned, I like podcasts.  Podcasts are (usually) available as an RSS feed at a web URL.  Programs such as iTunes, Amarok, Rhythmbox (or others) use feed URLs to get info about the available audio files, and you can download them manually or set up preferences that do this for you.

We’re going to look at this from a “get me all the files—now” approach using the Linux command line.

To perform a multi-file “unattended” download…

  1. Make sure that “lynx” (a terminal-based web browser) is installed.  To check, type “which lynx” at the prompt.  If the shell responds with nothing but the next prompt, then it’s not installed.  To install lynx on a Debian-based OS such as Ubuntu (or similar), type “sudo aptitude install lynx” at the prompt.  On a Red Hat-based system, type “yum install lynx” as root to accomplish the same.  Once lynx is installed, “which lynx” returns the path of the lynx executable (it might appear as /usr/local/bin/lynx).
  2. Make sure you have wget installed.  In the terminal, type “which wget” and see what the shell returns.  If it’s not on your system then install it.  Items one and two only have to be done once, if at all.  I think wget will be there, but lynx is probably not included out of the box at install time.
  3. A URL (or RSS feed URL) where the desired files exist.
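The two checks above can be rolled into one small script.  This is only a sketch: the function name check_tool is made up for this example, and it uses “command -v” (the POSIX counterpart of “which”) to look up each tool.

```shell
#!/bin/sh
# check_tool is a hypothetical helper name, not part of lynx or wget.
# `command -v` prints the path of a tool and fails if it is not on the PATH.
check_tool() {
    if tool_path=$(command -v "$1" 2>/dev/null); then
        echo "$1: found at $tool_path"
    else
        echo "$1: not installed"
        return 1
    fi
}

# Check both prerequisites and count how many are missing.
missing=0
for tool in lynx wget; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "missing tools: $missing"
```

If the last line reports anything other than zero, install the missing tool(s) with your package manager before going on.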

Here’s our practical example.  Let’s download all the mp3 files at Steven J. Cohen’s “Doctor Who” RSS Feed. You should view this link in a web browser to make sure that the page/feed is still there.

Time for a “trial-run” (this next command will not download, just list the mp3 files at the Feed URL).

lynx -dump http://www.stevenjaycohen.com/audio/drwho/feed | egrep -o "http:.*mp3"

lynx -dump [URL] returns the page as plain text, followed by a numbered list of the web links on that page (for the complete HTML source, use lynx -source [URL]). Since we only want the links (and not the numbering or the page text), we need to filter this output using the UNIX pipe character “|” and the search tool egrep -o [pattern].  The -o option prints only the part of each line that matches.  We put in “http:.*mp3” as our pattern, which will capture any text that starts with http: and ends with mp3 (note that .* is a wildcard meaning “any sequence of characters”, and it grabs as much of the line as it can). A word of caution: it’s ALWAYS a good idea to do a trial run so that you have an idea of what you will request for download and whether your command will succeed in building the list properly.  This is a very important preliminary step.
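To see what the filter does without touching the network, here is a small sketch that runs it on a made-up sample of link-list output.  The example.com URLs and the function names are assumptions for illustration only, and grep -E is used as the equivalent spelling of egrep.  The second pattern is my own stricter variant, not the one from the command above.

```shell
#!/bin/sh
# Hypothetical sample of the link list at the end of `lynx -dump` output.
sample='References

   1. http://example.com/audio/feed
   2. http://example.com/audio/episode1.mp3
   3. http://example.com/audio/episode2.mp3
   4. http://example.com/about.html'

# The pattern from the post: -o keeps only the matched part of each line.
extract_mp3_loose() {
    grep -Eo 'http:.*mp3'
}

# A stricter variant (an assumption, not from the post): the match cannot
# cross whitespace, must end in ".mp3", and also accepts https links.
extract_mp3_strict() {
    grep -Eo 'https?://[^[:space:]]*\.mp3'
}

echo "Loose pattern:"
printf '%s\n' "$sample" | extract_mp3_loose
echo "Strict pattern:"
printf '%s\n' "$sample" | extract_mp3_strict
```

On this sample both patterns list the same two mp3 links.  They differ when one line holds two links: the loose pattern swallows everything from the first http: to the last mp3, while the stricter one returns each link separately.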

Now, let’s do this for real.  The following command downloads files into the current directory of the shell.  So if you execute the command from “/home/myUserName/music” then the files get saved into “music”.

lynx -dump http://www.stevenjaycohen.com/audio/drwho/feed | egrep -o "http:.*mp3" | xargs -n1 wget

And that’s it.  The shell shows progress of each file as it downloads.  When it’s done with the first file, it downloads the next one, and so on.  It runs unattended, allowing you to do other things with your time.

To perform the “unattended” download of all the files specified in the list, we needed another pipe, and another command structure known as “xargs”.  Why xargs?  Sometimes the shell runs into a problem of having “too many arguments” in its list to act on.  xargs is your friend should this happen.

xargs [options] [command].  The option and the command work together as follows.  The option “-n1” tells xargs to run the command “wget” once for each URL in the list produced by the “lynx -dump” and “egrep” parts of the command.  As with many shell tasks, there’s usually more than one way to do it.
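A harmless way to watch “-n1” at work is to put “echo” in front of wget, so xargs prints each command instead of running it.  The three URLs below are placeholders standing in for the list that lynx and egrep would produce.

```shell
#!/bin/sh
# Stand-in URL list (hypothetical); in the real pipeline this comes from
# lynx and egrep.
urls='http://example.com/one.mp3
http://example.com/two.mp3
http://example.com/three.mp3'

# With -n1, xargs builds one command line per URL.
echo "With -n1 (one wget per URL):"
printf '%s\n' "$urls" | xargs -n1 echo wget

# Without -n, xargs packs as many URLs as fit onto a single command line.
echo "Without -n (all URLs handed to one wget):"
printf '%s\n' "$urls" | xargs echo wget
```

Either form would download all three files, since wget accepts several URLs at once.  As one of the “more than one way” options, wget can also read the list itself from standard input with “wget -i -”.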

Download several files: part 1

01.14.10

Posted by adamlinuxhelp  |  1 Comment »

How to use wget; download many files with one command.

A typical way to download a file is to “right-click” on it and “save as” to a folder on your computer.  Downloading a few files this way is not tedious.  But if an audio book has 25 to 30 files you can bet I don’t want to do those manual moves over and over again.

Using a terminal, there’s a faster way to download files.  I’ll now introduce one of my often-used commands: wget.  This command has many useful options.  For example, you can download files, set up custom directory structures for your download(s), or see if a file exists without actually downloading it.

Using the command (in simple terms): Open a terminal and type wget [options] [urls] at the prompt (usually a dollar sign).  You can use one or several URLs.  Options are (well…) optional.

Here’s a practical example where you can download a gif image from the O’Reilly site linked below.  When you open a Linux terminal, you are usually in your user’s “home” directory.  This is fine for the purpose of this example.  Issue the command

wget http://oreilly.com/catalog/covers/0596009305_bkt.gif

Here’s what will happen: the file 0596009305_bkt.gif gets downloaded and saved to your home folder.  Cool, right? But it was a bit of work (typing) just to download one file.  How does this save me time?

Yes, the above example is overly simplified.  You can, if you wish, download all of the “.gif” and “.jpg” files from a given web address with the example below.  It’s a time-saving single command, borrowed from the commandlinefu website mentioned in the “cool and advanced uses of wget” link below.

wget -r -l1 --no-parent -nH -nd -A".gif,.jpg" http://example.com/images

*Change the “example.com/images” to a valid web address.  The options above (explained) are:

  • -r for “recursive”
  • -l1 only get files in the “images” directory (don’t dive into subdirectories)
  • --no-parent, -nH and -nd : --no-parent keeps wget from climbing into the parent directory, while -nH and -nd ignore the host name and directory structure (no directories, just get the files)
  • “-A” is the “accept list” for files of type [.gif and .jpg].  It’s case-sensitive, so it would not download files ending in “.JPG”; if you need those too, specify -A".gif,.jpg,.JPG"
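If you’d rather not type every upper- and lower-case spelling by hand, a small helper can build the accept list for you.  This is just a sketch: accept_list is a made-up function name, and the final line only prints the wget command (with the placeholder address from above) rather than running it.

```shell
#!/bin/sh
# accept_list is a hypothetical helper: for each extension given, it emits
# the all-lower-case and all-upper-case spelling, joined with commas.
accept_list() {
    list=""
    for ext in "$@"; do
        lower=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
        upper=$(printf '%s' "$ext" | tr '[:lower:]' '[:upper:]')
        list="${list:+$list,}.$lower,.$upper"
    done
    printf '%s\n' "$list"
}

# Print the command instead of running it; example.com/images is a placeholder.
echo wget -r -l1 --no-parent -nH -nd -A"$(accept_list gif jpg)" http://example.com/images
```

Remove the leading “echo” to run the download for real.  Note that mixed-case endings such as “.Jpg” are not covered by this helper; recent versions of wget also offer an --ignore-case option that may be simpler if your version has it.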

You can find more wget info and options here.  For really cool and advanced uses of wget, see this page.

I’ll post another awesome usage of wget in another post.  Thanks for reading.