Download a Website for offline browsing

Use common Java classes to build an offloading utility

In this article, I guide you through the steps involved in designing a utility to download a Website. This utility downloads only text and image files, but it can easily be extended to download files of any type. At the end of the article I'll provide tips on how you can extend the utility.

First, a brief introduction to URLs (Uniform Resource Locators) would not be out of place. The general form of a URL is:

protocol://machinename[:port]/filename[#reference]


An absolute URL -- such as http://java.sun.com/products/jdk1.2 -- has all the components required to identify the resource on the Web. In relative URLs, the protocol and the machine name are inherited from the base URL embedded in the document (base tag) or from the URL used to retrieve the document. For example, assume that you have downloaded an HTML document using the URL http://www.somesite.com/index.html and that this document has a link home.html. The link actually points to http://www.somesite.com/home.html. For more information, please see Resources.
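
The resolution rule for relative links can be tried out with the two-argument constructor of java.net.URL, which takes a base URL and a link. The following is a minimal sketch; the class name RelativeUrlSketch and the helper resolve() are illustrative and are not part of the utility.

```java
import java.net.MalformedURLException;
import java.net.URL;

class RelativeUrlSketch {
    // Resolves a link found in a page against the URL the page was fetched from.
    // An absolute link simply replaces the base URL.
    static URL resolve(String pageUrl, String link) throws MalformedURLException {
        URL base = new URL(pageUrl);
        return new URL(base, link);
    }

    public static void main(String[] args) throws MalformedURLException {
        // prints http://www.somesite.com/home.html
        System.out.println(resolve("http://www.somesite.com/index.html", "home.html"));
        // prints http://java.sun.com/products/jdk1.2
        System.out.println(resolve("http://www.somesite.com/index.html",
                                   "http://java.sun.com/products/jdk1.2"));
    }
}
```

Note that this sketch does not look for a base tag inside the document; it assumes the base is the URL used to retrieve the page.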

The utility I describe in this article uses the URL class in the java.net package. The class provides three methods to obtain data from the URL. In this utility, I use the method public final InputStream openStream() throws IOException to establish a connection with the URL and to return an InputStream object to get the data from the URL. Note that the data does not contain any of the HTTP headers. This method hides all the intricacies of setting the appropriate parameters to make a connection and connecting to the remote resource. It returns the InputStream, which helps you to get the data as you would get any other file stream.
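
As a rough illustration of openStream(), the sketch below reads everything a URL returns into a byte array. The class name OpenStreamSketch and the method fetch() are hypothetical; the utility's own reading code may be organized differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

class OpenStreamSketch {
    // Opens a connection to the URL and returns the body of the response.
    // For HTTP URLs the stream carries only the content, not the headers.
    static byte[] fetch(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toByteArray();
        }
    }
}
```

Because openStream() works the same way for any protocol the URL class supports, the same method can be exercised against a file: URL without touching the network.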

Some of the commonly used protocols are HTTP, FTP, Gopher, and News. This article deals only with HTTP (HyperText Transfer Protocol), an application-level protocol commonly used to transfer hypertext documents across the Internet. HTTP has gained importance because of its simplicity and low overhead.

The main idea

Suppose you visit a Webpage containing links to several other pages that, in turn, have links to still other pages. You want to download all those pages onto your hard disk. How would you accomplish this? You could simply visit all the pages and save them on your hard disk, right? However, that is not only a tedious process but also an inconvenient one. The links in the pages may not be pointing correctly (relative to the location of other pages you are downloading), or the links might be absolute URLs pointing to the remote machine (in which case, downloading the page becomes useless). You could manipulate the links manually, but that would also be a painful process.

This utility lets you download all the pages of a Website in a graceful manner. It follows these simple steps:

  1. It downloads a page and stores all the links inside a vector
  2. It loops (or iterates) over all the elements of the vector, repeating Step 1 and Step 2 recursively
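
Step 1 depends on finding the links in a downloaded page. The article does not show how the utility parses a page, so the following is only one possible approach: a regular expression that collects the values of href attributes. The class name LinkExtractorSketch, the method extractLinks(), and the pattern are all assumptions, and the sketch uses a List where the utility uses a Vector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractorSketch {
    // Matches href = "value" or href = 'value' and captures the value up to
    // the closing quote or a '#' fragment. Links that are only a fragment
    // (such as "#top") are not captured.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    // Returns the link targets in the order they appear in the page.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1).trim());
        }
        return links;
    }
}
```

A regular expression is a simplification: it ignores unquoted attribute values and image references (src attributes), both of which a complete downloader would need to handle.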


The utility consists of four classes: DownloadSite, Downloader, URLlist, and ExtendedURL. You can download the source code from Resources.

DownloadSite

The DownloadSite class reads the command line arguments and does some initialization. It contains the main() method. This utility takes at least one but no more than two arguments. The first argument is the site name, and the second is the location of the new directory created to hold the downloaded files. If you do not specify the second argument, the files are downloaded into the current directory.
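
The argument rules can be sketched as follows. This is not the code from DownloadSite; the class name ArgsSketch, the method targetDirectory(), and the usage message are hypothetical.

```java
import java.io.File;

class ArgsSketch {
    // Accepts one or two arguments: the site, and optionally the directory
    // that will hold the downloaded files. Without a second argument the
    // current directory is used.
    static File targetDirectory(String[] args) {
        if (args.length < 1 || args.length > 2) {
            throw new IllegalArgumentException("usage: java DownloadSite <site> [directory]");
        }
        String dir = (args.length == 2) ? args[1] : ".";
        return new File(dir);
    }
}
```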

If you need to use this utility behind a firewall, make the necessary changes in DownloadSite. See Resources for information on how to access sites when you are behind a firewall.

DownloadSite parses the command line arguments and passes them to the Downloader class, which does the actual downloading.

Downloader

Downloader is the heart of the utility. This class contains the logic used to download the pages and the code to manipulate the links.

You use recursion to download the pages. The logic is simple:

    private void startDownload(URL u)
    {
        ...
        listOfURL = downloadAndFillVector(in, out);
        /*
         * downloadAndFillVector downloads the file specified by
         * URL u (rewriting its links along the way) and returns
         * a vector of the URLs found in that file.
         * After the execution of this statement, listOfURL contains
         * the URLs in the current page that need to be downloaded.
         */
        ...
        sizeOfVector = listOfURL.size();
        for(int i = 0; i < sizeOfVector; i++)
            startDownload((URL)listOfURL.elementAt(i));
        /*
         * Loop through all the elements of the vector and
         * call startDownload recursively. The process repeats
         * until all the pages have been downloaded.
         */
    }


I should explain two private members of this class: private String hostName and static Vector URLs:

  • hostName contains the machine name from the first page's URL (the URL provided at the command line). In any page, you can have two types of links: absolute and relative. If the link is relative, use this hostName to retrieve the document. But if the link is absolute, you must check whether or not the host name in the link is the same as hostName. If it is, include this link in the list of URLs to be downloaded. If it isn't, ignore this link. For example, if you are downloading a site, say www.somesite.com, and one of its pages contains a link to www.othersite.com, you do not want to download pages from www.othersite.com.
  • URLs is the global vector to which you add every page you download. When you find a link, check whether it is already present in URLs; this prevents you from downloading a page twice. It also guards against cycles: a page a.html often links to a page b.html that links back to a.html (for example, through a Back link), and without this check the recursion would never end.
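
The two checks described above can be combined in one small class. The sketch below is an illustration under stated assumptions, not the Downloader source: the class name LinkFilterSketch and the method accept() are hypothetical, and a Set of URL strings stands in for the static Vector so that the duplicate check is a single call.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;

class LinkFilterSketch {
    private final String hostName;                         // host of the first page
    private final Set<String> seen = new LinkedHashSet<>(); // URLs already accepted

    LinkFilterSketch(URL start) {
        this.hostName = start.getHost();
        seen.add(start.toString());
    }

    // Returns the resolved URL if the link should be downloaded, or null if
    // it points to another host or has already been seen.
    URL accept(URL page, String link) throws MalformedURLException {
        URL resolved = new URL(page, link);   // handles relative and absolute links
        if (!hostName.equalsIgnoreCase(resolved.getHost())) {
            return null;                      // different site: ignore
        }
        return seen.add(resolved.toString()) ? resolved : null;
    }
}
```

The sketch compares URL strings rather than calling URL.equals(), which may perform a host name lookup.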


Text files and binary files need separate download methods. Decide from the file extension whether a file is binary (an image) or text: the method nonTextFile() returns true if the file is not a text file. For efficiency, binary files are downloaded by a different method, downloadNonTextFile(), which does not parse the file's contents.
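
A rough sketch of this split follows. The method name nonTextFile() comes from the article, but its body here is a guess: the list of extensions is an assumption, and copyBytes() is a hypothetical stand-in for the byte-for-byte copy that a method such as downloadNonTextFile() would perform.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Locale;

class FileTypeSketch {
    // Assumed set of image extensions; extend it to cover other binary types.
    private static final String[] BINARY_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png"};

    // Returns true if the file name ends with one of the binary extensions.
    static boolean nonTextFile(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String ext : BINARY_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    // Copies the stream unchanged, without parsing, and returns the byte count.
    static long copyBytes(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
            total += n;
        }
        return total;
    }
}
```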


Resources