First, a brief introduction to URLs (Uniform Resource Locators) would not be out of place. The general form of a URL is:
protocol://machinename[:port]/filename[#reference]
An absolute URL -- such as http://java.sun.com/products/jdk1.2 -- has all the components required to identify the resource on the Web. In relative URLs, the protocol and the machine name are inherited from the base URL embedded in the document (base tag) or from the URL used to retrieve the document. For example, assume that you have downloaded an HTML document using the URL http://www.somesite.com/index.html and that this document has a link home.html. The link actually points to http://www.somesite.com/home.html. For more information, please see Resources.
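To make the resolution rule concrete, here is a minimal sketch that resolves a link against the URL used to retrieve the page, using the two-argument java.net.URL constructor. The class name RelativeUrlDemo and the method resolve() are illustrative only and are not part of the utility.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative only: RelativeUrlDemo and resolve() are hypothetical names.
class RelativeUrlDemo {

    // Resolves a link found in a page against the URL the page came from.
    // A relative link inherits the protocol and machine name of the base;
    // an absolute link is returned as given.
    static URL resolve(URL base, String link) throws MalformedURLException {
        return new URL(base, link);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.somesite.com/index.html");
        System.out.println(resolve(base, "home.html"));
        System.out.println(resolve(base, "http://www.othersite.com/a.html"));
    }
}
```

With the base URL from the example above, the relative link home.html resolves to http://www.somesite.com/home.html, while the absolute link is left unchanged.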
The utility I describe in this article uses the URL class in the java.net package. The class provides three methods to obtain data from the URL. In this utility, I use the method public final InputStream openStream() throws IOException, which establishes a connection with the URL and returns an InputStream object for reading the data. Note that the data does not contain any of the HTTP headers. This method hides all the intricacies of setting the appropriate parameters and connecting to the remote resource. You read from the returned InputStream just as you would from any other input stream.
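As a rough illustration of openStream(), the following sketch reads everything a URL returns into a string. The class name OpenStreamDemo, the method readAll(), and the choice of UTF-8 as the character encoding are my assumptions for the example; the utility's own code may handle the stream differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative only: OpenStreamDemo and readAll() are hypothetical names.
class OpenStreamDemo {

    // Opens the URL and returns its content as text, one '\n' after each line.
    // Assumption: the content is UTF-8 text.
    static String readAll(URL url) throws IOException {
        StringBuilder text = new StringBuilder();
        try (InputStream in = url.openStream();
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OpenStreamDemo <url>");
            return;
        }
        System.out.print(readAll(new URL(args[0])));
    }
}
```

Because openStream() works for any protocol the URL class supports, the same method reads a file: URL as readily as an http: URL, which is convenient for trying the code without a network connection.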
Some of the commonly used protocols are HTTP, FTP, Gopher, and News. This article deals only with HTTP (HyperText Transfer Protocol), an application-level protocol commonly used to transfer hypertext documents across the Internet. HTTP has gained importance because of its simplicity and low overhead.
This utility lets you download all the pages of a Website in a graceful manner. It follows these simple steps:
The utility consists of four classes: DownloadSite, Downloader, URLlist, and ExtendedURL. You can download the source code from Resources.
The DownloadSite class reads the command-line arguments and performs some initialization. It contains the main() method. The utility takes at least one and at most two arguments. The first argument is the site name, and the second is the location of the new directory created to hold the downloaded files. If you do not specify the second argument, the files are downloaded into the current directory. If you need to use this utility behind a firewall, make the necessary changes in DownloadSite. See Resources for information on how to access sites when you are behind a firewall.
DownloadSite parses the command line arguments and passes them to the Downloader class, which does the actual downloading.
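The argument handling described above might look something like the sketch below. This is not the actual DownloadSite source: the class name ArgsSketch, its fields, and the parse() method are hypothetical, and I assume here that the first argument is given as a full URL.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative only: ArgsSketch is a hypothetical stand-in, not DownloadSite.
class ArgsSketch {
    final URL site;        // first argument (assumed to be a full URL)
    final File targetDir;  // second argument, or the current directory

    ArgsSketch(URL site, File targetDir) {
        this.site = site;
        this.targetDir = targetDir;
    }

    // Accepts one or two arguments and rejects anything else.
    static ArgsSketch parse(String[] args) throws MalformedURLException {
        if (args.length < 1 || args.length > 2) {
            throw new IllegalArgumentException("usage: <site> [directory]");
        }
        URL site = new URL(args[0]);
        // With no second argument, fall back to the current directory.
        File dir = new File(args.length == 2 ? args[1] : ".");
        return new ArgsSketch(site, dir);
    }

    public static void main(String[] args) throws MalformedURLException {
        ArgsSketch parsed = parse(args);
        System.out.println("site: " + parsed.site + ", directory: " + parsed.targetDir);
    }
}
```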
Downloader is the heart of the utility. This class contains the logic used to download the pages and the code to manipulate the links. It uses recursion to download the pages. The logic is simple:
private void startDownload(URL u)
{
    ...
    /*
     * downloadAndFillVector downloads the file specified by URL u
     * (and also manipulates its links) and returns a vector of the
     * URLs found in that file. After this statement executes,
     * listOfURL contains the URLs in the current page that need
     * to be downloaded.
     */
    listOfURL = downloadAndFillVector(in, out);
    ...
    /*
     * Loop through all the elements of the vector and call
     * startDownload recursively. The process repeats until
     * all the pages are downloaded.
     */
    sizeOfVector = listOfURL.size();
    for (int i = 0; i < sizeOfVector; i++)
        startDownload((URL) listOfURL.elementAt(i));
}
I should explain two private members of this class, private String hostName and static Vector URLs.
hostName contains the machine name from the first page's URL (the URL provided at the command line). A page can contain two types of links: absolute and relative. If the link is relative, use hostName to retrieve the document. If the link is absolute, you must check whether the host name in the link is the same as hostName. If it is, include the link in the list of URLs to be downloaded. If it isn't, ignore the link. For example, if you are downloading the site www.somesite.com and one of its pages contains a link to www.othersite.com, you do not want to download pages from www.othersite.com.
URLs is the global vector to which you add every page you download. When you find a link, check whether it is already present in URLs. This prevents you from downloading a page twice. It also handles another common scenario: a page a.html often links to a page b.html, and b.html links back to a.html (for example, through a Back button). The static Vector URLs keeps you from falling into such loops.
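The two checks can be sketched together as follows. This is a simplified illustration, not the Downloader source: the class LinkFilterSketch and its members are hypothetical, an in-memory map of page URL to links stands in for the network, and I use a HashSet of URL strings where the utility uses a static Vector. Strings are used because URL.equals() may perform a host name lookup.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a stand-in for the host check and the "already seen" check.
class LinkFilterSketch {
    private final String hostName;                   // host of the first page
    private final Set<String> seen = new HashSet<>(); // plays the role of URLs
    private final Map<String, List<String>> pages;   // fake site: page URL -> links
    final List<String> downloaded = new ArrayList<>();

    LinkFilterSketch(String hostName, Map<String, List<String>> pages) {
        this.hostName = hostName;
        this.pages = pages;
    }

    // True only for links on the same host that have not been seen before.
    boolean shouldDownload(URL link) {
        if (!hostName.equalsIgnoreCase(link.getHost())) {
            return false;                  // link to another site: ignore it
        }
        return seen.add(link.toString());  // add() returns false for repeats
    }

    void crawl(URL start) throws MalformedURLException {
        seen.add(start.toString());
        startDownload(start);
    }

    private void startDownload(URL u) throws MalformedURLException {
        downloaded.add(u.toString());      // a real version would fetch the page here
        List<String> links = pages.getOrDefault(u.toString(), Collections.<String>emptyList());
        for (String href : links) {
            URL link = new URL(u, href);   // resolves relative links against u
            if (shouldDownload(link)) {
                startDownload(link);
            }
        }
    }
}
```

If a.html links to b.html and to a page on www.othersite.com, and b.html links back to a.html, a crawl starting at a.html visits a.html and b.html once each and then stops.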
To download text files and binary files, you must have separate methods for each. Use the file extension to decide whether the file is a binary (image) file or a text file. The method nonTextFile() returns true if the file is not a text file. For efficiency, call a different method, downloadNonTextFile(), to download binary files. This method does not perform any file parsing.
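A minimal sketch of this split is shown below. The signatures are my assumptions, since the article does not give them: here nonTextFile() takes the path portion of a URL, the set of text extensions (html, htm, txt) is illustrative, a path with no extension is treated as text, and copyBytes() is a hypothetical helper showing the kind of unparsed byte copy that a method such as downloadNonTextFile() could perform.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: signatures and the extension list are assumptions.
class FileTypeSketch {
    private static final Set<String> TEXT_EXTENSIONS =
            new HashSet<>(Arrays.asList("html", "htm", "txt"));

    // Returns true if the extension marks the file as something other than text.
    static boolean nonTextFile(String path) {
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot <= slash) {
            return false;  // no extension in the last path segment: assume text
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return !TEXT_EXTENSIONS.contains(ext);
    }

    // Copies the stream byte for byte, with no parsing or link manipulation.
    static void copyBytes(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}
```

Text files still need a line-oriented pass so the links can be found and rewritten, which is why the binary path is kept separate and as simple as possible.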