This article will look at some of the issues involved in designing and implementing such a Web crawler-type program.
Before we tackle the design of the program, let's review what Java provides in the field of network programming, since the program will necessarily delve into technical networking matters. Java happens to be wonderfully suited for all but the lowest-level TCP/IP programming tasks: The standard java.net package features a small but functional set of classes encapsulating TCP/IP in a way that greatly reduces potential network programming pitfalls. The java.net classes really let you concentrate on your application, rather than on myriad protocol programming details.
The main java.net classes used for doing any TCP/IP programming are Socket, InetAddress, and URL.
Using these three simple classes, you can write a host (no pun intended) of interesting Internet client software. Class Socket is your main gateway to the Internet: Given some host's address in the form of an InetAddress instance, and a port number to connect to, class Socket will hand you two I/O streams to communicate with your chosen server in full duplex (via one input stream and one output stream). Class URL allows both addressing and accessing of Web resources. This class therefore can be considered a higher-level combination of Socket and InetAddress but with a bias toward Web applications.
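To make the division of labor between these classes concrete, here is a minimal sketch rather than a definitive implementation. The class name SocketAndUrlSketch and the helper names statusLine and describe are made up for illustration. The first helper talks to a server through the two streams a Socket provides; the second lets class URL take an address apart. The request sent is a bare HTTP/1.0 GET, which is an assumption about what the server on the other end understands.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.InetAddress;
import java.net.MalformedURLException;
import java.net.Socket;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class SocketAndUrlSketch {

    // Hypothetical helper: connects to host:port, sends a minimal HTTP/1.0
    // GET for the given path, and returns the first line of the reply.
    static String statusLine(InetAddress host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            // The socket hands us one output stream and one input stream.
            Writer out = new OutputStreamWriter(
                    socket.getOutputStream(), StandardCharsets.US_ASCII);
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.US_ASCII));
            out.write("GET " + path + " HTTP/1.0\r\n\r\n");
            out.flush();
            return in.readLine();
        }
    }

    // Hypothetical helper: uses class URL to split an address into
    // protocol, host, port, and file, separated by single spaces.
    static String describe(String spec) {
        try {
            URL url = URI.create(spec).toURL();
            // getPort() returns -1 when the URL names no explicit port.
            int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
            return url.getProtocol() + " " + url.getHost() + " " + port + " " + url.getFile();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable URL: " + spec, e);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(describe("http://www.example.com/docs/index.html"));
        // Only touch the network when a host name is given on the command line.
        if (args.length > 0) {
            System.out.println(statusLine(InetAddress.getByName(args[0]), 80, "/"));
        }
    }
}
```

Run without arguments, the program only parses the example address; given a host name, it also opens a socket to port 80 of that host and prints whatever status line comes back.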
If you need to write server programs, need finer control over HTTP (WWW) exchanges, or need to use the User Datagram Protocol (UDP), then additional java.net classes such as ServerSocket, URLConnection, DatagramSocket, and DatagramPacket will help you in your endeavors.
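As an illustration of the UDP side, the following sketch sends one datagram over the loopback interface and reads it back, using the standard java.net classes DatagramSocket and DatagramPacket. The class name UdpLoopbackSketch, the method roundTrip, the 1024-byte buffer, and the two-second timeout are all assumptions chosen for the example, not anything the Web explorer itself requires.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

class UdpLoopbackSketch {

    // Hypothetical helper: sends the message as a single UDP datagram to a
    // receiver on the loopback interface and returns the text received.
    static String roundTrip(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (DatagramSocket receiver = new DatagramSocket(0, loopback); // port 0: any free port
             DatagramSocket sender = new DatagramSocket()) {
            // UDP gives no delivery guarantee, so do not wait forever.
            receiver.setSoTimeout(2000);

            byte[] payload = message.getBytes(StandardCharsets.UTF_8);
            sender.send(new DatagramPacket(
                    payload, payload.length, loopback, receiver.getLocalPort()));

            byte[] buffer = new byte[1024];
            DatagramPacket received = new DatagramPacket(buffer, buffer.length);
            receiver.receive(received);
            return new String(received.getData(), received.getOffset(),
                    received.getLength(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello over UDP"));
    }
}
```

Unlike a Socket, a DatagramSocket has no streams: each send and receive moves one self-contained packet, and the destination address travels with the packet rather than with the connection.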
Before starting any network projects, bear in mind that Java applications and Java applets get different licenses to use these classes. As you probably know by now, applets are severely crippled when it comes to networking capabilities: For security reasons applets are restricted to communicating only with the server that produced (served) the applet itself. The Web exploration program we're going to write needs full access to the Web, so by necessity it will be a Java application and not an applet.
Conceptually, our Web exploration program is very similar to a textbook (recursive) file system lister (phew, what a mouthful), just like the ubiquitous DIRs or ls-es we all know. But instead of a directory specification, our program takes any Web page as its sole argument. This page will act as the root of a hypertext graph to be explored. Every link (embedded in HTML <A> anchor tags) in a page leads to more pages that need to be visited. While the recursive algorithm in programs like DIR or ls is purely sequential in nature, there's an opportunity to design our Web explorer to be more efficient by harnessing the inherent parallelism of the Internet. Like all Web browsers, a Web explorer can download several Web pages at the same time, thus making fuller use of the bandwidth available on any given link.
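The idea can be sketched in a few dozen lines. What follows is one possible shape under stated assumptions, not the design this article goes on to develop: the class ParallelExplorerSketch and its methods are hypothetical, the link extraction is a simple regular expression rather than a real HTML parser, links are treated as opaque page names (no relative-URL resolution), and the page fetcher is passed in as a function so the sketch can run without a network. Instead of recursing, it explores the graph level by level and downloads all pages of one level in parallel on a fixed thread pool.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParallelExplorerSketch {

    // Simplified pattern for <A HREF="..."> anchors; real-world HTML needs more care.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the link targets found in the page text, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    // Explores the graph reachable from root. fetchPage maps a page name to
    // its HTML text (an assumption: the caller decides how pages are loaded).
    static Set<String> explore(String root, Function<String, String> fetchPage, int threads) {
        Set<String> visited = new HashSet<>();
        visited.add(root);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> frontier = Collections.singletonList(root);
            while (!frontier.isEmpty()) {
                // Start all downloads for this level at once.
                List<Future<List<String>>> downloads = new ArrayList<>();
                for (String page : frontier) {
                    downloads.add(pool.submit(() -> extractLinks(fetchPage.apply(page))));
                }
                // Collect results; only this thread touches the visited set.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> download : downloads) {
                    for (String link : download.get()) {
                        if (visited.add(link)) {
                            next.add(link); // first time we see this page
                        }
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Exploration interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A page download failed", e);
        } finally {
            pool.shutdown();
        }
        return new TreeSet<>(visited);
    }
}
```

The visited set plays the role that the directory tree's lack of cycles plays for DIR or ls: Web pages routinely link back to each other, so without it the exploration would never terminate. For real pages, fetchPage could read the text from a java.net.URL stream; for experiments, a lookup in an in-memory map of page name to HTML is enough.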