The Fork/Join library introduced in Java 7 extends the existing Java concurrency package with support for hardware parallelism, a key feature of multicore systems. In this Java Tip Madalin Ilie demonstrates the performance impact of replacing the Java 6 ExecutorService class with Java 7's ForkJoinPool in a web crawler application.
Web crawlers, also known as web spiders, are key to the success of search engines. These programs perpetually scan the web, gathering up millions of pages of data and sending it back to search-engine databases. The data is then indexed and processed algorithmically, resulting in faster, more accurate search results. While they are most famously used for search optimization, web crawlers also can be used for automated tasks such as link validation or finding and returning specific data (such as email addresses) in a collection of web pages.
Architecturally, most web crawlers are high-performance multithreaded programs, albeit with relatively simple functionality and requirements. Building a web crawler is therefore an interesting way to practice, as well as compare, multithreaded (or concurrent) programming techniques.
In this article I'll walk through two approaches to writing a web crawler: one using the Java 6 ExecutorService, and the other Java 7's ForkJoinPool. To follow the examples, you'll need Java 7 (update 2, as of this writing) installed in your development environment, as well as the third-party HtmlParser library.
The ExecutorService class is part of the java.util.concurrent package introduced in Java 5 (and carried forward in Java 6, of course), which simplified thread handling on the Java platform. ExecutorService is an Executor that provides methods for tracking the progress and managing the termination of asynchronous tasks. Before java.util.concurrent was introduced, Java developers relied on third-party libraries or wrote their own classes to manage concurrency in their programs.
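As a quick refresher, here is a minimal sketch of the basic ExecutorService pattern (the class name and pool size are arbitrary, and this code is not part of the crawler that follows): submit a Callable to a thread pool, track it with a Future, and shut the pool down when you're done.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorServiceExample {
    public static void main(String[] args) throws Exception {
        // Create a fixed pool of worker threads
        ExecutorService executor = Executors.newFixedThreadPool(4);

        // Submit an asynchronous task; the returned Future tracks its progress
        Future<Integer> result = executor.submit(new Callable<Integer>() {
            public Integer call() {
                return 40 + 2; // work performed on a pool thread
            }
        });

        // get() blocks until the task completes
        System.out.println("Result: " + result.get());

        // Initiate an orderly shutdown of the pool
        executor.shutdown();
    }
}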
Fork/Join, introduced in Java 7, isn't intended to replace or compete with the existing concurrency utility classes; instead it updates and completes them. Fork/Join addresses the need for divide-and-conquer, or recursive, task processing in Java programs (see Resources).
Fork/Join's logic is very simple: (1) separate (fork) each large task into smaller tasks; (2) process each task in a separate thread (separating those into even smaller tasks if necessary); (3) join the results.
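To make those three steps concrete, here is a minimal sketch (unrelated to the crawler; SumTask and its threshold are invented for illustration) that uses a RecursiveTask to sum an array by forking it into halves until each piece is small enough to compute directly.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a slice of an int array, splitting large slices into subtasks
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1000;
    private final int[] data;
    private final int start, end;

    public SumTask(int[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        if (end - start <= THRESHOLD) {
            long sum = 0;
            for (int i = start; i < end; i++) {
                sum += data[i];
            }
            return sum; // small enough: compute directly
        }
        int mid = (start + end) / 2;
        SumTask left = new SumTask(data, start, mid);
        SumTask right = new SumTask(data, mid, end);
        left.fork();                     // (1) fork the left half onto another worker thread
        long rightSum = right.compute(); // (2) process the right half in the current thread
        return left.join() + rightSum;   // (3) join the results
    }

    public static void main(String[] args) {
        int[] numbers = new int[1000000];
        for (int i = 0; i < numbers.length; i++) {
            numbers[i] = 1;
        }
        ForkJoinPool pool = new ForkJoinPool();
        System.out.println(pool.invoke(new SumTask(numbers, 0, numbers.length)));
    }
}

Note that the task forks one half and calls compute() on the other directly; that keeps the current worker thread busy instead of idle, which is a common Fork/Join idiom.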
The two web crawler implementations that follow are simple programs that demonstrate the features and functionality of the Java 6 ExecutorService and the Java 7 ForkJoinPool.
Our web crawler's task will be to find and follow links. Its purpose could be link validation, or it could be gathering data. (You might, for instance, instruct the program to search the web for pictures of Angelina Jolie, or Brad Pitt.)
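Both implementations lean on HtmlParser to pull links out of each downloaded page. As a rough sketch of that step (the class name and URL below are placeholders, and the crawler code organizes this differently), you can extract every anchor tag from a page like so:

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        // Parse a page and collect every anchor (<a href="...">) it contains
        Parser parser = new Parser("http://www.javaworld.com/");
        NodeList links = parser.extractAllNodesThatMatch(
                new NodeClassFilter(LinkTag.class));

        for (int i = 0; i < links.size(); i++) {
            LinkTag link = (LinkTag) links.elementAt(i);
            System.out.println(link.getLink()); // the URL the crawler would follow next
        }
    }
}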
Doug Lea, who led the JSR-166 effort behind java.util.concurrent and Fork/Join, maintains the Concurrency JSR-166 Interest Site.