
Java Tip 144: When to use ForkJoinPool vs ExecutorService

Know your options for multithreaded programming on the Java platform


The application architecture consists of the following:

  1. An interface that exposes basic operations to interact with links; i.e., get the number of visited links, add new links to be visited in queue, mark a link as visited
  2. An implementation for this interface that will also be the starting point of the application
  3. A thread/recursive action that will hold the business logic to check whether a link has already been visited. If not, it will gather all the links in the corresponding page, create a new thread/recursive task, and submit it to the ExecutorService or ForkJoinPool
  4. An ExecutorService or ForkJoinPool to handle waiting tasks

Note that a link is considered "visited" after all links in the corresponding page have been returned.

In addition to comparing ease of development using the concurrency tools available in Java 6 and Java 7, we'll compare application performance based on two benchmarks:

  • Search coverage: Measures the time required to visit 1,500 distinct links
  • Processing power: Measures the time in seconds required to visit 3,000 non-distinct links, much as you might measure an Internet connection's throughput in kilobits per second

While relatively simple, these benchmarks will provide at least a small window into the performance of Java concurrency in Java 6 versus Java 7 for certain application requirements.
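Neither benchmark requires elaborate instrumentation. As a rough illustration only (the actual timing code for the distinct-link benchmark appears in Listing 3), elapsed time can be captured with System.nanoTime() and a thread-safe counter. The BenchmarkTimer class and its threshold parameter below are hypothetical and are not part of the crawler's implementation:

package insidecoding.webcrawler.util;

import java.util.concurrent.atomic.AtomicInteger;

/**
 * Illustrative timing sketch, not part of the article's crawler: counts every
 * page fetch (distinct or not) and prints the elapsed time once a threshold
 * is reached.
 */
public class BenchmarkTimer {

    private static final long T0 = System.nanoTime();
    private static final AtomicInteger fetchCount = new AtomicInteger();

    /** Call once per fetched page, whether or not the URL was seen before. */
    public static void pageFetched(int threshold) {
        if (fetchCount.incrementAndGet() == threshold) {
            long elapsedSeconds = (System.nanoTime() - T0) / 1000000000L;
            System.out.println("Visited " + threshold + " links in "
                    + elapsedSeconds + " seconds");
        }
    }
}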

A Java 6 web crawler built with ExecutorService

For the Java 6 web crawler implementation we'll use a fixed-thread pool of 64 threads, which we create by calling the Executors.newFixedThreadPool(int) factory method. Listing 1 shows the main class implementation.

Listing 1. Constructing a WebCrawler

package insidecoding.webcrawler;

import java.util.Collection;
import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import insidecoding.webcrawler.net.LinkFinder;
import java.util.HashSet;

/**
 *
 * @author Madalin Ilie
 */
public class WebCrawler6 implements LinkHandler {

    private final Collection<String> visitedLinks = Collections.synchronizedSet(new HashSet<String>());
//    private final Collection<String> visitedLinks = Collections.synchronizedList(new ArrayList<String>());    
    private String url;
    private ExecutorService execService;

    public WebCrawler6(String startingURL, int maxThreads) {
        this.url = startingURL;
        execService = Executors.newFixedThreadPool(maxThreads);
    }

    @Override
    public void queueLink(String link) throws Exception {
        startNewThread(link);
    }

    @Override
    public int size() {
        return visitedLinks.size();
    }

    @Override
    public void addVisited(String s) {
        visitedLinks.add(s);
    }

    @Override
    public boolean visited(String s) {
        return visitedLinks.contains(s);
    }

    private void startNewThread(String link) throws Exception {
        execService.execute(new LinkFinder(link, this));
    }

    private void startCrawling() throws Exception {
        startNewThread(this.url);
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception {
        new WebCrawler6("http://www.javaworld.com", 64).startCrawling();
    }
}

In the WebCrawler6 constructor above, we create a fixed-size thread pool of 64 threads. We then start the program by calling the startCrawling method, which creates the first LinkFinder task and submits it to the ExecutorService.

Next, we create a LinkHandler interface, which exposes helper methods for interacting with URLs. The requirements are as follows: (1) mark a URL as visited using the addVisited() method; (2) get the number of visited URLs through the size() method; (3) determine whether a URL has already been visited using the visited() method; and (4) add a new URL to the queue through the queueLink() method.

Listing 2. The LinkHandler interface

package insidecoding.webcrawler;

/**
 *
 * @author Madalin Ilie
 */
public interface LinkHandler {

    /**
     * Places the link in the queue
     * @param link
     * @throws Exception
     */
    void queueLink(String link) throws Exception;

    /**
     * Returns the number of visited links
     * @return
     */
    int size();

    /**
     * Checks if the link was already visited
     * @param link
     * @return
     */
    boolean visited(String link);

    /**
     * Marks this link as visited
     * @param link
     */
    void addVisited(String link);
}

Now, as we crawl pages, we need to start up the rest of the threads, which we do via the LinkFinder class, shown in Listing 3. Note the linkHandler.queueLink(l) line.

Listing 3. LinkFinder

package insidecoding.webcrawler.net;

import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import insidecoding.webcrawler.LinkHandler;

/**
 *
 * @author Madalin Ilie
 */
public class LinkFinder implements Runnable {

    private String url;
    private LinkHandler linkHandler;
    /**
     * Used for statistics
     */
    private static final long t0 = System.nanoTime();

    public LinkFinder(String url, LinkHandler handler) {
        this.url = url;
        this.linkHandler = handler;
    }

    @Override
    public void run() {
        getSimpleLinks(url);
    }

    private void getSimpleLinks(String url) {
        //if not already visited
        if (!linkHandler.visited(url)) {
            try {
                URL uriLink = new URL(url);
                Parser parser = new Parser(uriLink.openConnection());
                NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
                List<String> urls = new ArrayList<String>();

                for (int i = 0; i < list.size(); i++) {
                    LinkTag extracted = (LinkTag) list.elementAt(i);

                    if (!extracted.getLink().isEmpty()
                            && !linkHandler.visited(extracted.getLink())) {

                        urls.add(extracted.getLink());
                    }

                }
                //we visited this url
                linkHandler.addVisited(url);

                if (linkHandler.size() == 1500) {
                    System.out.println("Time to visit 1500 distinct links = " + (System.nanoTime() - t0));                   
                }

                for (String l : urls) {
                    linkHandler.queueLink(l);
                }

            } catch (Exception e) {
                //ignore all errors for now
            }
        }
    }
}

The logic of the LinkFinder is simple: (1) we start parsing a URL; (2) after we gather all the links within the corresponding page, we mark the page as visited; and (3) we send each found link to the queue by calling the queueLink() method. This method creates a new LinkFinder task and submits it to the ExecutorService. If a free thread is available in the pool, the task is executed immediately; otherwise it is placed in the executor's waiting queue. After 1,500 distinct links have been visited, we print the statistics and the program continues to run.
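The ExecutorService keeps its worker threads alive, which is why the program continues to run after the statistics are printed. If you want the crawler to exit once the benchmark is done, the pool has to be shut down explicitly. The following sketch is illustrative only: a hypothetical stopCrawling() method added to WebCrawler6, assuming an additional import of java.util.concurrent.TimeUnit:

    // Illustrative only; not part of the original listing.
    // shutdown() stops accepting new tasks and lets queued LinkFinder tasks finish;
    // shutdownNow() interrupts anything still running after the timeout.
    private void stopCrawling() throws InterruptedException {
        execService.shutdown();
        if (!execService.awaitTermination(60, TimeUnit.SECONDS)) {
            execService.shutdownNow();
        }
    }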

