|
|
Optimize with a SATA RAID Storage Solution
Range of capacities as low as $1250 per TB. Ideal if you currently rely on servers/disks/JBODs
Page 3 of 4
The Fork/Join framework introduced in Java 7 is actually an implementation of the Divide and Conquer algorithm (see Resources), in which a central ForkJoinPool executes branching ForkJoinTasks. For this example we'll use a ForkJoinPool "backed" by 64 threads. I say backed because ForkJoinTasks are lighter than threads. In Fork/Join, a large number of tasks can be hosted by a smaller number of threads.
Similar to the Java 6 implementation, we start by instantiating in the WebCrawler7 constructor a ForkJoinPool object backed by 64 threads.
package insidecoding.webcrawler7;
import java.util.Collection;
import java.util.Collections;
import java.util.concurrent.ForkJoinPool;
import insidecoding.webcrawler7.net.LinkFinderAction;
import java.util.HashSet;
/**
*
* @author Madalin Ilie
*/
public class WebCrawler7 implements LinkHandler {
private final Collection<String> visitedLinks = Collections.synchronizedSet(new HashSet<String>());
// private final Collection<String> visitedLinks = Collections.synchronizedList(new ArrayList<>());
private String url;
private ForkJoinPool mainPool;
public WebCrawler7(String startingURL, int maxThreads) {
this.url = startingURL;
mainPool = new ForkJoinPool(maxThreads);
}
private void startCrawling() {
mainPool.invoke(new LinkFinderAction(this.url, this));
}
@Override
public int size() {
return visitedLinks.size();
}
@Override
public void addVisited(String s) {
visitedLinks.add(s);
}
@Override
public boolean visited(String s) {
return visitedLinks.contains(s);
}
/**
* @param args the command line arguments
*/
public static void main(String[] args) throws Exception {
new WebCrawler7("http://www.javaworld.com", 64).startCrawling();
}
}
Note that the LinkHandler interface in Listing 4 is almost the same as the Java 6 implementation from Listing 2. It's only missing the queueLink() method. The most important methods to look at are the constructor and the startCrawling() method. In the constructor, we create a new ForkJoinPool backed by 64 threads. (I've chosen 64 threads instead of 50 or some other round number because in the ForkJoinPool Javadoc it states that the number of threads must be a power of two.) The pool invokes a new LinkFinderAction, which will recursively invoke further ForkJoinTasks. Listing 5 shows the LinkFinderAction class:
package insidecoding.webcrawler7.net;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.RecursiveAction;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import insidecoding.webcrawler7.LinkHandler;
/**
*
* @author Madalin Ilie
*/
public class LinkFinderAction extends RecursiveAction {
private String url;
private LinkHandler cr;
/**
* Used for statistics
*/
private static final long t0 = System.nanoTime();
public LinkFinderAction(String url, LinkHandler cr) {
this.url = url;
this.cr = cr;
}
@Override
public void compute() {
if (!cr.visited(url)) {
try {
List<RecursiveAction> actions = new ArrayList<RecursiveAction>();
URL uriLink = new URL(url);
Parser parser = new Parser(uriLink.openConnection());
NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
for (int i = 0; i < list.size(); i++) {
LinkTag extracted = (LinkTag) list.elementAt(i);
if (!extracted.extractLink().isEmpty()
&& !cr.visited(extracted.extractLink())) {
actions.add(new LinkFinderAction(extracted.extractLink(), cr));
}
}
cr.addVisited(url);
if (cr.size() == 1500) {
System.out.println("Time for visit 1500 distinct links= " + (System.nanoTime() - t0));
}
//invoke recursively
invokeAll(actions);
} catch (Exception e) {
//ignore 404, unknown protocol or other server errors
}
}
}
}
The application logic so far is the same as it was in the Java 6 implementation. The difference in the code is that instead
of manually queuing the new links through the LinkHandler class, we submit them to the ForkJoinPool through the invokeAll() static method. Note the invokeAll(actions) line. The ForkJoinPool will schedule these tasks in the best possible way using the available 64 threads. A recursive action is over when the submitted
link has been visited (see if (!cr.visited(url))).
Now it's time to compare benchmarks. I accounted for JVM warmup when timing the two different implementations: first I ran
each program 10 times ignoring the results, then I ran it again 10 times again to compute an average timing. Between running
the Java 6 and Java 7 code I also called System.gc() numerous times to manually activate the garbage collector. I invoked both applications using the JVM flags -d64 -Xmx1512m, thus setting the platform to 64 bits and the maximum heap size to 1512 MB (see Resources).
I ran the tests on a Windows 7 SP1 64-bit machine, Intel Core i5 @2.67 GHz with 4,00 GB of RAM. I have installed the 64-bit version of JDK 7 update 2.
The timing of the Java 6 code is as follows (an average of all 10 runs):
Time to visit 1,500 distinct links: 45,404,628,454 nanoseconds Fastest time: 43,989,514,242 nanoseconds Slowest time: 47,077,714,098 nanoseconds
And here's the timing for the Java 7 implementation:
Time to visit 1,500 distinct links: 45,269,306,013 nanoseconds Fastest time: 42,365,714,625 nanoseconds Slowest time: 59,042,391,887 nanoseconds
As you can see, when accounting for search coverage (tasked with following 1,500 distinct links) there's not much difference between the two implementations.
In order to to test the second scenario I had to make some adjustments to both implementations. In both the WebCrawler6 and WebCrawler7 classes, I uncommented the synchronized List and commented the synchronized Set. For a benchmark based on following non-distinct links the Set implementation isn't required, but the List is.
// private final Collection<String> visitedLinks = Collections.synchronizedSet(new HashSet<String>());
private final Collection<String> visitedLinks = Collections.synchronizedList(new ArrayList<String>());
I also changed the visited() method to always return false, because for this benchmark it doesn't matter whether a link has been visited or not.
@Override
public boolean visited(String s) {
return false;//visitedLinks.contains(s);
}
Finally, I changed the conditions in the LinkFinder and LinkFinderAction classes to check for 3,000 links instead of 1,500:
if (cr.size() == 3000) {
System.out.println("Time for visit 3000 non-distinct links= " + (System.nanoTime() - t0));
}
The resulting benchmarks show that Fork/Join fared better when measuring processing power -- i.e., how many links each application processed per second.
Here's the timing of the Java 6 code, an average of the results for all 10 runs:
Time to visit 3,000 non-distinct links: 48,510,285,967 nanoseconds Fastest time: 44,189,380,355 nanoseconds Slowest time: 52,132,053,413 nanoseconds
This measurement is equivalent to 61.8425 links per second.
And here's the timing for the program written using Java 7:
Time to visit 3,000 non-distinct links: 31,343,446,584 nanoseconds
Fastest time: 30,533,600,312 nanoseconds
Slowest time: 33,308,851,937 nanoseconds
This is equivalent to 95.7137 links per second.
java.util.concurrent and Fork/Join, maintains the Concurrency JSR-166 Interest Site.