The application architecture consists of the following:
An ExecutorService or ForkJoinPool to handle waiting tasks
Note that a link is considered "visited" after all links in the corresponding page have been returned.
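Before diving into the listings, here is a minimal side-by-side sketch, with illustrative names and a placeholder task, of how the two pool types compared in this article are created and handed a plain Runnable. This is not part of the crawler code; it only shows that both pools accept ordinary Runnable tasks.

package insidecoding.sketches;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

public class PoolSetupSketch {

    public static void main(String[] args) {
        // Java 6 approach: a fixed-size thread pool; excess tasks wait in the pool's queue
        ExecutorService fixedPool = Executors.newFixedThreadPool(64);

        // Java 7 approach: a fork/join pool with the same parallelism level
        ForkJoinPool forkJoinPool = new ForkJoinPool(64);

        // Both accept plain Runnable tasks via execute(...)
        fixedPool.execute(new Runnable() {
            public void run() {
                System.out.println("task running in the fixed pool");
            }
        });
        forkJoinPool.execute(new Runnable() {
            public void run() {
                System.out.println("task running in the fork/join pool");
            }
        });

        fixedPool.shutdown();
        forkJoinPool.shutdown();
    }
}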
In addition to comparing ease of development using the concurrency tools available in Java 6 and Java 7, we'll compare application performance based on two benchmarks.
While relatively simple, these benchmarks will provide at least a small window into the performance of Java concurrency in Java 6 versus Java 7 for certain application requirements.
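The measurements in the listings below use System.nanoTime() to compute elapsed time. The minimal sketch here isolates that pattern; the sleep is just a stand-in for the work being benchmarked.

package insidecoding.sketches;

import java.util.concurrent.TimeUnit;

public class TimingSketch {

    public static void main(String[] args) throws Exception {
        long t0 = System.nanoTime();

        // placeholder for the work being benchmarked (e.g. crawling a set of links)
        Thread.sleep(250);

        long elapsed = System.nanoTime() - t0;
        System.out.println("Elapsed = " + elapsed + " ns ("
                + TimeUnit.NANOSECONDS.toMillis(elapsed) + " ms)");
    }
}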
For the Java 6 web crawler implementation we'll use a fixed thread pool of 64 threads, which we create by calling the Executors.newFixedThreadPool(int) factory method. Listing 1 shows the main class implementation.
package insidecoding.webcrawler;

import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import insidecoding.webcrawler.net.LinkFinder;

/**
 *
 * @author Madalin Ilie
 */
public class WebCrawler6 implements LinkHandler {

    private final Collection<String> visitedLinks =
            Collections.synchronizedSet(new HashSet<String>());
    // private final Collection<String> visitedLinks =
    //         Collections.synchronizedList(new ArrayList<String>());
    private String url;
    private ExecutorService execService;

    public WebCrawler6(String startingURL, int maxThreads) {
        this.url = startingURL;
        execService = Executors.newFixedThreadPool(maxThreads);
    }

    @Override
    public void queueLink(String link) throws Exception {
        startNewThread(link);
    }

    @Override
    public int size() {
        return visitedLinks.size();
    }

    @Override
    public void addVisited(String s) {
        visitedLinks.add(s);
    }

    @Override
    public boolean visited(String s) {
        return visitedLinks.contains(s);
    }

    private void startNewThread(String link) throws Exception {
        execService.execute(new LinkFinder(link, this));
    }

    private void startCrawling() throws Exception {
        startNewThread(this.url);
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception {
        new WebCrawler6("http://www.javaworld.com", 64).startCrawling();
    }
}
In the WebCrawler6 constructor above, we create a fixed-size thread pool (64 threads, as passed in from main()). We then start the program by calling the startCrawling() method, which creates the first LinkFinder task and submits it to the ExecutorService.
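Every LinkFinder task submitted to the pool calls back into the same WebCrawler6 instance, so the visited-link state must be thread-safe; that is why visitedLinks is wrapped with Collections.synchronizedSet. The standalone sketch below (illustrative names, not part of the article's code) isolates that wrapping technique.

package insidecoding.sketches;

import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;

public class VisitedSetSketch {

    // Wrapping the HashSet makes individual calls such as add() and contains() thread-safe
    private final Collection<String> visited =
            Collections.synchronizedSet(new HashSet<String>());

    public boolean markVisited(String url) {
        // add() returns true only the first time a given URL is inserted
        return visited.add(url);
    }

    public boolean isVisited(String url) {
        return visited.contains(url);
    }

    public int count() {
        return visited.size();
    }
}

Note that each individual call is synchronized, but a check-then-act sequence spanning two calls is still not atomic; in the crawler that is tolerated, since the worst case is parsing a page more than once.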
Next, we create a LinkHandler interface, which exposes helper methods for working with URLs. The requirements are as follows: (1) mark a URL as visited using the addVisited() method; (2) get the number of visited URLs through the size() method; (3) determine whether a URL has already been visited using the visited() method; and (4) add a new URL to the queue through the queueLink() method. Listing 2 shows the LinkHandler interface.
package insidecoding.webcrawler;

/**
 *
 * @author Madalin Ilie
 */
public interface LinkHandler {

    /**
     * Places the link in the queue
     * @param link
     * @throws Exception
     */
    void queueLink(String link) throws Exception;

    /**
     * Returns the number of visited links
     * @return
     */
    int size();

    /**
     * Checks if the link was already visited
     * @param link
     * @return
     */
    boolean visited(String link);

    /**
     * Marks this link as visited
     * @param link
     */
    void addVisited(String link);
}
Now, as we crawl pages, we need to start up the rest of the threads, which we do via the LinkFinder class, shown in Listing 3. Note the linkHandler.queueLink(l) line.
package insidecoding.webcrawler.net;

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

import insidecoding.webcrawler.LinkHandler;

/**
 *
 * @author Madalin Ilie
 */
public class LinkFinder implements Runnable {

    private String url;
    private LinkHandler linkHandler;

    /**
     * Used for statistics
     */
    private static final long t0 = System.nanoTime();

    public LinkFinder(String url, LinkHandler handler) {
        this.url = url;
        this.linkHandler = handler;
    }

    @Override
    public void run() {
        getSimpleLinks(url);
    }

    private void getSimpleLinks(String url) {
        // if not already visited
        if (!linkHandler.visited(url)) {
            try {
                URL uriLink = new URL(url);
                Parser parser = new Parser(uriLink.openConnection());
                NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
                List<String> urls = new ArrayList<String>();
                for (int i = 0; i < list.size(); i++) {
                    LinkTag extracted = (LinkTag) list.elementAt(i);
                    if (!extracted.getLink().isEmpty()
                            && !linkHandler.visited(extracted.getLink())) {
                        urls.add(extracted.getLink());
                    }
                }
                // we visited this url
                linkHandler.addVisited(url);
                if (linkHandler.size() == 1500) {
                    System.out.println("Time to visit 1500 distinct links = " + (System.nanoTime() - t0));
                }
                for (String l : urls) {
                    linkHandler.queueLink(l);
                }
            } catch (Exception e) {
                // ignore all errors for now
            }
        }
    }
}
The logic of LinkFinder is simple: (1) we start parsing a URL; (2) after we gather all the links within the corresponding page, we mark the page as visited; and (3) we send each found link to the queue by calling the queueLink() method. That method actually creates a new LinkFinder task and submits it to the ExecutorService. If a "free" thread is available in the pool, the task is executed immediately; otherwise it is placed in a waiting queue.
Once 1,500 distinct links have been visited, we print the statistics and the program continues to run.
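One detail worth making concrete: a pool created with Executors.newFixedThreadPool is backed by an unbounded LinkedBlockingQueue, so tasks submitted while all threads are busy simply wait there. The standalone sketch below (illustrative names, not part of the crawler) submits more tasks than the pool has threads and then inspects the backing queue.

package insidecoding.sketches;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class QueueingSketch {

    public static void main(String[] args) throws Exception {
        // A tiny pool so the queuing effect is easy to observe
        ExecutorService pool = Executors.newFixedThreadPool(2);

        for (int i = 0; i < 6; i++) {
            final int id = i;
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        Thread.sleep(500); // simulate work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    System.out.println("finished task " + id);
                }
            });
        }

        // The factory method returns a ThreadPoolExecutor backed by an unbounded queue,
        // so the tasks that could not start immediately are waiting in it
        ThreadPoolExecutor tpe = (ThreadPoolExecutor) pool;
        System.out.println("tasks waiting in queue: " + tpe.getQueue().size());

        pool.shutdown();
    }
}

With two threads and six submitted tasks, the printed queue size is typically 4, because the surplus tasks are still waiting when the count is taken.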
Doug Lea, author of java.util.concurrent and the Fork/Join framework, maintains the Concurrency JSR-166 Interest Site.