Wicked Cool Java: Crawling the Semantic Web

Get started with RDF

In this article, we examine techniques for extracting and processing data in the World Wide Web and the Semantic Web. The World Wide Web completely changed the way that people access information. Before the Web existed, finding obscure pieces of information meant taking a trip to the library, along with hours or perhaps days of research. In extreme cases, it meant calling or writing a letter to an expert and waiting for a reply. Today, not only are there Websites on every imaginable topic, but there are search engines, encyclopedias, dictionaries, maps, news, electronic books, and an incredible array of other data available online. Using search engines, we can find information on any topic within a few seconds. The Google search engine has even become so well known that it is now often used as a verb: "I Googled a solution." Online information is growing exponentially, and because of it, we have a completely new problem on our hands that is not solved by simply using keyword searches to find our data. The problem is infoglut. Keyword searches return too many documents, and most of those documents don't have the information that we want.

Suppose that we wanted to search for a Java class library that converts data from one format to another. With all the open source projects out there, someone may have already solved the problem for us, and we'd rather not reinvent the wheel. In theory, we should be able to search for projects that meet our needs. But running a query on related keywords may give us many results that are unrelated to what we really want. In an ideal world, we should be able to ask the computer a question: "Is there an open source Java API that converts between FORMAT1 and FORMAT2?" The computer should then search the Web and give us the name of a suitable API if it exists, along with a short description of the API and links to more detailed information. For this to happen, information about a hypothetical "J-convert-1-2" API would need to be encoded in such a way that the computer can find it easily without performing a keyword search and extracting data from the text results.

Information on the World Wide Web is mostly free-form text contained in HTML pages and is mostly not organized into categories and structures that search programs can easily query. At the very least, all Web content ought to have subject indicators similar to the Library of Congress and Dewey Decimal codes for books. This is not yet the case, although it will most likely happen soon. Several new standards are rapidly leading us in that direction. So far, all of these standards rely on Web content developers adding special tags to their data, and few developers know about these standards at the present time. In short, it's a mess out there, and we're trudging through this messy data looking for nuggets of gold.

The Semantic Web is the next-generation web of concepts linked to other concepts, rather than a collection of hypertext documents linked by keywords. If you think about it, an HTML anchor tag (link) is a keyword reference to another document. It supplies a word or phrase that links to another document, usually displayed as underlined text in a browser. But the link doesn't say how the two documents are related to each other, and its text may be extremely vague. A new standard, the Resource Description Framework (RDF), makes it possible to be much more specific about how things are related to each other. In fact, RDF describes much more than documents: any entities or concepts can be linked together. This is the basic idea behind the Semantic Web: concepts, rather than documents, can be linked together.

As Java developers, how can we participate in building the Semantic Web? First, you'll need to know something about official standards such as RDF. You will then need to tag your documents appropriately. Many sites are already starting to do some of this by creating RDF Site Summary (RSS) feeds. An RSS feed syndicates the content from a Website so that it can be combined with information from other sites and delivered to the users as aggregated content. RSS makes a small portion of a site available as a summary, similar to what you see in an article or news abstract. However, RSS enabling is only the first step in moving toward a Semantic Web. In this article I discuss enough to get you started working with RDF and introduce some APIs that help in producing or consuming content.

This somethings that: A short introduction to N3 and Jena

The theory behind the RDF standard is actually quite simple. Everything has a Uniform Resource Identifier (URI), and, by this, I mean everything: not only documents, but also generic concepts and relationships between them. Even though you are not a document (Or are you?), there could be a URI assigned to represent you as an entity. This URI can then be used to make connections to other things. For the "you" URI, these connections might represent related organizations, addresses, and phone numbers. URIs do not have to return an actual document! This is what sometimes confuses developers when they see a URI referenced somewhere and find that there is nothing at the location. These addresses are often used as markers or unique identifiers to represent concepts. We make links between URIs to represent relationships between things. This functions much like a simple sentence in English: Programmers enjoy Java.

To begin with, let's use a shorthand notation, called N3, to encode this sentence as an RDF graph. N3 is an easy way to learn RDF because the syntax is only slightly more complex than the sentence above! In essence, N3 is merely a set of triples, or "subject-predicate-object" relationships. Here is the N3 version of the sentence:

@prefix wcj: <http://example.org/wcjava/uri/> .
wcj:programmers wcj:enjoy wcj:java .

We first define a prefix to make the N3 code less verbose. The prefix stands for the beginning part of a URI wherever it is found in the document, so that wcj:java becomes http://example.org/wcjava/uri/java. (In N3, a full URI is written between < and > markers; these have nothing to do with XML.) The three items together are called a triple, and the verb is usually called a predicate. RDF makes a link by stating that a subject URI is related by a predicate URI to an object URI. The predicate represents some relationship between the subject and object: it tells how things link together. This is very different from an anchor in HTML, because here a relationship type is clearly defined. Remember that URIs in RDF can identify anything, whether concepts or documents, and that the object of a triple may in some cases be a string literal rather than a URI. In theoretical terms, we are creating a labeled directed graph of the relationship. A graph representation of the above might look like Figure 1.
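
To make the prefix mechanism concrete, here is a minimal plain-Java sketch of the expansion step. It only illustrates what an N3 reader does with a prefix and is not how Jena implements it; the class name PrefixExpansion and the expand() method are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Jena.
class PrefixExpansion {

    // Expands a prefixed name such as "wcj:java" into a full URI
    // by replacing the prefix with the URI it was declared to stand for.
    public static String expand(Map<String, String> prefixes, String name) {
        int colon = name.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Not a prefixed name: " + name);
        }
        String prefix = name.substring(0, colon);
        String base = prefixes.get(prefix);
        if (base == null) {
            throw new IllegalArgumentException("Unknown prefix: " + prefix);
        }
        return base + name.substring(colon + 1);
    }

    public static void main(String[] args) {
        // Equivalent of: @prefix wcj: <http://example.org/wcjava/uri/> .
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("wcj", "http://example.org/wcjava/uri/");

        // Subject, predicate, and object of the example triple.
        String[] names = {"wcj:programmers", "wcj:enjoy", "wcj:java"};
        for (String name : names) {
            System.out.println("<" + expand(prefixes, name) + ">");
        }
    }
}
```

Running main() prints the three full URIs of the triple, each between < and > markers, which is the long form that the prefix saves us from typing.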

Figure 1: RDF subject, predicate, and object

As you might expect, there is a Java API for creating and managing RDF and N3 documents. Jena is an open source API for working with RDF graphs. Here is one way to create the graph in Jena and serialize it to an N3 document:

 

// Jena 2.x package names; Apache Jena 3 and later use org.apache.jena.rdf.model.
import com.hp.hpl.jena.rdf.model.*;
import java.io.FileOutputStream;

// Build an empty in-memory model, then create the three URIs of the triple.
Model model = ModelFactory.createDefaultModel();
Resource programmers =
   model.createResource("http://example.org/wcjava/uri/programmers");
Property enjoy =
   model.createProperty("http://example.org/wcjava/uri/enjoy");
Resource java =
   model.createResource("http://example.org/wcjava/uri/java");

// Assert "programmers enjoy java" and serialize the model as N3.
model.add(programmers, enjoy, java);
FileOutputStream outStream = new FileOutputStream("out.n3");
model.write(outStream, "N3");
outStream.close();

Here, Jena is using the term property to refer to the predicate and resource to refer to something used as a subject or object. The model's write() method also has options to write out the document in other formats besides N3. With the Jena API, you can connect many entities together into very large semantic networks. Let's make some additional relationships using the entities and relationships that we just created. We will produce the graph shown in Figure 2.

Figure 2: An RDF graph with multiple subjects

Here is the additional code to produce the network in Figure 2:

// Additional properties and resources for the larger graph.
Property typeOf =
   model.createProperty("http://example.org/wcjava/typeOf");
Property use =
   model.createProperty("http://example.org/wcjava/use");
Property understand =
   model.createProperty("http://example.org/wcjava/understand");
Resource computers =
   model.createResource("http://example.org/wcjava/computers");
Resource progLang =
   model.createResource("http://example.org/wcjava/progLang");

// Three new statements that reuse the resources created earlier.
model.add(java, typeOf, progLang);
model.add(programmers, use, computers);
model.add(computers, understand, progLang);

// Write the enlarged model to a second N3 file and close the stream.
FileOutputStream outStream2 = new FileOutputStream("out2.n3");
model.write(outStream2, "N3");
outStream2.close();

The N3 output of this code is the following:

 

<http://example.org/wcjava/uri/java> <http://example.org/wcjava/typeOf> <http://example.org/wcjava/progLang> .

<http://example.org/wcjava/computers> <http://example.org/wcjava/understand> <http://example.org/wcjava/progLang> .

<http://example.org/wcjava/uri/programmers> <http://example.org/wcjava/uri/enjoy> <http://example.org/wcjava/uri/java> ;
    <http://example.org/wcjava/use> <http://example.org/wcjava/computers> .

The semicolon in the N3 document is a shortcut that indicates we are going to attach another property to the same subject ("programmers enjoy Java, and programmers use computers"). The meanings of elements within a document are often defined in terms of a predefined set of resources and properties called a vocabulary. Your RDF data can be combined with other data in existing vocabularies to allow semantic searches and analysis of complex RDF graphs. In the next section, I illustrate how to build upon existing RDF vocabularies to build your own vocabulary.
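
The semicolon shortcut is easy to picture as a grouping step. The following plain-Java sketch, which does not use Jena and is not meant to reproduce its N3 writer, groups triples by subject and joins the statements for each subject with a semicolon. The class name SemicolonGrouping and the format() method are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Jena.
class SemicolonGrouping {

    // Each triple is a String[3] of full URIs: subject, predicate, object.
    // Statements that share a subject are written once, separated by ";".
    public static String format(List<String[]> triples) {
        Map<String, List<String>> bySubject = new LinkedHashMap<>();
        for (String[] t : triples) {
            bySubject.computeIfAbsent(t[0], k -> new ArrayList<>())
                     .add("<" + t[1] + "> <" + t[2] + ">");
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<String>> e : bySubject.entrySet()) {
            out.append("<").append(e.getKey()).append("> ")
               .append(String.join(" ;\n    ", e.getValue()))
               .append(" .\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String[]> triples = new ArrayList<>();
        triples.add(new String[] {
            "http://example.org/wcjava/uri/programmers",
            "http://example.org/wcjava/uri/enjoy",
            "http://example.org/wcjava/uri/java"});
        triples.add(new String[] {
            "http://example.org/wcjava/uri/programmers",
            "http://example.org/wcjava/use",
            "http://example.org/wcjava/computers"});
        System.out.print(format(triples));
    }
}
```

Because both statements share the programmers subject, the sketch writes the subject once and attaches the second predicate and object after the semicolon, just as in the N3 document above.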

Triple the fun: Creating an RDF vocabulary for your organization

An RDF graph creates a web of concepts. It makes assertions about logical relationships between entities. RDF was meant to fit into a dynamic knowledge representation system rather than a static database structure. Once you have information in RDF, it can be linked with graphs made elsewhere, and software can use this to make inferences. If you define how your own items are related in terms of higher-level concepts, your data can fit into a much larger web of concepts. This is the basis of the Semantic Web.
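
Linking one graph with another is simpler than it may sound: because every node and predicate is a globally unique URI, merging two graphs amounts to taking the union of their triples. Here is a rough plain-Java sketch of that idea, with no Jena involved. The GraphMerge class is a hypothetical name, and the example.net URIs stand in for a vocabulary published by someone else; they are invented for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Jena.
class GraphMerge {

    // A triple is modeled as an immutable-size list: subject, predicate, object.
    public static List<String> triple(String s, String p, String o) {
        return Arrays.asList(s, p, o);
    }

    // Merging is a set union: identical triples collapse into one,
    // and any URI that appears in both graphs joins them together.
    public static Set<List<String>> merge(Set<List<String>> a, Set<List<String>> b) {
        Set<List<String>> merged = new LinkedHashSet<>(a);
        merged.addAll(b);
        return merged;
    }

    public static void main(String[] args) {
        String java = "http://example.org/wcjava/uri/java";

        Set<List<String>> ours = new LinkedHashSet<>();
        ours.add(triple("http://example.org/wcjava/uri/programmers",
                        "http://example.org/wcjava/uri/enjoy", java));
        ours.add(triple(java, "http://example.org/wcjava/typeOf",
                        "http://example.org/wcjava/progLang"));

        // A graph made elsewhere (assumed URIs) that also talks about "java".
        Set<List<String>> theirs = new LinkedHashSet<>();
        theirs.add(triple(java, "http://example.org/wcjava/typeOf",
                          "http://example.org/wcjava/progLang"));
        theirs.add(triple(java, "http://example.net/vocab/runsOn",
                          "http://example.net/vocab/jvm"));

        for (List<String> t : merge(ours, theirs)) {
            System.out.println(t);
        }
    }
}
```

The merged graph holds three triples rather than four, since the statement both graphs agree on is stored only once, and the shared java URI now connects our data to theirs.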

Every organization has relationships between information that is held in a datastore such as a database or flat file (or human memory!). If your data is in a relational database, your data items probably have relationships between them that are hidden or implied within the database structure itself. Your data may not be completely accessible, because there are relationships that an application cannot query. As an example, suppose that we have a relational database containing employees and departments within a company. A common approach is to create an Employee table, with columns for employee information such as ID number, date of birth, name, hire date, supervisor name, and department. There are many relationships hidden within the table and column names, and it is up to an application to know these relationships and take advantage of them. Column names alone would not give you the following information:

  • A and B are employees
  • An employee is a person
  • A supervisor is an employee who directs another employee
  • C is a company
  • A company is an organization
  • A and B work for C

Column and table names in a database are simply local identifiers and don't automatically map to any concepts that might be defined elsewhere. But this is domain knowledge that could be used more effectively by the application if it were defined in an extensible and machine-readable way. Having such information available would give our applications more flexibility, and this knowledge could also be reused elsewhere. How can we encode this information so that applications can make use of these relationships? And how can our application relate this to other information that we might find on the Semantic Web?

It may not make sense to put this metadata in your database, but you can create an RDF mapping outside the database schema that describes each item relative to the Semantic Web as a whole. We can represent some of these concepts using existing vocabularies. The rest of them we can define in our own terms. If you don't know where to connect a concept to an existing vocabulary, you can always define a URI for that concept now and make the connection to other systems later. At least you can use it to share data within your own organization if your vocabulary is well documented and the meaning of each item is clear. There are many basic vocabularies that RDF applications can use, and new ones are constantly being created (like yours!).
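
As one possible illustration, the facts in the employee list can be written as triples and queried with a very small inference step. The sketch below is plain Java rather than Jena, and it is only one way to model the example. It borrows two terms from existing W3C vocabularies, rdf:type and rdfs:subClassOf; everything under http://example.org/acme/ is a hypothetical vocabulary of our own, and the SimpleInference class name is likewise invented. The supervisor relationship is left out to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Jena.
class SimpleInference {

    // Terms from existing W3C vocabularies.
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
    static final String RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf";
    // Our own, assumed namespace for the company example.
    static final String ACME = "http://example.org/acme/";

    public static List<String> triple(String s, String p, String o) {
        return Arrays.asList(s, p, o);
    }

    // The facts that the column names alone could not express.
    public static Set<List<String>> exampleGraph() {
        Set<List<String>> g = new LinkedHashSet<>();
        g.add(triple(ACME + "A", RDF_TYPE, ACME + "Employee"));
        g.add(triple(ACME + "B", RDF_TYPE, ACME + "Employee"));
        g.add(triple(ACME + "Employee", RDFS_SUBCLASS_OF, ACME + "Person"));
        g.add(triple(ACME + "C", RDF_TYPE, ACME + "Company"));
        g.add(triple(ACME + "Company", RDFS_SUBCLASS_OF, ACME + "Organization"));
        g.add(triple(ACME + "A", ACME + "worksFor", ACME + "C"));
        g.add(triple(ACME + "B", ACME + "worksFor", ACME + "C"));
        return g;
    }

    // Returns every class the resource belongs to, following
    // subClassOf links transitively from its directly stated types.
    public static Set<String> typesOf(Set<List<String>> graph, String resource) {
        Set<String> types = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        for (List<String> t : graph) {
            if (t.get(0).equals(resource) && t.get(1).equals(RDF_TYPE)) {
                pending.add(t.get(2));
            }
        }
        while (!pending.isEmpty()) {
            String cls = pending.remove();
            if (!types.add(cls)) {
                continue; // already visited; also guards against cycles
            }
            for (List<String> t : graph) {
                if (t.get(0).equals(cls) && t.get(1).equals(RDFS_SUBCLASS_OF)) {
                    pending.add(t.get(2));
                }
            }
        }
        return types;
    }

    public static void main(String[] args) {
        for (String type : typesOf(exampleGraph(), ACME + "A")) {
            System.out.println(type);
        }
    }
}
```

Nothing in the data says directly that A is a person, yet the sketch reports both Employee and Person for A, because the graph states that every employee is a person. This is the kind of relationship that stays hidden when it lives only in table and column names.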
