An AI tool for the real world

Knowledge modeling with Protégé

Knowledge about the application domain is one of the most important cornerstones of successful software projects. You must gather at least a basic understanding of the concepts relevant to your customers before you can begin coding. For example, you need to know how your customer's business processes work before you can develop a warehouse management system; you need to know that users who buy cat food might also be interested in cat litter before you can implement purchase recommendations for an online shop; and you need to know that a Quillflinger is a monster that flings quills before you develop a role-playing game.

We acquire such knowledge from domain experts and capture it in some kind of domain model. In simple cases, we can scribble these models on paper. This approach works fine for small projects and when the experts help us decipher their handwriting. But it's better to have models that directly translate into a Java program. For instance, we can use Unified Modeling Language (UML) to sketch the domain models with class diagrams and use cases. UML is quite good for quickly getting to an implementation, but it is basically a language for object-oriented programming that few domain experts fully understand. And it consists of a fixed set of modeling constructs (such as classes and attributes) that are not very useful when domain experts would rather talk about specific business processes, products, and monsters.

If you want to more closely involve your experts and customers in the development process, you need more than UML. In this article, you will learn how to use Protégé, a simple yet powerful tool optimized for building domain models. Although Protégé was originally developed 15 years ago to support knowledge acquisition for rather specialized medical expert systems, it has also become very popular for many other purposes. Protégé is open source and currently has more than 7,500 registered users.

In a nutshell, you can use Protégé for the following:

  • Class modeling. Protégé provides a graphical user interface (GUI) for modeling classes (domain concepts) and their attributes and relationships.
  • Instance editing. From these classes, Protégé automatically generates interactive forms that enable you or domain experts to enter valid instances.
  • Model processing. Protégé has a library of plug-ins that help you define semantics, ask queries, and define logical behavior.
  • Model exchange. The resulting models (classes and instances) can be loaded and saved in various formats, including XML, UML, and RDF (Resource Description Framework). Protégé also provides a very scalable database back end.

From a programmer's perspective, one of Protégé's most attractive features is that it provides an open source API to plug in your own Java components and access the domain models from your application. As a result, you can develop systems very rapidly: just start with the underlying domain model, let Protégé generate the basic user interface, and then gradually write widgets and plug-ins to customize look-and-feel and behavior. You can even give Protégé to your customers and, with little training, let them build their own knowledge and requirement models.

Get started

I walk you through an example project to demonstrate how Protégé works and what else you can do with it. You can download all relevant files for this project from Resources and play with the tool while you read.

Let's assume our task is to develop a system that helps manage the articles and authors for an online magazine like JavaWorld. Articles are categorized by means of a topical index, consisting of keywords like "Swing" or "Design Patterns." Our system uses this index to propose related articles to the magazine's readers. The readers can provide feedback on the articles and rate their quality. The system uses this information to help editors decide whether submitted articles are worth publishing. This decision might depend on the ratings that previous articles by the author have received and whether articles with related topics have been recently published.
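To make the recommendation idea concrete, here is a minimal Java sketch of how related articles could be proposed from a topical index. It is not part of the example project: the class RelatedArticles, the method relatedTo, the ranking by number of shared keywords, and the sample titles are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: ranks other articles by how many index keywords they share. */
final class RelatedArticles {

    static List<String> relatedTo(String title, Map<String, Set<String>> topicsByTitle) {
        Set<String> own = topicsByTitle.getOrDefault(title, Collections.<String>emptySet());
        final Map<String, Integer> overlap = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : topicsByTitle.entrySet()) {
            if (entry.getKey().equals(title)) {
                continue; // an article is not related to itself
            }
            Set<String> shared = new HashSet<>(entry.getValue());
            shared.retainAll(own); // keep only the keywords both articles have
            if (!shared.isEmpty()) {
                overlap.put(entry.getKey(), shared.size());
            }
        }
        List<String> result = new ArrayList<>(overlap.keySet());
        // Most shared keywords first; ties are broken alphabetically for stable output.
        result.sort((a, b) -> {
            int byOverlap = Integer.compare(overlap.get(b), overlap.get(a));
            return byOverlap != 0 ? byOverlap : a.compareTo(b);
        });
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data; real titles and keywords would come from the domain model.
        Map<String, Set<String>> index = new HashMap<>();
        index.put("Swing Tips", new HashSet<>(Arrays.asList("Swing", "GUI")));
        index.put("Layout Managers", new HashSet<>(Arrays.asList("Swing", "GUI")));
        index.put("Observer in Swing", new HashSet<>(Arrays.asList("Swing", "Design Patterns")));
        index.put("Singleton", new HashSet<>(Arrays.asList("Design Patterns")));
        System.out.println(relatedTo("Swing Tips", index));
    }
}
```

Counting shared keywords is only one possible ranking. A real system might also weight reader ratings or publication dates.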

Install Protégé

Protégé is the result of various artificial intelligence (AI) and knowledge-modeling projects from the Medical Informatics group at Stanford University. The Protégé Website provides documentation, tutorials, and an extremely active discussion list. You can report problems and find a plug-in library, a collection of domain models, and the Protégé software.

Installers for all major platforms are available on Protégé's download page. To run Protégé (version 1.8), you need a Java 2 Platform, Standard Edition (J2SE) virtual machine (version 1.3 or above). You can choose to automatically install a suitable virtual machine from the Website. For this tutorial, don't forget to download the example project and extract it into a folder such as the examples folder from your Protégé installation.

When you start Protégé, the Welcome screen lets you choose to open an existing project or create a new one. Click on "Open other..." and select the Online Magazines.pprj project.

Protégé's main window consists of tabs that display the knowledge model's various aspects. You will see later that you can add additional tabs from a library or even develop your own tab components and plug them into Protégé.

Classes and slots

The most important tab when you start a project is the Classes tab, shown in Figure 1. In Protégé and many other knowledge-modeling tools, classes are named concepts from the domain that can have attributes and relations. Protégé classes are comparable to Java or UML classes, but without attached methods. Classes can be arranged in an inheritance hierarchy, which is displayed in the tree panel on the left side of the Classes tab. The properties of the class selected in the tree are displayed in the tab's main area. Protégé supports multiple inheritance, and classes can be either abstract or concrete. As in Java, only concrete classes can have instances.

The example project (see Figure 1) has defined classes for various content types (e.g., Articles and Tips 'N Tricks), authors, readers, feedback, and a topic hierarchy used to categorize content.

Figure 1. Protégé's class editor.

In Protégé, classes' attributes and their relations are called slots. A slot has a name and a value type. Protégé supports the primitive value types boolean, integer, float, and string, which are handled like they are in Java. For example, you can define the class Person and assign a slot called name to it with string as the value type. Additionally, a value type called symbol can represent enumerations of string values (e.g., the 12 different month names). Apart from primitive values, slots can also refer to the model's instances and classes. You can use slots to build relationships and associations between instances, such as between articles and their author(s). Slots store either single or multiple values.
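As a rough illustration of the value types just described, the following Java sketch models a slot as a name plus a value type and checks whether a value fits. It is a stand-in written for this article, not the Protégé API: the class SlotTypeSketch, its ValueType enum, and the accepts method are assumed names, and slots that refer to instances or classes are left out to keep it short.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of a slot's name and value type; not a Protégé API class. */
final class SlotTypeSketch {
    enum ValueType { BOOLEAN, INTEGER, FLOAT, STRING, SYMBOL }

    final String name;
    final ValueType type;
    final List<String> allowedSymbols; // only meaningful for SYMBOL slots

    SlotTypeSketch(String name, ValueType type, String... allowedSymbols) {
        this.name = name;
        this.type = type;
        this.allowedSymbols = Arrays.asList(allowedSymbols);
    }

    /** Returns true if the value matches this slot's value type. */
    boolean accepts(Object value) {
        switch (type) {
            case BOOLEAN: return value instanceof Boolean;
            case INTEGER: return value instanceof Integer;
            case FLOAT:   return value instanceof Float || value instanceof Double;
            case STRING:  return value instanceof String;
            case SYMBOL:  return allowedSymbols.contains(value); // one of the enumerated strings
            default:      return false;
        }
    }

    public static void main(String[] args) {
        SlotTypeSketch name = new SlotTypeSketch("name", ValueType.STRING);
        SlotTypeSketch month = new SlotTypeSketch("month", ValueType.SYMBOL,
                "January", "February", "March"); // shortened list for the example
        System.out.println(name.accepts("Holger"));
        System.out.println(month.accepts("March"));
        System.out.println(month.accepts("Smarch"));
    }
}
```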

To define a slot for your class, click on the C button above the list of template slots in the Classes tab. This action opens a dialog, shown in Figure 2. If you want an overview of all existing slots in your model, switch to the Slots tab.

Figure 2. Slots are attributes or relationships between classes. The authors slot stores the list of authors.

From what we've seen so far, slots are very similar to conventional object-oriented attributes and relations. However, some important details make slot definitions richer than most object-oriented concepts. A main difference is that a slot can attach to multiple classes. In our magazine project, some but not all Contents subclasses can have subtitles, so we can define a slot subTitle and simultaneously assign it to multiple classes.

Another major difference is that you can specify constraints on slot values. Constraints restrict a slot's range of allowed values. One of these constraints restricts a slot's cardinality: you can specify the minimum and maximum number of values a slot holds. This feature resembles UML, where you can define cardinalities like [0..1] or [0..*]. Protégé also allows you to define inverse slots and default values for slots. Furthermore, you can restrict the range of numeric slots (integer and float) by minimum and maximum values. All these constraints help you build correct domain models, because Protégé can flag an instance's invalid values.
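The cardinality and range constraints can be pictured with a short Java sketch. This is only an illustration under my own assumptions, not how Protégé implements its checks: NumericSlotSketch, its fields, and the rating slot with values from 1 to 5 are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical numeric slot with cardinality and range constraints; not a Protégé API class. */
final class NumericSlotSketch {
    final String name;
    final int minCardinality; // minimum number of values
    final int maxCardinality; // maximum number of values; -1 means unbounded, as in [0..*]
    final double minValue;
    final double maxValue;

    NumericSlotSketch(String name, int minCardinality, int maxCardinality,
                      double minValue, double maxValue) {
        this.name = name;
        this.minCardinality = minCardinality;
        this.maxCardinality = maxCardinality;
        this.minValue = minValue;
        this.maxValue = maxValue;
    }

    /** Returns one message per violated constraint; an empty list means the values are valid. */
    List<String> validate(List<Double> values) {
        List<String> problems = new ArrayList<>();
        if (values.size() < minCardinality) {
            problems.add(name + ": needs at least " + minCardinality + " value(s)");
        }
        if (maxCardinality >= 0 && values.size() > maxCardinality) {
            problems.add(name + ": allows at most " + maxCardinality + " value(s)");
        }
        for (double value : values) {
            if (value < minValue || value > maxValue) {
                problems.add(name + ": value " + value + " is out of range");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Assumed example: exactly one rating between 1 and 5.
        NumericSlotSketch rating = new NumericSlotSketch("rating", 1, 1, 1, 5);
        System.out.println(rating.validate(Arrays.asList(4.0)));
        System.out.println(rating.validate(Arrays.asList(7.0)));
    }
}
```

Returning a list of problems instead of throwing an exception mirrors the idea that invalid values are displayed rather than rejected outright.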

Protégé slots are global objects (i.e., they can exist even without being assigned to a class). You can define their properties either globally or individually for each class to which the slot is assigned. For the latter, Protégé lets you override a slot's properties, so you can define value type, cardinality, and more separately for each class. You see the difference when you double-click on a slot in the Classes tab: Protégé asks whether you want to see the "top-level slot" or the "slot at class."
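One way to think about the "top-level slot" versus the "slot at class" is a global default with optional per-class overrides. The Java sketch below shows that lookup for a single property, the maximum cardinality. The class SlotOverrideSketch and its methods are hypothetical names, and Protégé's internal representation may well differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a global slot property with per-class overrides. */
final class SlotOverrideSketch {
    static final int UNBOUNDED = -1;

    private final String slotName;
    private final int topLevelMaxCardinality; // the "top-level slot" setting
    private final Map<String, Integer> maxCardinalityAtClass = new HashMap<>(); // "slot at class" settings

    SlotOverrideSketch(String slotName, int topLevelMaxCardinality) {
        this.slotName = slotName;
        this.topLevelMaxCardinality = topLevelMaxCardinality;
    }

    /** Overrides the maximum cardinality for one class only. */
    void overrideAt(String className, int maxCardinality) {
        maxCardinalityAtClass.put(className, maxCardinality);
    }

    /** Returns the override for the class if there is one, otherwise the top-level value. */
    int maxCardinalityAt(String className) {
        Integer override = maxCardinalityAtClass.get(className);
        return override != null ? override : topLevelMaxCardinality;
    }

    public static void main(String[] args) {
        SlotOverrideSketch authors = new SlotOverrideSketch("authors", UNBOUNDED);
        authors.overrideAt("Tip", 1); // assumed: a tip has a single author
        System.out.println(authors.slotName + " at Article: " + authors.maxCardinalityAt("Article"));
        System.out.println(authors.slotName + " at Tip: " + authors.maxCardinalityAt("Tip"));
    }
}
```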

The slot restrictions mentioned so far ensure that the model's instances fulfill simple constraints. For more complex constraints, Protégé has a built-in language called Protégé Axiom Language (PAL). PAL is similar to the Object Constraint Language (OCL) in UML. In the example project, PAL tells Protégé that no online magazine reader can review the same article more than once. Although PAL may look unusual at first glance, it is actually very powerful. Besides PAL, Protégé has extensions like the JessTab (see below) that can also express constraints and other kinds of "meaning."
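To show what the review constraint means, here is the same rule written as a plain Java check. This is not PAL and not taken from the example project: the Feedback class, its reader and article fields, and the findDuplicates method are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical check: no reader may review the same article more than once. */
final class DuplicateReviewCheck {

    /** One piece of feedback: which reader reviewed which article. */
    static final class Feedback {
        final String reader;
        final String article;

        Feedback(String reader, String article) {
            this.reader = reader;
            this.article = article;
        }
    }

    /** Returns every feedback entry that repeats an earlier (reader, article) pair. */
    static List<Feedback> findDuplicates(List<Feedback> allFeedback) {
        Set<String> seen = new HashSet<>();
        List<Feedback> duplicates = new ArrayList<>();
        for (Feedback feedback : allFeedback) {
            // The NUL character separates the two names so different pairs cannot collide.
            String key = feedback.reader + '\u0000' + feedback.article;
            if (!seen.add(key)) {
                duplicates.add(feedback); // add() returns false when the pair was already present
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Feedback> all = Arrays.asList(
                new Feedback("alice", "Swing Tips"),
                new Feedback("bob", "Swing Tips"),
                new Feedback("alice", "Swing Tips")); // violates the constraint
        System.out.println(findDuplicates(all).size() + " violation(s)");
    }
}
```

A declarative constraint language states the rule once and leaves evaluation to the tool. The Java version has to spell out the iteration by hand.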

Instances and forms

Now that the classes that describe our domain's concepts and their restrictions have been defined, you can use Protégé's Instances tab to define these classes' instances. As in Java, instances are specific entities of a given class, such as a specific Article. Protégé makes defining instances much easier, because it automatically generates graphical forms that contain text fields, radio buttons, check boxes, combo boxes, lists, and other widgets to make editing as convenient as possible. Figure 3 shows a sample form.

Figure 3. Protégé's instances editor. Forms like the one on the right are automatically generated from the class definition.

Using these forms, you or your domain expert can enter instances as soon as an initial draft of classes becomes available. There's no turnaround time between changing a class and getting the corresponding forms. A change to a class automatically rearranges its forms. Therefore, Protégé is excellent for rapid prototyping. You can try different variations of your class hierarchy before you start coding.

Of course, the automatically generated forms are not always perfect, but you can change the types of widgets and their layout with little effort. Go to the Forms tab (shown in Figure 4) and select the form of the class you wish to edit. Here, you drag and resize the widgets on the screen as you would in a Java IDE's GUI builder. For example, you can specify that an author's about slot should be edited in a TextAreaWidget instead of a TextFieldWidget.

Figure 4. With Protégé, you can easily adapt the forms that edit instances by dragging widgets.

Projects and ontologies

Protégé projects are sometimes called ontologies, an AI term similar to the notion of domain models in software development. An ontology is a collection of domain concepts and their relationships. Ontologies come in various forms and can be represented in Protégé as well as in Java or UML classes. For example, the Yahoo! Website directory can be considered an ontology: Yahoo! defines concepts like "Health" and "Science" and lets us browse Websites that match these categories or related topics.

The notion of ontologies also plays a central role in the emerging Semantic Web. The Semantic Web consists of Internet sites that provide information in a format that is "meaningful" to machines. For example, online shops could publish their special offers so that your intelligent shopping software can find the best bargains while you sleep. Ontologies store the concepts used to describe this information. For an online shop, the ontology could contain classes like Product with slots like price. Protégé can edit such ontologies and save them in various formats.

Plug-ins

The Protégé system's core is a flexible platform that accepts additional modules as needed. This mechanism ensures that you can adapt the system to your specific needs. Many of these modules, called plug-ins, were developed by Protégé user community members, some directly at Stanford. Most existing plug-ins are available from the Protégé Website, and you can build your own extensions with little effort.

Protégé currently supports three types of plug-ins: storage plug-ins, slot widgets, and tabs. Let's look at each in turn.

Storage plug-ins

A storage plug-in is a nonvisual module that saves and loads models in a certain file or database format. Protégé currently supports the following file storage formats:

  • CLIPS (C Language Integrated Production System), Protégé's standard format
  • XML
  • XML Schema
  • RDF
  • OIL (Ontology Inference Layer)
  • DAML+OIL (DARPA Agent Markup Language plus OIL; DARPA is the Defense Advanced Research Projects Agency)
  • UML
  • XMI (XML Metadata Interchange), for MOF (Meta-Object Facility) metamodels

Additional support for the new Semantic Web language, OWL (Web Ontology Language), is under construction. You can also store huge ontologies in relational database tables. Some Protégé users have created ontologies with several hundred thousand concepts.

Before you can use all of these services, you might need to install them from the plug-in Website. After you have done that, you will find additional file formats in the "Save in format..." dialog.

Compatibility with various formats allows you to reuse ontologies and domain models from other projects. In the online magazine example, you can reuse a computer science ontology someone else developed for the topical index. In fact, standard models for some domains already exist. Protégé projects can also include other projects, so you can build complex domain models out of basic building blocks.

Slot widgets

Slot widgets are graphical components such as text fields and combo boxes placed in Protégé's instance forms to view and edit a slot value. Protégé has numerous built-in slot widgets, including a sophisticated widget that displays instances and their relationships in two-dimensional graphs. Protégé's plug-in library contains additional slot widgets for specific data types such as calendar and date widgets, and components that display images, sounds, and videos.

Tab plug-ins

Tabs are GUI panels displayed as tabs in Protégé's main window. Protégé has several default tabs, including Figure 1's Classes tab, Figure 3's Instances tab, and Figure 4's Forms tab. You can enable and disable additional tabs in the Project/Configure... menu. In this menu, you can also activate tabs you download from the Protégé plug-in library. Here are some examples of additional tabs that currently exist:

Visualization tabs

  • Jambalaya provides a hierarchical ontology browser that allows for interactive editing of existing data. Its browser combines an advanced implementation of a hypertext navigation metaphor with animated panning and zooming motions over the nested graph, which provides continuous orientation and contextual cues for the user.
  • TGVizTab visualizes the classes from a model in interactive graphs, based on the popular TouchGraph library.
  • OntoViz provides a highly configurable graphical display of models in graphs similar to UML diagrams.

Project and file management tabs

  • BeanGenerator generates JavaBeans classes from a Protégé class model. The resulting beans let you conveniently access Protégé domain models from your Java program, especially from intelligent software agents.
  • DataGenie enables Protégé to read arbitrary databases using the Java Database Connectivity (JDBC) interface. Generally, each database table becomes a class, and each attribute becomes a slot.
  • Prompt allows you to manage multiple domain models in Protégé, in particular to merge two models into one, to extract a part of a model, or to identify differences between a model's two versions.

Tabs for making queries and intelligent reasoning

  • Query tab is used to ask queries on the knowledge model, for example, to retrieve all articles that have a certain topic.
  • PAL constraint and query tabs provide a powerful front end for editing and evaluating expressions in the PAL. The EZPAL plug-in facilitates the acquisition of PAL expressions.
  • JessTab connects Protégé to the Java Expert System Shell (JESS), which is very useful for specifying complex constraints and for defining rules that derive new knowledge from existing knowledge.
  • Algernon performs forward and backward rule-based processing of Protégé knowledge bases and efficiently stores and retrieves information in ontologies and knowledge bases.
  • PrologTab is a SourceForge project that integrates a Prolog inference engine with Protégé knowledge bases.

These are some of the modules where Protégé can unfold its full AI support. With some training, you can use these plug-ins to develop clever services. For example, using the JessTab, the online magazine software could automatically notify editors when a certain topic has not been covered for a long time. The system might also filter authors who have written about related topics. Of course, you can implement these and similar services in pure Java, too. But Protégé makes it possible to provide an astonishing number of features without coding. Many of these features are even accessible to nonprogrammers like your domain experts and customers.
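For comparison, here is roughly what the "topic not covered for a long time" service looks like when written by hand in pure Java. It is a sketch under assumptions of my own: the class StaleTopicRule, the staleTopics method, the map from topic to last publication date, and the 90-day threshold are all hypothetical, and a rule engine would express the same condition declaratively.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical rule: report topics whose latest article is older than a threshold. */
final class StaleTopicRule {

    static List<String> staleTopics(Map<String, LocalDate> lastCovered,
                                    LocalDate today, long maxAgeDays) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, LocalDate> entry : lastCovered.entrySet()) {
            long ageInDays = ChronoUnit.DAYS.between(entry.getValue(), today);
            if (ageInDays > maxAgeDays) {
                stale.add(entry.getKey());
            }
        }
        Collections.sort(stale); // alphabetical order keeps the report stable
        return stale;
    }

    public static void main(String[] args) {
        // Made-up dates for illustration only.
        Map<String, LocalDate> lastCovered = new HashMap<>();
        lastCovered.put("Swing", LocalDate.of(2003, 1, 1));
        lastCovered.put("Design Patterns", LocalDate.of(2003, 6, 1));
        List<String> stale = staleTopics(lastCovered, LocalDate.of(2003, 7, 1), 90);
        System.out.println("Topics to notify editors about: " + stale);
    }
}
```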

The Protégé API

If the list of plug-ins above is still not sufficient for your needs, you can easily build your own extensions. Protégé has an open source API that you can use to implement or customize plug-ins and to access Protégé models from standalone programs. Here's how it works:

Write plug-ins

A Protégé plug-in is essentially a Java class that subclasses a certain Protégé base class. Let's assume you want to add a new tab that sends an email to all authors in your system who have written about a certain topic. This tab could consist of a list where you select the topic and a button to send the mail. To do so, you just need to subclass the Protégé API class AbstractTabWidget. Since this class derives from JPanel, you add the list component and the button directly into it. Then you only need to put your new classes into a jar file and put this jar file into Protégé's plugin folder. That's it! You can activate the tab the next time you start Protégé. For details and example code, please check Protégé's Website.

The plug-in mechanism means you can include almost any other piece of Java software in Protégé, so that Protégé and your own system can share models at runtime.

Access Protégé models from Java applications

If you would rather have a standalone application without Protégé, the following code shows how easy it is to build Protégé applications. Let's assume you wish to access the online magazine's articles with a Java application. The basic program just prints all of the articles' titles and their topics. Later, you might extend this functionality to create a list of articles in HTML from a servlet, but I keep this example simple for now.

To get started, you must first include the protege.jar in your classpath. Then you import the classes and interfaces from the edu.stanford.smi.protege.model package, which provide access to Protégé models and project files:

import edu.stanford.smi.protege.model.*; 
import java.util.*; 

The class Project represents Protégé projects, and you use its constructor to load an existing project file, such as the example project. When the project loads without errors, you access the domain model with the getKnowledgeBase() method:

public class ArticlePrinter {
    // Path to the Protégé project file to load
    private static final String PROJECT_FILE_NAME = "...";

    public static void main(String[] args) {
        // Protégé adds any problems it finds while loading to this collection
        Collection errors = new ArrayList();
        Project project = new Project(PROJECT_FILE_NAME, errors);
        if (errors.isEmpty()) {
            printArticles(project.getKnowledgeBase());
        } else {
            // Report every load error instead of printing the articles
            Iterator i = errors.iterator();
            while (i.hasNext()) {
                System.out.println("Error: " + i.next());
            }
        }
    }

Now you access the classes, slots, and instances from the model with the KnowledgeBase object. The Protégé class Cls represents classes, and you look up classes by their names. The getInstances() method delivers all instances of the given Protégé class. The getOwnSlotValue() methods get a given slot's value(s) for an instance:

    private static void printArticles(KnowledgeBase kb) {
        System.out.println("Articles:");
        // Look up the class and the slots by name once, before the loop
        Cls articleCls = kb.getCls("Article");
        Slot titleSlot = kb.getSlot("title");
        Slot topicsSlot = kb.getSlot("topics");
        Iterator articles = articleCls.getInstances().iterator();
        while (articles.hasNext()) {
            Instance article = (Instance) articles.next();
            // Read the title as a single value
            String title = (String) article.getOwnSlotValue(titleSlot);
            System.out.println("- " + title);
            // Read all topic values; each one is a class from the topic hierarchy
            Iterator topics = article.getOwnSlotValues(topicsSlot).iterator();
            while (topics.hasNext()) {
                Cls topic = (Cls) topics.next();
                System.out.println("   topic: " + topic.getName());
            }
        }
    }
}

Note that this example uses the generic Protégé API to access Protégé models. The Protégé class Article is stored as a Cls object, and its instances are stored as Instance objects. Some applications might require a closer mapping between the application and the Protégé model, so that articles are stored as instances of a Java class Article. Protégé provides several mechanisms that generate such classes. For example, you can export your model to UML and then generate Java classes with tools like Poseidon for UML. Or you can let Protégé directly generate Java classes with the BeanGenerator.
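To give a feel for the closer mapping, here is a hand-written sketch of what a Java class for articles might look like. It is not actual BeanGenerator or UML tool output, which may be structured quite differently. The property names title and topics follow the slots used in the listing above, while the addTopic method and the use of plain strings for topics are my own simplifications.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical bean mirroring the Article class of the domain model. */
class Article {
    private String title;
    private final List<String> topics = new ArrayList<>();

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    /** Read-only view, so callers must go through addTopic to change the list. */
    public List<String> getTopics() {
        return Collections.unmodifiableList(topics);
    }

    public void addTopic(String topic) {
        topics.add(topic);
    }

    public static void main(String[] args) {
        Article article = new Article();
        article.setTitle("Knowledge modeling with Protégé");
        article.addTopic("Design Patterns");
        // Typed getters replace the generic slot lookups and casts.
        System.out.println("- " + article.getTitle());
        for (String topic : article.getTopics()) {
            System.out.println("   topic: " + topic);
        }
    }
}
```

With such a class, application code works with typed getters instead of slot names and casts, at the price of regenerating or updating the class whenever the model changes.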

Protégé in the real world

This article has given you an idea of what Protégé can do for you. Protégé is a Java tool for building domain models. More and more software developers recognize that domain modeling is a crucial task in modern development methodologies. Recent approaches like the Model Driven Architecture (MDA) emphasize that such domain models should be designed and implemented at a high level of abstraction. In the MDA, you start with very general domain models that capture your domain concepts and business logic in an application-independent way. Then you translate these generic models into specific platforms, such as plain Java classes, Enterprise JavaBeans, or .Net components.

One of the MDA's basic assumptions is that UML diagrams can be better maintained and reused than Java code. AI technology suggests that knowledge models (ontologies) can be even better maintained and reused than UML diagrams. Protégé helps you rapidly define such models and their semantics, and automatically generates the necessary GUI elements so your domain experts can conveniently enter their knowledge. From there, let Protégé generate other models and integrate them into your Java application. Welcome to the real world!

Holger Knublauch holds a PhD in computer science and has worked in the area of knowledge modeling and applied AI since 1993. For his thesis, he developed a Java extension for knowledge modeling, including an extensible modeling tool platform. He currently works as a post-doctoral research fellow at Stanford Medical Informatics. In this position, he is responsible for various Protégé platform features, including the UML back end and support for Semantic Web languages like the forthcoming W3C (World Wide Web Consortium) standard OWL.
