An AI tool for the real world

Knowledge modeling with Protégé

Knowledge about the application domain is one of the most important cornerstones of successful software projects. You must gather at least a basic understanding of the concepts relevant to your customers before you can begin coding. For example, you need to know how your customer's business processes work before you can develop a warehouse management system; you need to know that users who buy cat food might also be interested in cat litter before you can implement purchase recommendations for an online shop; and you need to know that a Quillflinger is a monster that flings quills before you develop a role-playing game.

We acquire such knowledge from domain experts and capture it in some kind of domain model. In simple cases, we can scribble these models on paper. This approach works fine for small projects and when the experts help us decipher their handwriting. But it's better to have models that directly translate into a Java program. For instance, we can use Unified Modeling Language (UML) to sketch the domain models with class diagrams and use cases. UML is quite good for quickly getting to an implementation, but it is basically a language for object-oriented programming that few domain experts fully understand. And it consists of a fixed set of modeling constructs (such as classes and attributes) that are not very useful when domain experts would rather talk about specific business processes, products, and monsters.

If you want to more closely involve your experts and customers in the development process, you need more than UML. In this article, you will learn how to use Protégé, a simple yet powerful tool optimized for building domain models. Although Protégé was originally developed 15 years ago to support knowledge acquisition for rather specialized medical expert systems, it has also become very popular for many other purposes. Protégé is open source and currently has more than 7,500 registered users.

In a nutshell, you can use Protégé for the following:

  • Class modeling. Protégé provides a graphical user interface (GUI) for modeling classes (domain concepts) and their attributes and relationships.
  • Instance editing. From these classes, Protégé automatically generates interactive forms that enable you or domain experts to enter valid instances.
  • Model processing. Protégé has a library of plug-ins that help you define semantics, ask queries, and define logical behavior.
  • Model exchange. The resulting models (classes and instances) can be loaded and saved in various formats, including XML, UML, and RDF (Resource Description Framework). Protégé also provides a very scalable database back end.

From a programmer's perspective, one of Protégé's most attractive features is that it provides an open source API to plug in your own Java components and access the domain models from your application. As a result, you can develop systems very rapidly: just start with the underlying domain model, let Protégé generate the basic user interface, and then gradually write widgets and plug-ins to customize look-and-feel and behavior. You can even give Protégé to your customers and, with little training, let them build their own knowledge and requirement models.

Get started

I walk you through an example project to demonstrate how Protégé works and what else you can do with it. You can download all relevant files for this project from Resources and play with the tool while you read.

Let's assume our task is to develop a system that helps manage the articles and authors for an online magazine like JavaWorld. Articles are categorized by means of a topical index, consisting of keywords like "Swing" or "Design Patterns." Our system uses this index to propose related articles to the magazine's readers. The readers can provide feedback on the articles and rate their quality. The system uses this information to help editors decide whether submitted articles are worth publishing. This decision might depend on the ratings that previous articles by the author have received and whether articles with related topics have been recently published.

Install Protégé

Protégé is the result of various artificial intelligence (AI) and knowledge-modeling projects from the Medical Informatics group at Stanford University. The Protégé Website provides documentation, tutorials, and an extremely active discussion list. You can report problems and find a plug-in library, a collection of domain models, and the Protégé software.

Installers for all major platforms are available on Protégé's download page. To run Protégé (version 1.8), you need a Java 2 Platform, Standard Edition (J2SE) virtual machine (version 1.3 or above). You can choose to automatically install a suitable virtual machine from the Website. For this tutorial, don't forget to download the example project and extract it into a folder such as the examples folder from your Protégé installation.

When you start Protégé, the Welcome screen lets you choose to open an existing project or create a new one. Click on "Open other..." and select the Online Magazines.pprj project.

Protégé's main window consists of tabs that display the knowledge model's various aspects. You will see later that you can add additional tabs from a library or even develop your own tab components and plug them into Protégé.

Classes and slots

The most important tab when you start a project is the Classes tab, shown in Figure 1. In Protégé and many other knowledge-modeling tools, classes are named concepts from the domain that can have attributes and relations. Protégé classes are comparable to Java or UML classes, but without attached methods. Classes can be arranged in an inheritance hierarchy, which is displayed in the tree panel on the left side of the Classes tab. The properties of the class selected in the tree are displayed in the Classes tab's main area. Protégé supports multiple inheritance, and classes can be either abstract or concrete. As in Java, only concrete classes can have instances.

The example project (see Figure 1) has defined classes for various content types (e.g., Articles and Tips 'N Tricks), authors, readers, feedback, and a topic hierarchy used to categorize content.
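To make the idea of a class hierarchy with multiple inheritance and abstract classes more concrete, here is a minimal sketch in plain Java. It does not use the Protégé API; the type ModelClass and the class names Content, Reviewable, and Article are assumptions chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a knowledge-model class; NOT part of the Protégé API.
class ModelClass {
    final String name;
    final boolean isAbstract;
    final List<ModelClass> parents = new ArrayList<>();
    private int instanceCount = 0;

    // A class may name any number of parents (multiple inheritance).
    ModelClass(String name, boolean isAbstract, ModelClass... parents) {
        this.name = name;
        this.isAbstract = isAbstract;
        this.parents.addAll(Arrays.asList(parents));
    }

    // Collects the names of all direct and indirect superclasses.
    Set<String> ancestors() {
        Set<String> result = new LinkedHashSet<>();
        for (ModelClass parent : parents) {
            result.add(parent.name);
            result.addAll(parent.ancestors());
        }
        return result;
    }

    // Only concrete classes may have instances; returns a generated instance name.
    String newInstance() {
        if (isAbstract) {
            throw new IllegalStateException(name + " is abstract");
        }
        instanceCount++;
        return name + "_" + instanceCount;
    }
}
```

With this sketch, a concrete Article class could be declared with both Content and Reviewable as abstract parents, and only Article would accept calls to newInstance().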

Figure 1. Protégé's class editor.

In Protégé, classes' attributes and their relations are called slots. A slot has a name and a value type. Protégé supports the primitive value types boolean, integer, float, and string, which are handled like they are in Java. For example, you can define the class Person and assign a slot called name to it with string as the value type. Additionally, a value type called symbol can represent enumerations of string values (e.g., the 12 different month names). Apart from primitive values, slots can also refer to the model's instances and classes. You can use slots to build relationships and associations between instances, such as between articles and their author(s). Slots store either single or multiple values.
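The following plain-Java sketch illustrates how a slot with a name, a value type, and an optional list of allowed symbols might be represented. It is an assumption for illustration and not Protégé's own data structure; the names SlotValueType and SlotSketch are hypothetical, and references to instances and classes are not type-checked here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value types mirroring those described in the text.
enum SlotValueType { BOOLEAN, INTEGER, FLOAT, STRING, SYMBOL, INSTANCE, CLASS }

// Hypothetical slot description; NOT part of the Protégé API.
class SlotSketch {
    final String name;
    final SlotValueType type;
    final boolean multiple;            // true if the slot stores multiple values
    final List<String> allowedSymbols; // only meaningful for SYMBOL slots

    SlotSketch(String name, SlotValueType type, boolean multiple, String... allowedSymbols) {
        this.name = name;
        this.type = type;
        this.multiple = multiple;
        this.allowedSymbols = Arrays.asList(allowedSymbols);
    }

    // Checks whether a single value fits the slot's value type.
    boolean accepts(Object value) {
        switch (type) {
            case BOOLEAN: return value instanceof Boolean;
            case INTEGER: return value instanceof Integer;
            case FLOAT:   return value instanceof Float || value instanceof Double;
            case STRING:  return value instanceof String;
            case SYMBOL:  return value instanceof String && allowedSymbols.contains(value);
            default:      return value != null; // INSTANCE and CLASS: any non-null reference in this sketch
        }
    }
}
```

For example, a name slot of type STRING would accept "Ada" but reject the number 42, and a month slot of type SYMBOL would accept only the month names it was created with.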

To define a slot for your class, click on the C button above the list of template slots in the Classes tab. This action opens a dialog, shown in Figure 2. If you want an overview of all existing slots in your model, switch to the Slots tab.

Figure 2. Slots are attributes or relationships between classes. The authors slot stores the list of authors.

From what we've seen so far, slots are very similar to conventional object-oriented attributes and relations. However, some important details make slot definitions richer than most object-oriented concepts. A main difference is that a slot can attach to multiple classes. In our magazine project, some but not all Contents subclasses can have subtitles, so we can define a slot subTitle and simultaneously assign it to multiple classes.

Another major difference is that you can specify constraints on slot values. Constraints restrict a slot's range of allowed values. One of these constraints restricts a slot's cardinality: you can specify the minimum and maximum number of values a slot holds. This feature is similar to UML, where you can define cardinalities like [0..1] or [0..*]. Protégé also allows you to define inverse slots and default values for slots. Furthermore, you can restrict the range of numeric slots (integer and float) by minimum and maximum values. All these constraints help you build correct domain models, because Protégé can flag an instance's invalid values.
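The cardinality and numeric range constraints described above can be sketched in plain Java as follows. This is an illustrative assumption, not how Protégé implements its checks; the class NumericSlotConstraint and the rating slot used in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical constraint holder for a numeric slot; NOT part of the Protégé API.
class NumericSlotConstraint {
    final String slotName;
    final int minCardinality;
    final int maxCardinality; // a negative value means unbounded, like [0..*]
    final double minValue;
    final double maxValue;

    NumericSlotConstraint(String slotName, int minCardinality, int maxCardinality,
                          double minValue, double maxValue) {
        this.slotName = slotName;
        this.minCardinality = minCardinality;
        this.maxCardinality = maxCardinality;
        this.minValue = minValue;
        this.maxValue = maxValue;
    }

    // Returns one message per violated constraint; an empty list means the values are valid.
    List<String> violations(List<? extends Number> values) {
        List<String> problems = new ArrayList<>();
        if (values.size() < minCardinality) {
            problems.add(slotName + ": needs at least " + minCardinality + " value(s)");
        }
        if (maxCardinality >= 0 && values.size() > maxCardinality) {
            problems.add(slotName + ": allows at most " + maxCardinality + " value(s)");
        }
        for (Number value : values) {
            double d = value.doubleValue();
            if (d < minValue || d > maxValue) {
                problems.add(slotName + ": " + value + " is outside [" + minValue + ", " + maxValue + "]");
            }
        }
        return problems;
    }
}
```

A rating slot with cardinality [1..1] and a range of 1 to 5 would then report no violations for the single value 4, and one violation each for an empty list or the single value 7.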

Protégé slots are global objects (i.e., they can even exist without being assigned to a class). You can define their properties either globally or individually for each class to which the slot is assigned. For that purpose, Protégé allows you to override a slot's properties, so you can separately define value type, cardinality, and more for each class. You see the difference when you double-click on a slot in the Classes tab, where Protégé asks if you want to see the "top-level slot" or the "slot at class."
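One way to picture the difference between the "top-level slot" and the "slot at class" is a global default with per-class overrides. The plain-Java sketch below is an assumption made for illustration, not Protégé's implementation; OverridableSlot and the class names in the usage note are hypothetical, and only the maximum cardinality is modeled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical global slot with per-class overrides; NOT part of the Protégé API.
class OverridableSlot {
    final String name;
    private final int topLevelMaxCardinality;
    private final Map<String, Integer> maxCardinalityAtClass = new HashMap<>();

    OverridableSlot(String name, int topLevelMaxCardinality) {
        this.name = name;
        this.topLevelMaxCardinality = topLevelMaxCardinality;
    }

    // Overrides the maximum cardinality for one class only ("slot at class").
    void overrideAtClass(String className, int maxCardinality) {
        maxCardinalityAtClass.put(className, maxCardinality);
    }

    // Falls back to the top-level definition when the class has no override.
    int maxCardinalityFor(String className) {
        return maxCardinalityAtClass.getOrDefault(className, topLevelMaxCardinality);
    }
}
```

For instance, an authors slot could allow up to five values at the top level while being restricted to a single value at a hypothetical Tip class; every other class would keep the top-level limit.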

The slot restrictions mentioned so far ensure that the model's instances fulfill simple constraints. For more complex constraints, Protégé has a built-in language called Protégé Axiom Language (PAL). PAL is similar to the Object Constraint Language (OCL) in UML. In the example project, PAL tells Protégé that no online magazine reader can review the same article more than once. Although PAL may look unusual at first glance, it is actually very powerful. Besides PAL, Protégé has some extensions like the JessTab (see below) that also express constraints and other kinds of "meaning."
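To show what the "no reader reviews the same article twice" rule means in programming terms, here is an equivalent check written in plain Java rather than PAL. PAL syntax is not shown; the classes ReviewRecord and DuplicateReviewCheck are hypothetical, and readers and articles are identified by simple strings for brevity.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical feedback record; NOT part of the example project or the Protégé API.
class ReviewRecord {
    final String reader;
    final String article;

    ReviewRecord(String reader, String article) {
        this.reader = reader;
        this.article = article;
    }
}

class DuplicateReviewCheck {
    // Returns every review that repeats an earlier (reader, article) pair.
    static List<ReviewRecord> duplicates(List<ReviewRecord> reviews) {
        Set<String> seen = new HashSet<>();
        List<ReviewRecord> duplicates = new ArrayList<>();
        for (ReviewRecord review : reviews) {
            // The NUL separator keeps "ab"+"c" distinct from "a"+"bc".
            String key = review.reader + "\u0000" + review.article;
            if (!seen.add(key)) {
                duplicates.add(review);
            }
        }
        return duplicates;
    }
}
```

A declarative constraint language states such a rule once, as part of the model, instead of requiring a loop like this in every application that uses the data.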

Instances and forms

Now that the classes that describe our domain's concepts and their restrictions have been defined, you can use Protégé's Instances tab to define these classes' instances. As in Java, instances are specific entities of a given class, such as a specific Article. Protégé helps tremendously in defining instances, because it automatically generates graphical forms that contain text fields, radio buttons, check boxes, combo boxes, lists, and other widgets to make editing as convenient as possible. Figure 3 shows a sample form.

Figure 3. Protégé's instances editor. Forms like the one on the right are automatically generated from the class definition.

Using these forms, you or your domain expert can enter instances as soon as an initial draft of classes becomes available. There's no turnaround time between changing a class and getting the corresponding forms. A change to a class automatically rearranges its forms. Therefore, Protégé is excellent for rapid prototyping. You can try different variations of your class hierarchy before you start coding.
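The principle behind generated forms can be sketched as a simple mapping from a slot's value type and cardinality to a kind of widget. The mapping below is an assumption for illustration and may differ from the defaults Protégé actually uses; FormWidgetChooser and the widget labels are hypothetical.

```java
// Hypothetical default-widget chooser; NOT part of the Protégé API.
class FormWidgetChooser {
    // Picks a widget label from the slot's value type and whether it holds multiple values.
    static String widgetFor(String valueType, boolean multiple) {
        if (multiple) {
            return "list"; // any multi-valued slot is edited in a list in this sketch
        }
        switch (valueType) {
            case "boolean":
                return "check box";
            case "symbol":
                return "combo box";
            case "integer":
            case "float":
            case "string":
                return "text field";
            default:
                return "instance field"; // references to instances or classes
        }
    }
}
```

Because the form is derived from the class definition each time, adding a slot or changing its value type is enough to change the form, which is why no separate GUI coding step is needed during prototyping.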

Of course, the automatically generated forms are not always perfect, but you can change the types of widgets and their layout with little effort. Go to the Forms tab (shown in Figure 4) and select the class whose form you wish to edit. Here, you drag and resize the widgets on the screen as you would in a Java IDE's GUI builder. For example, you can specify that an author's about slot should be edited in a TextAreaWidget instead of a TextFieldWidget.

Figure 4. With Protégé, you can easily adapt the forms that edit instances by dragging widgets.

Projects and ontologies

Protégé projects are sometimes called ontologies. This is an AI term similar to the notion of domain models in software development. An ontology is a collection of domain concepts and their relationships. Ontologies come in various forms: they can be represented in Protégé, but also as Java or UML classes. Even the Yahoo! Website directory can be considered an ontology. Yahoo! defines concepts like "Health" and "Science" and allows us to browse Websites that match these categories or related topics.

The notion of ontologies also plays a central role in the emerging Semantic Web. The Semantic Web consists of Internet sites that provide information in a format that is "meaningful" to machines. For example, online shops could provide information on their special offers so that your intelligent shopping software can find the best bargains while you sleep. Ontologies store the concepts used to describe this information. For an online shop, the ontology could contain classes like Product with slots like price. Protégé can edit such ontologies and save them in various formats.

Plug-ins

The Protégé system's core is a flexible platform into which additional modules can be plugged as needed. This mechanism ensures that you can adapt the system to your specific needs. Many of these modules, called plug-ins, were developed by members of the Protégé user community, and some were developed directly at Stanford. Most existing plug-ins are available from the Protégé Website, and you can build your own extensions with little effort.

Protégé currently supports three types of plug-ins: storage plug-ins, slot widgets, and tabs. Let's look at each in turn.

Storage plug-ins

A storage plug-in is a nonvisual module that saves and loads models in a certain file or database format. Protégé currently supports the following file storage formats:

  • CLIPS (C Language Integrated Production System), Protégé's standard format
  • XML
  • XML Schema
  • RDF
  • OIL (Ontology Inference Layer)
  • DAML+OIL (DARPA Agent Markup Language plus OIL; DARPA is the Defense Advanced Research Projects Agency)
  • UML
  • XMI (XML Metadata Interchange) for MOF (Meta-Object Facility) metamodels