Objects versus documents for server-client interaction, Part 1

Comparing two ways that software can interact with software

These days the dominant way that servers interact with people across the network is by sending HTML documents to Web browsers. Recently, XML has generated a lot of excitement among developers as an alternative document format that offers many advantages over HTML. Like HTML, XML enables servers to interact with people across the network via their Web browsers. But unlike HTML, XML also enables servers to easily interact with client software that has no user present.

In the Jini universe, in contrast to the document approach of both HTML and XML, servers interact with client programs by sending objects across the network. Like XML, Jini enables servers to interact with client programs regardless of whether a user is present at the client.

In this three-part series, I will compare and contrast two fundamental ways that servers can interact with clients: using documents and using objects. In this article, the first of three parts, I'll look primarily at how objects and documents compare when servers interact with client programs that have no user present.

Creating a Java news page

I recently wrote a Python script to generate a Java news page for my Website, Artima.com. I planned to get the news items from Moreover.com, which offered a free news feed devoted to Java. As a Webmaster, I had several options, all of which involved servers sending documents to clients.

Perhaps my most straightforward option was to insert a large, hairy chunk of JavaScript code, kindly provided by Moreover.com, into my page. Whenever a user visited my Java news page, the embedded JavaScript would land in his or her browser, contact Moreover.com, grab the most recent Java news data, and construct the news page on the fly. I discarded this option partly because I have found JavaScript to be unreliable (as a result, my site contains no JavaScript), but primarily because I didn't want the user to have to wait for the JavaScript to make a socket connection to Moreover.com in order to grab the data. One of my main goals for Artima.com is to have pages that load quickly, and every socket connection takes time.

Another option was to use a script that ran on the server. In that approach, the URL of my news page would actually refer to a script. When a user hit the URL, the Web server would run the script. The script would contact Moreover.com and obtain the news information in the same way the JavaScript would. Again, I discarded this option because I didn't want the client to have to wait for that socket connection to Moreover.com.

Ultimately, I decided to write a script that contacted Moreover.com, grabbed the most recent Java news data, generated my Java news page, and saved the page in a file. I planned to set up a cron job that automatically ran the script every hour, so that the file would be refreshed regularly. In this approach, the user wouldn't have to wait for a socket connection, because it would be made behind the scenes once every hour. Given that Moreover.com seemed to be updating the contents of its Java news feed at most once or twice a day, I decided that an hourly poll would yield a sufficiently fresh page for my Website.
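The generate-and-save step can be sketched roughly as follows. This is a minimal illustration, not the actual Artima.com script: the function names `render_page` and `write_atomically`, the page layout, and the cron line in the comment are all assumptions. Writing to a temporary file and renaming it over the target means a visitor never sees a half-written page while the hourly job runs.

```python
import html
import os
import tempfile

# Hypothetical crontab entry that would run such a script at the top of each hour:
#   0 * * * * /usr/bin/python3 /path/to/make_news_page.py


def render_page(items):
    """Build a very plain HTML page from (headline, url) pairs."""
    rows = [
        '<li><a href="{}">{}</a></li>'.format(
            html.escape(url, quote=True), html.escape(headline)
        )
        for headline, url in items
    ]
    return "<html><body><ul>\n" + "\n".join(rows) + "\n</ul></body></html>\n"


def write_atomically(path, text):
    """Write text to a temporary file, then rename it over the target file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace swaps the new file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The Web server then simply serves the saved file, so no visitor request triggers a connection to the news provider.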

Deciding upon a data format

Moreover.com offers its news feeds in several data formats, each available at a different URL. Thus, I next had to decide which data format my script should use for processing.

One data format that I did not choose, but which I'd like to mention here, is HTML. Among other data formats, Moreover.com offers an HTML Webpage full of the latest Java news. The trouble with this approach, of course, is that HTML pages are intended to be consumed by people, not programs. Although the information my Python script needs is contained in the HTML page, the page's markup tags make it difficult for programs like my script to extract that information. HTML markup focuses instead on enabling a Web browser to render the information buried in the markup onto a screen, so that a human user can gaze upon the screen and pull the information into his or her brain.

In HTML, information intermingles freely with directions on presenting that information. For example, here's a snippet of HTML code from the HTML news page at Moreover.com:

<TR BGCOLOR="#ffffff"><TD><FONT FACE="Arial, Helvetica, sans-serif">
<A HREF=http://c.moreover.com/click/here.pl?j6547539 TARGET="_blank"><FONT SIZE="-1" COLOR="#333333"
><B>Java, XML to survive Sun/Microsoft war...</B></FONT></A><BR>
<A HREF=http://www.vnunet.com/ TARGET=_blank>
<FONT SIZE="-2" COLOR="#ff6600">vnunet.com</FONT></A>
<FONT SIZE="-2" COLOR="#ff6600">  Wed Apr 12 09:34:25 GMT-0700 (Pacific Daylight Time) 2000</FONT>
</TD></TR><TR BGCOLOR="#ffffff"><TD BGCOLOR="#ffffff" HEIGHT="5"></TD></TR>

Aside from the trouble of parsing out the information from all this HTML markup, a far more insidious problem exists with the parsed-HTML approach. Given that HTML pages are intended to be rendered by browsers and read by people, Webmasters have no qualms about changing their pages in ways that browsers and people can deal with, but programs cannot. So even if I decided to parse the information out of the HTML, chances are good that eventually Moreover.com's Webmaster would make a change to its Webpages' structure that would break my script.

Looking at XML

The document-style format that looked most promising to me was Moreover.com's XML feed. XML was designed to enable just the kind of software parsing I wanted to do in my Python script. In an XML document, in contrast to one in HTML, information and presentation are cleanly separated. The information contained in the document is marked up in tags that, rather than describing how the information should be presented, hint at the semantic meaning of the information. For example, here's a snippet of XML code from the XML feed at Moreover.com:

      <article id="_6547546">
         <url>http://c.moreover.com/click/here.pl?x6547539</url>
         <headline_text>Java, XML to survive Sun/Microsoft war</headline_text>
         <source>vnunet.com</source>
         <media_type>text</media_type>
         <cluster>Java news</cluster>
         <tagline> </tagline>
         <document_url>http://www.vnunet.com/</document_url>
         <harvest_time>Apr 12 2000  4:34PM</harvest_time>
         <access_registration> </access_registration>
         <access_status> </access_status>
      </article>

Directions on how to present the information contained in the XML document's semantic tags can be defined separately, using a style markup language such as CSS or XSL. In the Moreover.com case, the XML document is intended to be consumed only by programs, not by people, so no style markup is provided. Nevertheless, the primary reason my Python script could parse the XML feed more easily than the HTML feed is that XML is designed to avoid HTML's intermingling of information and presentation.
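To illustrate how the semantic tags help a program, here is a minimal sketch that turns an abbreviated copy of the article above into a list of dictionaries. It uses `xml.etree.ElementTree` from the Python standard library; the function name `parse_articles` and the trimmed sample document are assumptions made for this example, and the `moreovernews` wrapper element is taken from Moreover.com's DTD.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample: only some of the article's child elements are included.
SAMPLE = """<moreovernews>
  <article id="_6547546">
    <url>http://c.moreover.com/click/here.pl?x6547539</url>
    <headline_text>Java, XML to survive Sun/Microsoft war</headline_text>
    <source>vnunet.com</source>
    <document_url>http://www.vnunet.com/</document_url>
    <harvest_time>Apr 12 2000  4:34PM</harvest_time>
  </article>
</moreovernews>"""


def parse_articles(xml_text):
    """Return one dict per <article>, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    articles = []
    for article in root.iter("article"):
        # Each child's tag names the piece of information it carries.
        record = {child.tag: (child.text or "").strip() for child in article}
        record["id"] = article.get("id")
        articles.append(record)
    return articles


for item in parse_articles(SAMPLE):
    print(item["headline_text"], "-", item["source"])
```

Because the tag names describe the data rather than its appearance, the code never has to guess which font tag happens to wrap the headline.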

Settling on tab-separated values

I liked the XML approach, but unfortunately I was unable to figure out quickly enough how to work with XML in Python. All I wanted to do was pass a chunk of XML to some library routine, get back a nice data structure corresponding to the XML document, and use it to effortlessly write out the news page. I was (and still am) on the Python learning curve, and as I was rooting around in the Python documentation looking for my desired library routine, I noticed that Moreover.com also offered a tab-separated value (TSV) feed. At that point I paused and said to myself, "Self, if you just use this TSV feed, then you can get this job done right now." For reasons of speed, therefore, I abandoned my search for the elusive XML-to-data-structure Python library routine and completed my script using the TSV feed.

Here's one line from the TSV feed at Moreover.com. (The single line is split into three lines here, each continued with a trailing \, and tabs are shown as \t; neither substitution appears in the actual feed.)

http://c.moreover.com/click/here.pl?t6547539\t\
Java, XML to survive Sun/Microsoft war\tvnunet.com\ttext\t\
Java news\t \thttp://www.vnunet.com/\tApr 12 2000  4:34PM\t \t 
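Parsing such a line takes only a split on the tab character. The sketch below is an illustration rather than the actual script: the name `parse_tsv_line` is hypothetical, and the column names are an assumption borrowed from the element names in Moreover.com's XML feed, on the observation that the sample line's ten values appear in the same order.

```python
# Column names assumed to follow the order of the XML feed's elements.
FIELDS = [
    "url", "headline_text", "source", "media_type", "cluster",
    "tagline", "document_url", "harvest_time",
    "access_registration", "access_status",
]


def parse_tsv_line(line):
    """Split one feed line on tabs and label each value by position."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(FIELDS), len(values))
        )
    # Empty columns arrive as a single space, so strip each value.
    return {name: value.strip() for name, value in zip(FIELDS, values)}


SAMPLE_LINE = (
    "http://c.moreover.com/click/here.pl?t6547539\t"
    "Java, XML to survive Sun/Microsoft war\tvnunet.com\ttext\t"
    "Java news\t \thttp://www.vnunet.com/\tApr 12 2000  4:34PM\t \t \n"
)

article = parse_tsv_line(SAMPLE_LINE)
print(article["headline_text"], "-", article["source"])
```

Note that the meaning of each column lives only in the reader's head (or in `FIELDS`); the TSV document itself carries no names, which is the main thing the XML version adds.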

XML, data models, and DTDs

The structure and tag names in Moreover.com's XML feed form a "data model" of a news feed. Moreover.com thought about what it meant to be a news feed. It identified and gave a name to each piece of information, gave each item the name "article," and decided that its XML document would be an ordered list of articles. (The TSV version also represents a minimalist expression of the same conceptual data model.)

XML lets you express your data model in a Document Type Definition (DTD). In fact, Moreover.com provides the DTD for its XML news-feed documents. The DTD looks like this:

<!ELEMENT moreovernews (article*)>
<!ELEMENT article (url,headline_text,source,media_type,cluster,tagline,document_url,
harvest_time,access_registration,access_status)>
<!ATTLIST article id ID #IMPLIED>
<!ELEMENT url (#PCDATA)>
<!ELEMENT headline_text (#PCDATA)>
<!ELEMENT source (#PCDATA)>
<!ELEMENT media_type (#PCDATA)>
<!ELEMENT cluster (#PCDATA)>
<!ELEMENT tagline (#PCDATA)>
<!ELEMENT document_url (#PCDATA)>
<!ELEMENT harvest_time (#PCDATA)>
<!ELEMENT access_registration (#PCDATA)>
<!ELEMENT access_status (#PCDATA)>

I won't go into the details of the DTD syntax, but basically, Moreover.com's DTD says that each of its news-feed documents (named "moreovernews") is composed of a sequence of zero or more "articles." Each article is composed of several pieces of information, including a "url," a "headline_text," and so on. In short, an XML DTD is a written definition of the abstract data model to which an XML document adheres.
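To make the idea of "adhering to a data model" concrete, the following sketch hand-codes the structure the DTD describes and checks a document against it. This is not DTD validation (Python's standard library parser does not validate against DTDs); the function name `conforms_to_model` is hypothetical, and the check ignores the `id` attribute.

```python
import xml.etree.ElementTree as ET

# The child elements of <article>, in the order the DTD lists them.
ARTICLE_CHILDREN = [
    "url", "headline_text", "source", "media_type", "cluster",
    "tagline", "document_url", "harvest_time",
    "access_registration", "access_status",
]


def conforms_to_model(xml_text):
    """Check the document's shape against the hand-coded data model."""
    root = ET.fromstring(xml_text)
    # <!ELEMENT moreovernews (article*)>: zero or more articles, nothing else.
    if root.tag != "moreovernews":
        return False
    for article in root:
        if article.tag != "article":
            return False
        # Every child must be present, in exactly the declared order.
        if [child.tag for child in article] != ARTICLE_CHILDREN:
            return False
        # (#PCDATA) children hold text only, so they have no sub-elements.
        if any(len(child) for child in article):
            return False
    return True
```

A validating XML parser performs this kind of check automatically from the DTD itself, which is the practical benefit of writing the data model down in a machine-readable form.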

Data models and network protocols

Lurking behind all the communication approaches between Moreover.com and Artima.com is an important assumption: that the client will fetch the document via HTTP's GET command. In fact, perhaps a better way to look at Moreover.com's document formats is as parts of several high-level protocols that define the interaction between Moreover.com's clients (such as Artima.com) and its server. The combination of a news category URL, the low-level HTTP GET protocol, and Moreover.com's XML DTD, for example, forms a high-level network protocol, which can be summarized as follows:

A fetch protocol

  1. The client opens a socket to the Web server at Moreover.com.
  2. The client requests the most recent news of a particular category using the HTTP GET command over the socket, passing the document name given in the news category URL.
  3. The Web server sends an XML document containing the news in a format defined by XML and the Moreover.com DTD.
  4. The client closes the socket. (In HTTP 1.1, the socket can be kept open so that it can be reused to grab other pieces of a Webpage, such as images, referenced from the original document. In this case, there are no other pieces to grab.)
  5. The client uses the XML DTD to parse and interpret the document.
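The five steps above can be sketched with Python's standard `http.client` and `xml.etree.ElementTree` modules. This is an illustrative outline under stated assumptions: the function name `fetch_news` is hypothetical, the host and path must be supplied by the caller (no actual feed URL is assumed here), and the parser used does not consult the DTD, so step 5 relies on the tag names alone.

```python
import http.client
import xml.etree.ElementTree as ET


def fetch_news(host, path, port=80, timeout=10):
    """Run the fetch protocol once and return a list of article dicts."""
    # Step 1: prepare the connection; the socket to the Web server is
    # opened when the request is sent.
    connection = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # Step 2: HTTP GET for the document named in the news category URL.
        connection.request("GET", path)
        # Step 3: the server responds with the XML document.
        response = connection.getresponse()
        if response.status != 200:
            raise RuntimeError("unexpected HTTP status %d" % response.status)
        body = response.read()
    finally:
        # Step 4: the client closes the socket.
        connection.close()
    # Step 5: interpret the document according to the agreed data model.
    root = ET.fromstring(body)
    return [
        {child.tag: (child.text or "").strip() for child in article}
        for article in root.iter("article")
    ]
```

Everything the client needs to know beyond HTTP itself, namely which URL to request and what the element names mean, comes from the agreement between the two parties, which is why the document format is best viewed as part of the protocol.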

An alternative protocol

The Python script currently executing at Artima.com plays the client role in a protocol that corresponds closely to the Fetch protocol. The difference is that my Python script fetches a TSV, not an XML, document. The TSV format does not come with an official DTD, but conceptually its structure corresponds to the same data model described by the XML DTD.

Now, although my Java news page seems to be working fine, the truth is that I'd prefer Moreover.com to notify my Website whenever it changes the contents of its Java news feed. That way I would need to rewrite my Java news page only when its contents actually change. And since I would be notified of changes rather than polling hourly, my news page would be updated more promptly whenever new news appeared.

If Moreover.com is ever to offer such a notification-based approach, it will have to define a protocol that implements one. Given that the server will be "pushing" a notification down to the client, rather than relying on the client to "pull" the latest news from the server, the client will probably have to have some kind of server running. For Moreover.com to know where those client-side servers are, and what categories of news each client-side server wants, a protocol that lets clients subscribe to the notification service will be necessary (in addition to the notification protocol itself). Here are outlines of a subscription protocol and a notification protocol, in which I call the client-side server a "listening" server:

A subscription protocol
