OSGi at the UK's biggest science lab

Developers at Diamond Light Source set out to migrate a mission-critical, Java-based acquisition system to dynamic class loading. Here’s what they learned.

diamond light source synchrotron main chamber
Credit: Matt Brown

As a Java developer, you undoubtedly know about the goodness of OSGi and breaking up your class loading into modules. After all, OSGi is the dynamic module system, right? No big deal. You might have played around with declarative services, or perhaps you are waiting for Jigsaw? Java these days is a very mature technology stack, and even though the barrier to adopting OSGi is low (and I mean really low), plenty of products have yet to migrate to dynamic class loading. This is especially true if the product is large, mature, and not slated for a major refactor. At three million lines of Java server and thick-client code, our product at Diamond Light Source fits exactly that description. Nonetheless, we recently moved our code base to OSGi. In this article I'll explain why we made the change and describe seven real-world challenges we encountered and how we resolved them.

Java technology, applied to science

Diamond Light’s synchrotron works like a giant microscope, harnessing the power of electrons to produce bright light that scientists can use to study anything from fossils to jet engines to viruses and vaccines. The United Kingdom’s largest science project and one of the world’s most advanced facilities, the synchrotron is used by over 10,000 scientists running experiments.

To produce the high-energy light that scientists need to conduct their research, engineers at Diamond Light accelerate electrons, then steer them using magnetic fields. The light emerges from the circular machine and, where it exits, passes through an optics hutch; it is the starting point for a huge range of experimental techniques. Individual experiments are run using our Java-based acquisition system.

The Diamond Light Source synchrotron

Figure 1. Beamlines radiating from the Diamond Light Source synchrotron. Image credit: Diamond Light Source.

The linear experimental parts of the facility, radiating from the circular synchrotron, are called beamlines. Currently, 33 beamlines are in operation, under construction, or being designed. All of them have, or will require, a Java server and a client able to coordinate experiments and serve as an interface for visiting scientists to control the synchrotron. The software must be able to move motors remotely (there are usually X-rays in the experimental hutch of the beamline); trigger detectors (imagine something like a digital camera); and write large binary files of data, often at high rates. Some beamlines require detectors able to write many megabytes of data at a kilohertz rate.

Modernizing a legacy system

Part of Diamond Light’s Java software stack was inherited from the Synchrotron Radiation Source (SRS), an earlier synchrotron at Daresbury. The SRS was closed in 2008, but some of its software lived on in Diamond Light’s acquisition and analysis systems. This has been very useful because, as we know, algorithms never die (although they may mutate). While some of the major features and ideas for working with the software came to us from the SRS, a developer today might choose to do some things differently. For instance, the legacy system’s client and server used CORBA to communicate; the server had a large classpath with many interconnected dependencies; and the client was a thick client based on Swing. The client did have a neat contributed design, however, which allowed custom experimental parts to be mixed with general-purpose ones; that was a capability we didn’t want to lose.

For our first foray into OSGi, we chose to migrate and rewrite part of the client. We moved from Swing to SWT/JFace using the Rich Client Platform (RCP), which is available from the good people at the Eclipse Foundation. The move led us to adopt the Equinox classloader for our client. Dynamic class loading was not our reason for migrating the client; it was something that came with the new platform. We used it first not to make the server modular, but to make the client start faster. It worked well for that purpose, so there was no real reason to modify the server architecture. For the next five years, we didn’t.

The creeping cost of support

So what changed? Well, like a lot of real-world projects, the proportion of maintenance work developers were doing started going up, which was especially costly when compared with time spent writing the software that new beamlines needed. In many cases, maintenance became almost all a developer was doing. These days we run a variation on a DevOps shop, so developers are usually involved in supporting systems as part of their work. This is the correct approach for us, but if developers aren’t also innovating and creating new software, we know that something has gone wrong.

A lot of what we do at Diamond Light requires creative input from developers to get the new science available to our users. But over time, we built up technical debt. Some signs of our technical debt included:

  • Using different APIs that do the same or similar things
  • Making overly interconnected projects and classes
  • Improper encapsulation of functionality
  • Rarely writing adequate unit tests

Another thing we did was run from source. Yes, you read that correctly: we manually pulled the software from its repository and built it specially for each experiment, leaving the source code and compiled bytecode as a bespoke version for each given beamline. We had all the usual integration tools (an automated build for each beamline in Jenkins, JUnit tests, Squish for the user interface, and so on), but ultimately a developer was pulling a custom product out of the repositories, changing certain files by hand, and leaving that version for the next run of the machine.

The system wasn’t efficient or reproducible, which meant it had to change. After reading online articles, learning at industry conferences, and taking input from new colleagues, we came up with a plan to move our server to something closer to industry standards. The path, however, was not entirely smooth.

Real world problem #1: Integration

The first thing we decided to do was make a single server product for data acquisition: one that could be used on any beamline, with a binary created from a reproducible build. OSGi was a perfect fit for this project: bundles are loaded dynamically, after all, and one of the main reasons for dynamic classloading is that the binary product can grow beyond what is held in memory. Using OSGi meant that beamline-specific bundles (for example, those dealing with certain detectors, or specific libraries for decoding streams) could be built into the single product. Only if they were used in an experiment would they be class-loaded and take up space in the virtual machine (VM).

So far so good, but we had lost one strong advantage of the original “running from source” approach to code integration: it had allowed us to change and debug beamline code on the fly. We needed this capability because our developers are often required to deliver complex and variable requirements at the last minute, such as integrating a laser into the data acquisition timing. Fortunately, it’s possible to insert code into a running OSGi VM using various tools: we looked seriously at HotswapAgent and JRebel. After some deliberation, we chose JRebel because it integrated easily with our server. Our OSGi-based system requires that developers commit and build/test code into the single product before it is left as a product on a beamline, but JRebel gives us the flexibility to develop code (temporarily) on the live system.

Real world problem #2: Multiple configurations

We were already using Spring as an instantiation layer. For us it wires together a beamline’s configuration, building things like connections to motors, detectors, and online data analysis. We chose to keep our Spring configurations unchanged and run them with the single server, so all the existing classes in disparate Java projects had to work. When the server starts, the OSGi container loads the main classes, after which the Spring configuration is run. In some cases, Spring could require a class that the OSGi container had yet to load. Solutions like Blueprint with Apache Aries and Spring DM are often well suited to such scenarios. In our case, because we’re using Equinox, we decided to use Eclipse’s buddy class loading, which is less elegant but works. Two things were essential about the configuration:

  • When making the bundle with the Spring JAR files in it (for instance, one called org.acme.spring.bundle), the manifest should contain the header Eclipse-BuddyPolicy: registered.

  • The bundle containing the class to be loaded by Spring must also contain the header Eclipse-RegisterBuddy: org.acme.spring.bundle in its manifest file. This allows the Spring bundle to look the class up.
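Putting the two headers together, the manifest of the Spring bundle might look like this (bundle names are illustrative):

```
Bundle-SymbolicName: org.acme.spring.bundle
Bundle-Version: 1.0.0
Eclipse-BuddyPolicy: registered
```

while a bundle whose classes Spring must instantiate registers itself as a buddy:

```
Bundle-SymbolicName: org.acme.beamline.devices
Bundle-Version: 1.0.0
Eclipse-RegisterBuddy: org.acme.spring.bundle
```

At class-load time, Equinox then lets org.acme.spring.bundle fall back to the classloaders of its registered buddies when its own wiring cannot resolve a class.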

This approach is Equinox-specific rather than a standard OSGi feature. However, because it is only a manifest entry, it should be inexpensive to change our Spring and OSGi integration to something more standard later.

Real world problem #3: Migrating to bundles

At their best, OSGi bundles are another layer of encapsulation above the class level with which all developers are familiar. From where we were, though, moving to a culture of bundles with minimal and well-understood dependencies was a different matter. Developers were used to certain areas of the code that “glue together” the product and depend on many things; we called this the core [cue kettledrums]. Later I will discuss how we use declarative services, but one way we’ve been able to make services work is to define commonly used interfaces and beans in no-dependency bundles. (And in this case, that no definitely means no.) We then use declarative services to provide the implementation without adding a dependency, so a core is not required. Instead, we have bundles that use and do things, and bundles that provide those things.
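The shape of the idea can be sketched in plain Java. In the real system the wiring is done by OSGi Declarative Services (via component annotations or XML); here a minimal hand-rolled registry stands in for the OSGi service registry, and all of the names (IFilePathService and so on) are illustrative rather than taken from our code base:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ServiceSketch {

    // Lives in the no-dependency API bundle: interfaces and beans only.
    public interface IFilePathService {
        String nextScanFilePath(String beamline, int scanNumber);
    }

    // Lives in an implementation bundle; the API bundle knows nothing about it.
    public static class NexusFilePathService implements IFilePathService {
        @Override
        public String nextScanFilePath(String beamline, int scanNumber) {
            return "/dls/" + beamline + "/data/scan-" + scanNumber + ".nxs";
        }
    }

    // Stand-in for the OSGi service registry that Declarative Services populates.
    private static final Map<Class<?>, Object> REGISTRY = new ConcurrentHashMap<>();

    public static <T> void register(Class<T> api, T impl) {
        REGISTRY.put(api, impl);
    }

    public static <T> T getService(Class<T> api) {
        return api.cast(REGISTRY.get(api));
    }

    public static void main(String[] args) {
        // DS would do this registration for us when the impl bundle activates.
        register(IFilePathService.class, new NexusFilePathService());

        // Client bundles compile only against the API bundle.
        IFilePathService paths = getService(IFilePathService.class);
        System.out.println(paths.nextScanFilePath("i13", 42));
    }
}
```

The point of the pattern is that every consumer and every provider depends only on the tiny API bundle, so no “core” bundle with fan-out dependencies is needed.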

Getting developers to work in this way requires a culture change. While the shift is ongoing, many have embraced the idea. We chose not to remove the core bundles or refactor them directly in one go. Rather than having n-sided developer battles, we decided to move to the right design in new work: refactoring can be done later, once ideas spread organically through training and sticking to good practice in new bundles. Our problem with core bundles does not have to be solved right away, but we are chipping away at it using the no-dependency bundles.

Real world problem #4: The static, non-modular algorithm

Today our server has around a hundred OSGi declarative services for things like loading files, getting interfaces to hardware, writing data to a fast distributed file system (we use GPFS and Lustre), talking to FPGA-based devices via a description language, sending text messages to a port on a custom Linux device that controls a detector, and much more besides! In fact, depending on the experiment, the various bridging bundles and device libraries can easily outnumber the scanning algorithm itself.

The scan algorithm is the heart of the data acquisition system. It is one of the parts that brings together separate concepts like devices and file writing and runs them together in order to collect useful data for the user. On the face of it, there wasn’t much wrong with our existing scanning system. Having been honed by several generations of developers (using the standard developer lifetime of seven years), it was pretty fast, robust, and had a useful Jython layer with which to extend it.

The scan did have a problem, though: it could not deal with a new file-writing design, which was introduced in a separate project and had to be integrated into our software. The scan stored data statically and was written in a non-modular way, which made it expensive to adapt. We decided to solve the file-writing requirement and at the same time migrate scanning to OSGi. So scanning was one of the few parts of the system (and the most important part) that we did choose to rewrite. The final algorithm spans a few thousand lines of code, and its bundle is shown expanded below.

The OSGi bundle for Diamond Light's scan algorithm

Figure 2. The OSGi bundle for the scan algorithm. Image credit: Matthew Gerring.

The main algorithm of the scan is an iteration over n-dimensional motor positions running objects that manage fork/join pools. I’ve printed part of it here, and it’s also available under an open source license on GitHub.

Listing 1. The main algorithm for Diamond Light's scanning service

for (IPosition pos : moderator.getOuterIterable()) {

    // Check if we are paused; blocks until we are not
    boolean continueRunning = checkPaused();
    if (!continueRunning) return; // finally block performed

    // Run to the position
    manager.invoke(PointStart.class, pos);
    positioner.setPosition(pos);           // moveTo in GDA8
    writers.await();                       // Wait for the previous write-out to return, if any
    nexusScanFileManager.flushNexusFile(); // Flush the NeXus file
    runners.run(pos);                      // GDA8: collectData() / GDA9: run() for Malcolm
    writers.run(pos, false);               // Do not block on the readout; move to the next position immediately

    // Send an event about where we are in the scan
    manager.invoke(PointEnd.class, pos);
    positionComplete(pos, count, size);
}