Full-stack software for cutting-edge science

Find out how open source Java- and Python-based tools power the UK's national synchrotron


Have you ever wondered about the software used in big science facilities? A professional science facility brings together incredibly sophisticated machinery with equally complex software, which is used to do things like drive motors, control robots, and position and run experimental detectors. We also use software to process and store the terabytes of data created by daily science experiments.

In this article you'll get an inside look at the software infrastructure used at Diamond Light Source (Diamond), the UK's national synchrotron. I'll take you through the process of setting up a new experiment, running it, and storing the data for analysis; then I'll introduce each component of Diamond's Java- and Python-based stack. I think you'll find it interesting, and you may learn about technologies that could be useful to your business. I am excited to show you how science labs are using familiar technologies in new ways.

Laboratory science isn't what it used to be

It's likely that you've heard of at least one large, famous science facility investigating cutting-edge physics or nuclear fusion--facilities like Fermilab, CERN, NIF, or ITER. Unless you're a real science geek, it's less likely that you know about synchrotrons and neutron sources. These are facilities designed to discover thousands of smaller but extremely useful facts each year, like the structure of proteins used in medicine, or of the fan blades used in a jet engine.

Diamond Light Source is one such facility; there are many others throughout the world.

Neutron sources and synchrotrons are usually large facilities with a huge physical footprint. They cost many hundreds of millions of dollars to construct. Just a single detector might require more than a million dollars to purchase, install, and maintain, and there are typically scores of detectors in a major facility. If you live near a synchrotron, try to visit on an open day. You will understand the scale of what I'm describing when you see the electron gun, the storage ring, and possibly a detector or two.

Diamond Light Source

Figure 1. Electron storage ring at Diamond Light Source

Big data in modern science

In the early days of x-ray and neutron research, chemical reactions captured the images that scientists used to understand their samples. It was common for experimenters to use photographic plates. One example is Rosalind Franklin's work; she was responsible for the iconic diffraction image Photo 51, taken from a sample of DNA in 1952. Now, experimentalists work on similar problems using electronic detectors that are capable of counting photons directly and storing data to disk.

In Franklin's day, data was relatively minimal and stored physically. These days, experimental data is produced in terabytes and stored digitally. In a synchrotron, the process starts with the machines creating the high-energy light (or more properly electromagnetic radiation) required for experimentation. The latest generation of synchrotron facilities can produce a high flux of photons, which means more photons in each square millimeter of beam. As the machines evolve, they are capable of producing more and more data. The sheer volume of data enables a wider range of experiments and new experimental possibilities. That data also requires increasingly sophisticated software to process and evaluate.

Detectors can capture more data, faster, than ever before. At Diamond, we use Pilatus and Eiger detectors (first developed at the Swiss Light Source), which can record multi-megapixel images at rates of hundreds of frames per second. PERCIVAL is another type of detector being developed for use in science facilities.

During a run, an experimentation machine or system is on duty around the clock, every day except one reserved for maintenance; at Diamond this is normally a Thursday. So we are talking about petascale data (2^50 bytes), roughly a thousand times smaller than the famous exascale. Compared to some experimental physics, however, the data is rich in content. Modern science experiments usually require that all of the raw data produced is stored.

In summary, science data today is created at massively high volumes, and that is increasing. Storage requirements are also large, growing, and potentially long-term. A software stack for cutting-edge science must be able to process and store massive volumes of data at rapidly (almost exponentially) increasing scale.

Working with machines

At synchrotrons, we harness the power of electrons to produce super bright light (10 billion times higher flux than the sun) which is channeled into laboratories known as beamlines. Scientists use the different varieties of light in the beamlines to study anything from fossils, jet engines, and viruses to new vaccines.

The machine circumference is more than half a kilometer, so we have to move samples through the beam rather than trying to move the beam around samples. In addition, a researcher cannot stand in the experimental hutch and move the sample by hand; doing so would be far less accurate and efficient than an automated system. More importantly, the light consists of high-energy x-rays, which are extremely hazardous to health.

We use motor-controlled stages and accurate rotating devices called goniometers to move samples. Robotic arms fetch the samples from storage devices called dewars and carousels and place them in the beam.

Diamond Light Source

Figure 2. An example of beamline equipment

In my previous JavaWorld feature I talked about how Diamond's software team migrated our legacy Java server to OSGi. I explained some of the technical challenges involved in the migration, and also how our team adapted to meet those challenges. While I discussed a few technologies in detail, I didn't introduce our full software stack.

A massive-scale science facility depends on many coordinated components. Once a proposal is accepted, the science team submits samples using a web interface. During the experiment, we use software to run the detector and correctly expose it in coordination with robots and motors. When the experiment is complete, we write the data to disk. Finally, we run automatic analysis of the data on a computer cluster.

In the next sections I will introduce a full stack used for scientific experimentation. This stack is specific to Diamond Light Source, but it is similar to how other facilities have solved the same problems.

The web interface

We use a fairly conventional Java web application to schedule and set up experiments. It runs on Tomcat with an Oracle database, using Spring and Hibernate on the server side. We presently code and maintain the client using GWT (Google Web Toolkit); however, web front-ends seem to evolve rapidly, so that may change.
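To give a flavor of this layer, here is a minimal, hypothetical JPA/Hibernate entity of the kind such a Tomcat/Spring/Hibernate stack typically persists. The class name and fields are invented for illustration and do not reflect Diamond's actual schema.

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

// Hypothetical entity: Diamond's real proposal schema will differ.
@Entity
public class Proposal {

    @Id
    @GeneratedValue
    private Long id;

    private String title;
    private String principalInvestigator;
    private int requestedShifts;

    // Getters and setters omitted for brevity; Hibernate maps the fields
    // to columns, and Spring-managed services handle the transactions.
}
```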

Scientists use our web interface to submit proposals to use the machine, provide experimental data, and arrange to send samples. Once this process is complete, the experiment can begin.

The laboratory environment

Users can come to the synchrotron or use it remotely. On some beamlines an increasing share of experimental time is remote, and for many users remote access has already become the normal way to use the beamline. Whether you're working from a local control cabin or a remote desktop, you get the same environment to run your experiment.

At the time of this writing, the laboratory environment is built on top of RHEL6, with a thick client based on Eclipse RCP. (We use an e4-based platform.) Scientists use the front-end interface to move motors and view output. The interface is designed to speak the language of the experimenter, to allow them to define the experiment easily.

Working backward from the front-end, there is an acquisition server and middleware layer, and a hardware control layer.

Figure 3 is a diagram of the system, going from the hardware toward the front-end, which the user sees. I'll introduce each layer separately.

Matthew Gerring

Figure 3. Full stack software diagram

Embedded layer

Most hardware devices come with an API to which you can send commands. This embedded layer is device- and vendor-specific, and it is integrated into a standard control layer. In our case that control layer is EPICS, which is probably the most common solution. Another protocol (and methodology) commonly used by large science facilities is TANGO.

Hardware control

This layer wraps the embedded layer and provides a standard, performant interface to the device. In the case of a physical motor at Diamond, we use a PMAC controller, which is linked into EPICS. The PMAC is connected to an IOC (Input/Output Controller) based on Linux, or on VxWorks where real-time support is needed.

The IOC is composed of PMAC drivers, a PMAC database, and a channel access port. This last part exposes the device over Ethernet, which means any program (Java included) can connect to the device and control it. For a motor, there may be many process variables, or PVs. These can be thought of as addressed configuration values, and they are made available over a protocol called channel access. In essence, we end up with a protocol that can be used to move and configure devices. You could think of it as being similar to Redis for devices: records may be read and written, and devices respond to the given state.
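As a rough illustration of what channel access looks like from Java, here is a minimal sketch that assumes the open source JCA/CAJ bindings. The PV names are invented, error handling is kept to a minimum, and exact method details may vary between JCA versions.

```java
import gov.aps.jca.CAException;
import gov.aps.jca.Channel;
import gov.aps.jca.Context;
import gov.aps.jca.JCALibrary;
import gov.aps.jca.TimeoutException;
import gov.aps.jca.dbr.DBRType;
import gov.aps.jca.dbr.DBR_Double;

public class MotorMoveExample {

    public static void main(String[] args) throws CAException, TimeoutException {
        // Create a pure-Java channel access context (the CAJ implementation).
        Context ctx = JCALibrary.getInstance().createContext(JCALibrary.CHANNEL_ACCESS_JAVA);
        try {
            // PV names are hypothetical; a real beamline motor exposes
            // setpoint and readback PVs with site-specific naming.
            Channel setpoint = ctx.createChannel("BL99:SAMPLE:X");
            Channel readback = ctx.createChannel("BL99:SAMPLE:X.RBV");
            ctx.pendIO(5.0);                    // wait for the channels to connect

            setpoint.put(12.5);                 // request a move
            ctx.pendIO(5.0);                    // flush the write

            DBR_Double pos = (DBR_Double) readback.get(DBRType.DOUBLE, 1);
            ctx.pendIO(5.0);                    // complete the read
            System.out.println("Position: " + pos.getDoubleValue()[0]);
        } finally {
            ctx.destroy();
        }
    }
}
```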

Malcolm

Malcolm is the custom middleware server we built at Diamond, and it is currently only in use here. Other synchrotrons, such as the Australian Synchrotron and the European Synchrotron Radiation Facility, have similar C-Python layers. When a user requests a scan, Malcolm figures out which motors are fast, which are slow, and how the detector may be triggered. It encapsulates this into an individual runnable device, a component that the Java server uses when the user scans something.
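To make the idea concrete, here is a hypothetical sketch of what such a runnable device might look like from the Java server's side. It is invented for illustration; the real interfaces in Diamond's code and the Malcolm protocol are richer than this.

```java
// Hypothetical contract for a Malcolm-backed runnable device; not the actual API.
public interface IRunnableScanDevice<M> {

    /** Push the scan description (trajectory, detectors, exposure) down to Malcolm. */
    void configure(M scanModel) throws Exception;

    /** Block while Malcolm drives the motors and the area detector writes files. */
    void run() throws Exception;

    /** Ask Malcolm to stop the scan cleanly. */
    void abort() throws Exception;
}
```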

Malcolm also coordinates a pipeline system called area detector, which is part of EPICS and is used to write files in a performant way. Remember when I said that the detectors can produce data quickly? The area detector is responsible for piping that data to file.

Files and file systems

For the next layer of this architecture, look to the right of Figure 3. Here we travel into HDF5 and NeXus land. First, the synchrotron's detector is integrated into an area detector pipeline and controlled by Malcolm. Next, gigabit Ethernet carries the data to our scalable, distributed file system. You might be thinking that's just fine and the job is done. In fact, you need a file format that can scale, and that is where HDF5 comes in. HDF5 is a popular binary file format used for experiments, modeling, and financial applications (to name just a few examples). HDF5 allows nD data to be written at high speed. One of its great strengths is that it allows numerical data to be accessed as if it were in memory, even when it is hundreds of gigabytes in size.

We worked with other interested parties and the HDF5 Group to extend the format to allow writing and reading of the same dataset by different processes at the same time (SWMR, for single writer/multiple readers). This means that while the detector data is being written as fast as it can--to a stack of images, for instance--data earlier in the same sequence is being processed on the cluster by our analysis packages. A NeXus file is also an HDF5 file, but it is written with a particular structure. This allows experimental information to be recorded in such a way that analysis packages can use it.

For instance, you might have a large block of detector data, but what were the values of the stage motors when the detector triggered? NeXus records these "axes" and other experimental data, like temperature and pressure, in a standard way. In our case the NeXus file contains links to the HDF5 output written via Malcolm and the area detector. This turned out to be the fastest way of writing data when we benchmarked it on test rigs.
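To make that concrete, here is a simplified, illustrative NeXus layout. The exact group names and fields depend on the NeXus application definition used for the experiment.

```
entry:NXentry
  instrument:NXinstrument
    detector:NXdetector
      data            -> external link to the HDF5 written by the area detector
  sample:NXsample
    temperature
    x, y, theta       <- stage and goniometer positions recorded for each point
  data:NXdata         <- signal and axes attributes tell analysis where to look
```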

Data acquisition server

At last we come to some Java! The data acquisition server is an OSGi application running on a dedicated rack-mounted Linux server on the beamline. It is a client of the parallel distributed file system and can read and write data with high performance. The server coordinates writing the NeXus file and runs scans that either use a Malcolm device or address the detector directly using EPICS channel access. It then sends instructions for carrying out the experiment to the different devices and relays progress back to the client. In addition, the acquisition server starts and monitors online data analysis, which provides real-time feedback to the user. This is particularly useful to scientists: the more processed data they can see live during the experiment, the more effectively they can use the system and the better their data will be.

Data acquisition front-end

The front-end is written in Java, is OSGi-based (Eclipse RCP), and uses native widgets (via SWT/JFace) to render a fairly complex user interface for running the experiment. To break up the huge GUI, we use a concept called perspectives, which are like miniature programs. Each perspective presents a different workflow to the user, for instance alignment and calibration, scanning and experiment control, scripting, and data visualization. Currently we build one user interface product for each experiment from a common code base. We are investigating whether we could reduce build and support costs in the future by creating a single product whose perspectives are switched on and off depending on the context of usage. This would also allow developers to collaborate better and leave fewer areas of the code where only one person works. Sharing workload and expertise is good.
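For readers unfamiliar with RCP, here is a minimal sketch of a 3.x-style perspective factory of the kind such a workbench uses. The class name and view IDs are invented for illustration.

```java
import org.eclipse.ui.IPageLayout;
import org.eclipse.ui.IPerspectiveFactory;

// Hypothetical scanning perspective: the view IDs are invented.
public class ScanningPerspective implements IPerspectiveFactory {

    @Override
    public void createInitialLayout(IPageLayout layout) {
        String editorArea = layout.getEditorArea();
        layout.setEditorAreaVisible(false);

        // Scan set-up on the left, live plotting on the right.
        layout.addView("org.example.scan.setupView", IPageLayout.LEFT,  0.35f, editorArea);
        layout.addView("org.example.scan.plotView",  IPageLayout.RIGHT, 0.65f, editorArea);
    }
}
```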

Data analysis workbench: DAWN

One form of data analysis is done with DAWN, a freely downloadable, open source (permissively licensed), and thoroughly neat Java application. It is a user interface, a C-Python IDE, and an analysis-pipeline tool that can run headless on a cluster. In user interface mode, you can interact with your data visually (see the DAWN web site for many examples of this). You can construct a pipeline and then deploy it to run as automatic processing alongside your experiment. This means that you can graphically create processing on some test data, or data from a previous visit, and then deploy it to run with our acquisition server during your experiment.

The program's multiple entry points are achieved using Equinox's IApplication interface; a product declares one or more implementations, each providing simple methods to start and stop the program. DAWN components are used in this way to run the analysis pipelines on the cluster, and in the data acquisition user interface for online data analysis.
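Here is a minimal sketch of an Equinox IApplication entry point of the sort described above. The class name and the pipeline call are placeholders.

```java
import org.eclipse.equinox.app.IApplication;
import org.eclipse.equinox.app.IApplicationContext;

// Hypothetical headless entry point; the pipeline call is a placeholder.
public class HeadlessPipelineApplication implements IApplication {

    @Override
    public Object start(IApplicationContext context) throws Exception {
        String[] args = (String[]) context.getArguments()
                                          .get(IApplicationContext.APPLICATION_ARGS);
        // Run the configured analysis pipeline against the files named in args.
        // runPipeline(args);  // placeholder for the real pipeline entry point
        return IApplication.EXIT_OK;
    }

    @Override
    public void stop() {
        // Request a graceful shutdown of any running pipeline here.
    }
}
```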

Python automated processing

Another option for results processing is pre-built and automatic. This is created by experts in data analysis and comes as a complex layer of third-party and in-house packages glued together with Python and running on our cluster. Often the data analysis professionals have completed long-running research into how to process data into results. The pipelines they produce are deployed to run for the user automatically or offered as fixed choices. These analysis pipelines are developed at universities, by post-doctoral researchers on site, and by professional software developers. For instance, a pipeline might take diffraction data and use various algorithmic routes to predict an electron density map for a protein that a scientist is studying.

More APIs

You may have noticed that the full stack diagram shows some little boxes labeled "January" and "Scanning." These probably don't mean much to you yet: they are open source APIs that we've helped to develop and to which we're contributing. The Eclipse Foundation, specifically its Science working group, is helping with the process of turning these into open source projects. The Science group has worked with the community on the four listed below. If you think any could be useful in your project, they are open source and available on GitHub.

  • January is a project for handling numerical data in Java, similar to how numpy works for Python. Several developers invested time over more than 10 years (elapsed) to create a numpy-like API for Java. It also integrates with Python: its data structures traverse to and from C-Python as numpy arrays and have equivalents in Jython. Not only that, but the January nD arrays, known as "datasets" in January terminology, have a lazy option. This is defined by the interface ILazyDataset, and it is essentially a data array that does not yet exist in physical memory but can be loaded on demand when sliced (see the January sketch after this list). This concept turns out to be extremely useful for large data: using this mechanism, we created user interfaces that interact with data bigger than conventional commercial programs can handle. For instance, we created an isosurface algorithm that runs quickly on huge datasets, letting users on low-end laptops visualize their data.
  • Scanning is a new project that encapsulates algorithms designed to drive stage motors (or any device) and expose detectors. We use a Java ExecutorService to run devices to their required positions for a scan. This is multithreaded and moves devices in the most efficient asynchronous way; however, we also have a concept of level, which applies to each device (see the level-ordering sketch after this list).
    Devices are encapsulated by an interface called IScannable, and each has a level. The level is set in the Spring layer that creates the devices. The algorithm moves everything on a given level simultaneously using an ExecutorService, but devices on separate levels do not collide or interfere because they are not moved at the same time. This simple concept allows groups of devices to be moved together efficiently and safely.
  • RichBeans is a simple, low-dependency widget project. It provides widgets (in SWT) and a method for data-binding them to large bean trees (this works via an interface, so it is SWT-independent). The project contains some scientific widgets as well as basic data-entry ones, along with reflection-based binding for them. (Note that the reflection binding does not rely on setAccessible(true), so it will support the forthcoming changes in Java 9.)
  • Dawnsci is a project that declares the long-term APIs available to developers extending the DAWN product. As I said earlier, DAWN is a critical part of the analysis layer in Diamond's software stack. It provides the implementation of the plotting functions that the front-end products, such as data acquisition, use. So in this case we are using an open source project to do something a little different than usual: it is a way of making the DAWN product's programming interface modular and supported over the long term, and it has proved a great way to publish the API that DAWN implements. The dependencies a project takes on by using DAWN are made clear and public to the programmer, and it is also clear how to work with DAWN developers to make changes, for instance to add new plotting types.
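Below is a minimal sketch of the January slicing idea referenced above. It follows the project's published API (DatasetFactory, Dataset, Slice), though method details may differ slightly between releases; the lazy, file-backed case is only described in the comments.

```java
import org.eclipse.january.dataset.Dataset;
import org.eclipse.january.dataset.DatasetFactory;
import org.eclipse.january.dataset.Slice;

public class JanuaryExample {

    public static void main(String[] args) {
        // Build a small in-memory dataset and reshape it, numpy-style.
        Dataset image = DatasetFactory.createRange(1_000_000).reshape(1000, 1000);

        // Slice out a 10x10 corner; with an ILazyDataset backed by an HDF5
        // file, a call like getSlice() would read only this block from disk.
        Dataset corner = image.getSlice(new Slice(0, 10), new Slice(0, 10));

        System.out.println(corner.sum());   // aggregate over the slice
    }
}
```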
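And here is a hypothetical sketch of the level-ordering idea from the Scanning item. It is not the project's actual code, just an illustration of grouping moves by level and running each group through an ExecutorService.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical, much-simplified version of level-ordered device moves.
interface IScannable<T> {
    int getLevel();
    void setPosition(T position) throws Exception;
}

class LevelRunner {

    /** Move all devices for one scan point: levels run in order, devices within a level run in parallel. */
    public static void moveToPosition(Map<IScannable<Double>, Double> position) throws Exception {
        // Group the requested moves by device level; TreeMap keeps the lowest level first.
        Map<Integer, List<Callable<Void>>> byLevel = new TreeMap<>();
        position.forEach((device, target) ->
            byLevel.computeIfAbsent(device.getLevel(), l -> new ArrayList<>())
                   .add(() -> { device.setPosition(target); return null; }));

        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (List<Callable<Void>> moves : byLevel.values()) {
                // Devices on the same level move simultaneously; in production
                // code you would also inspect the returned Futures for errors.
                pool.invokeAll(moves);
            }
        } finally {
            pool.shutdown();
        }
    }
}
```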

Conclusion

This article should give you some sense of the full stack of applications required for cutting-edge science today. For the most part, we use off-the-shelf technologies written in Java and Python. Where new software has been needed, we have tried to encapsulate it into reusable projects so that the whole community can benefit.

These are early days for the software used in large experiments, and there are many areas for improvement. Many experiments are still configured by hand and only operated by machine. Software for configuration, setup, and experimental automation will surely improve in the near future.

Additionally, the rest of industry has already adopted cloud technologies. Instead of data being copied to an onsite data center, in the future there will be one or more science clouds available, where researchers can share large data sets with ease. Several clouds are already becoming popular for this, and future software platforms will integrate better with these data clouds.

The artificial intelligence (AI) revolution is also just beginning. Future scientific experiments will benefit from machine learning on past data, which can find existing datasets or predict outcomes. This will mean that accurate estimates of results are possible before an experiment is even complete, though of course experimental replication is still required for science to work properly. Perhaps an AI with low-level hardware control will be able to work in concert with researchers. In my mind, the question is not whether AI will find its place in experimental laboratories, but how soon that will happen.