Full-stack software for cutting-edge science

Find out how open source Java- and Python-based tools power the UK's national synchrotron


Embedded layer

Most hardware devices come with an API to which you can send commands. This embedded layer is device- and vendor-specific, and it is integrated into a standard control layer, described next. In our case that layer is the EPICS control system, which is probably the most common choice. Another protocol (and methodology) commonly used by large science facilities is TANGO.

Hardware control

This layer wraps the embedded layer and provides a standard, performant interface to the device. In the case of a physical motor at Diamond, we use a PMAC controller, which is linked into EPICS. The PMAC is connected to an IOC (Input/Output Controller) running on Linux or VxWorks for real-time support.

The IOC is composed of PMAC drivers, a PMAC database, and a channel access port. This last part exposes the device over Ethernet, which means any program (Java included) can connect to the device and control it. A motor may have many process variables, or PVs. These can be thought of as addressed configuration values, and they are made available over a protocol called channel access. At the end of all this we have a protocol that can be used to move and configure devices. You could think of it as being similar to Redis for devices: records may be read and written, and devices respond to the given state.
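
To make this concrete, here is a hedged illustration of what one record in the IOC's database might look like; the PV name and field values are invented, and real Diamond templates are more involved:

    record(motor, "BL99P-MO-STAGE-01:X") {
        field(DESC, "Sample stage X")     # human-readable description
        field(DTYP, "asynMotor")          # device support, here an asyn-based motor driver
        field(EGU,  "mm")                 # engineering units
        field(VELO, "1.0")                # velocity in EGU per second
    }

And here is a minimal sketch of a Java client talking to that motor over channel access, assuming the open source JCA/CAJ client library; the PV names are the made-up ones above and error handling is trimmed:

    import gov.aps.jca.CAException;
    import gov.aps.jca.Channel;
    import gov.aps.jca.Context;
    import gov.aps.jca.JCALibrary;
    import gov.aps.jca.TimeoutException;
    import gov.aps.jca.dbr.DBR_Double;

    public class MotorMoveExample {
        public static void main(String[] args) throws CAException, TimeoutException {
            // Pure-Java channel access context (the CAJ implementation).
            Context context = JCALibrary.getInstance().createContext(JCALibrary.CHANNEL_ACCESS_JAVA);
            try {
                // Connect to the demand and readback PVs of the (made-up) motor record above.
                Channel demand   = context.createChannel("BL99P-MO-STAGE-01:X");
                Channel readback = context.createChannel("BL99P-MO-STAGE-01:X.RBV");
                context.pendIO(5.0); // wait for both connections to complete

                demand.put(12.5);    // ask the motor to move to 12.5 mm
                context.pendIO(5.0);

                DBR_Double dbr = (DBR_Double) readback.get();
                context.pendIO(5.0); // wait for the read to complete
                System.out.println("Position: " + dbr.getDoubleValue()[0]);
            } finally {
                context.destroy();
            }
        }
    }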

Malcolm

Malcolm is the custom middleware server we built at Diamond, and it is currently only in use here. Other synchrotrons, such as the Australian Synchrotron and the European Synchrotron Radiation Facility, have similar C-Python layers. When a user requests a scan, Malcolm figures out which motors are fast, which are slow, and how the detector may be triggered. It encapsulates this into a single runnable device, which is the component the Java server uses when the user scans something.

Malcolm also coordinates a pipeline system called area detector, which is part of EPICS and is used to write files in a performant way. Remember when I said that the detectors can produce data quickly? The area detector is responsible for piping that data to file.

Files and file systems

For the next layer of this architecture, look to the right of Figure 3. Here we travel into HDF5 and NeXus land. First, the synchrotron's detector is integrated into an area detector pipeline and controlled by Malcolm. Next, gigabit Ethernet carries the data to our scalable distributed file system. You might think that's the job done, but you also need a file format that can scale, and that is where HDF5 comes in. HDF5 is a popular binary file format used for experiments, modeling, and financial applications (to name just a few examples). It allows nD data to be written at high speed, and one of its great strengths is that it allows numerical data to be accessed as if it were in memory, even when the file is hundreds of gigabytes in size.

We worked with other interested parties and the HDF5 Group to extend the format so that the same dataset can be written and read by different processes at the same time, known as single writer, multiple readers (SWMR). This means that while the detector data is being written as fast as possible (to a stack of images, for instance), data earlier in the same sequence is already being processed on the cluster by our analysis packages. A NeXus file is also an HDF5 file, but one written with a particular structure, which allows experimental information to be recorded in a way that analysis packages can use.

For instance, you might have a large block of detector data, but what were the values of the stage motors when the detector triggered? NeXus records these "axes" and other experimental data, such as temperature and pressure, in a standard way. In our case the NeXus file contains links to the HDF5 files written via Malcolm and the area detector, which turned out to be the fastest way of writing data when we benchmarked it on test rigs.
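
As a rough illustration only (group names and shapes vary by beamline and are invented here), a NeXus scan file might be laid out like this, with the detector frames linked in from the area detector's HDF5 output and the motor positions recorded as axes:

    /entry                   (NXentry)
      /instrument            (NXinstrument)
        /detector            (NXdetector)
          data               --> link to the HDF5 frames written by the area detector
      /sample                (NXsample)
        temperature, pressure, ...
      /data                  (NXdata)
        @signal = "data"                             # which dataset holds the detector counts
        @axes   = ["stage_y", "stage_x", ".", "."]   # which datasets label each dimension
        data        [ny, nx, frame_rows, frame_cols]
        stage_y     [ny]     slow motor positions
        stage_x     [nx]     fast motor positions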

Data acquisition server

At last we come to some Java! The data acquisition server is an OSGi application running on a dedicated rack-mounted Linux server on the beamline. It is a client to the parallel distributed file system and can read and write data with high performance. The server coordinates writing the NeXus file and runs scans which either use a Malcolm device or address the detector directly using EPICS channel access. It sends the instructions to carry out the experiment to the different devices and relays progress back to the client. In addition, the acquisition server starts and monitors online data analysis, which provides real-time feedback to the user. This is particularly useful to scientists: the more processed data they can see live during the experiment, the more effectively they can use the system and the better their data will be.

Data acquisition front-end

The front-end is written in Java, is OSGi based (Eclipse RCP), and uses native widgets (via SWT/JFace) to render the fairly complex user interface needed to run the experiment. To break up the huge GUI, we use a concept called perspectives, which are like miniature programs. Each perspective presents a different workflow to the user, for instance alignment and calibration, scanning and running the experiment, or scripting and data visualization. Currently we build one user interface product for each experiment from a common code base. We are investigating whether, in the future, we could reduce build and support effort by creating a single product whose perspectives are switched on and off depending on the context of use. This would also let developers collaborate better and leave fewer areas of the code where only one person works. Sharing workload and expertise is good.
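
For readers who haven't met RCP perspectives, here is a minimal, hypothetical sketch of one; the class name and view IDs are invented, not Diamond's real ones. A perspective factory simply arranges the views a given workflow needs, and the product decides which perspectives to offer:

    import org.eclipse.ui.IPageLayout;
    import org.eclipse.ui.IPerspectiveFactory;

    /** A made-up "alignment" perspective: lays out the views used by that workflow. */
    public class AlignmentPerspective implements IPerspectiveFactory {

        @Override
        public void createInitialLayout(IPageLayout layout) {
            String editorArea = layout.getEditorArea();

            // Invented view IDs; a real product registers its own views in plugin.xml.
            layout.addView("org.example.views.CameraView",    IPageLayout.LEFT,   0.30f, editorArea);
            layout.addView("org.example.views.MotorControls", IPageLayout.BOTTOM, 0.65f, editorArea);
            layout.addView("org.example.views.ScanPlot",      IPageLayout.RIGHT,  0.50f, editorArea);
        }
    }

Each perspective is then declared against the org.eclipse.ui.perspectives extension point, which is what would make switching entire workflows on and off per product practical.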

Data analysis workbench: DAWN

One form of data analysis is done with DAWN, a freely downloadable, open source (permissively licensed), and thoroughly neat Java application. It is a user interface, a C-Python IDE, and an analysis-pipeline tool able to run headless on a cluster. In user interface mode, you can interact with your data visually (see the DAWN web site for many examples of this). You can also construct a pipeline and deploy it to run as automatic processing alongside your experiment. This means you can graphically create processing on test data, or data from a previous visit, and then deploy it to run with our acquisition server during your experiment.

The program's multiple entry points are achieved using Equinox's IApplication: a product can declare one or more implementations, each providing simple methods to start and stop the program. DAWN components run the analysis pipelines in this way on the cluster, and the same components are used in the data acquisition user interface for online data analysis.
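
The IApplication contract itself is small. Below is a hedged, minimal sketch of a headless entry point; the class name and behavior are invented. Each such class is declared against the org.eclipse.core.runtime.applications extension point, which is how one code base can expose both a GUI and a cluster pipeline runner:

    import org.eclipse.equinox.app.IApplication;
    import org.eclipse.equinox.app.IApplicationContext;

    /** Invented example of a headless entry point for running a processing pipeline. */
    public class HeadlessPipelineApplication implements IApplication {

        @Override
        public Object start(IApplicationContext context) throws Exception {
            // Command-line arguments passed after the application ID.
            String[] args = (String[]) context.getArguments()
                                              .get(IApplicationContext.APPLICATION_ARGS);
            System.out.println("Running pipeline with " + args.length + " argument(s)");
            // ... load the saved pipeline definition and process the data here ...
            return IApplication.EXIT_OK;
        }

        @Override
        public void stop() {
            // Called if the framework asks the application to shut down early.
        }
    }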

Python automated processing

Another option for results processing is pre-built and automatic. It is created by experts in data analysis and comes as a complex layer of third-party and in-house packages glued together with Python and running on our cluster. Often the data analysis professionals have done long-running research into how to process data into results. The pipelines they produce are deployed to run automatically for the user or offered as fixed choices. These analysis pipelines are forged at universities, in post-doctoral posts on site, and by professional software developers. For instance, a pipeline might take diffraction data and use various algorithmic routes to predict an electron density map for a protein that a scientist is studying.

More APIs

You may have noticed that the full stack diagram shows some little boxes saying "January" and "Scanning." These names probably don't mean much to you yet: they are open source APIs that we've helped develop and to which we're contributing. The Eclipse Foundation, specifically its Science working group, is helping with the process of making them open source projects, and the group has worked with the community on the four listed below. If you think any could be useful in your project, they are open source and available on GitHub.

  • January is a project for data in Java, similar to what numpy provides for Python. Several developers invested time over more than 10 years (elapsed) to create a numpy-like API for Java. It also integrates with Python: its data structures traverse to and from C-Python as numpy arrays and have equivalents in Jython. Not only that, but January's nD arrays, known as "datasets" in January terminology, have a lazy option. This is defined by the interface ILazyDataset, and it is essentially a data array which does not yet exist in physical memory but can be loaded when sliced. The concept turns out to be extremely useful for large data: using this mechanism, we created user interfaces that interact with data bigger than conventional commercial programs can handle. For instance, we created an isosurface algorithm which ran fast on huge datasets, on low-end laptops, for users trying to visualize their data. (A short January sketch follows this list.)
  • Scanning is a new project which encapsulates algorithms designed to drive stage motors (or any device) and expose detectors. We use a Java ExecutorService to run devices to their required positions for a scan. This is multithreaded and moves devices in the most efficient asynchronous way; however, we also have a concept of level, which applies to each device.
    Devices are encapsulated by an interface called IScannable, and each has a level. The level is set in the Spring layer which creates the devices. The algorithm moves everything at a given level simultaneously using an ExecutorService, but devices on separate levels do not collide or interfere because they are never moved at the same time. This simple concept allows groups of devices to be moved together efficiently and safely. (A simplified sketch of this level ordering also follows the list.)
  • RichBeans is a simple, low-dependency widget project. It provides widgets (in SWT) and a method for data-binding them to huge bean trees (this works via an interface, so it is SWT independent). The project contains some scientific widgets and basic data-entry ones, along with reflection-based binding for them. (Note that the reflection binding does not rely on setAccessible(true), so it will support the forthcoming changes in Java 9.)
  • Dawnsci is a project that declares the long-term APIs available to developers extending the DAWN product. As I said earlier, DAWN is a critical part of the analysis layer in Diamond's software stack, and it provides the implementation of the plotting functions which the front-end products, for instance data acquisition, use. So in this case we are using an open source project to do something a little different from usual: it makes the DAWN product's programming interface modular and supported in the long term. It has proved a great way to publish the API which DAWN implements. The dependencies a project takes on by using DAWN are made clear and public to the programmer, and it is also clear how to work with DAWN developers to make changes, for instance to add new plotting types.
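
Here is the January sketch promised above: a hedged example using the org.eclipse.january.dataset API, where the shape and arithmetic are invented purely for illustration:

    import org.eclipse.january.dataset.Dataset;
    import org.eclipse.january.dataset.DatasetFactory;
    import org.eclipse.january.dataset.DoubleDataset;
    import org.eclipse.january.dataset.Maths;

    public class JanuaryExample {
        public static void main(String[] args) {
            // Roughly numpy.arange(100).reshape(10, 10): a 10x10 dataset of doubles.
            Dataset image = DatasetFactory.createRange(DoubleDataset.class, 100).reshape(10, 10);

            // Element-wise arithmetic and statistics in the numpy style.
            Dataset corrected = Maths.subtract(image, image.mean());
            System.out.println("Shape: " + java.util.Arrays.toString(corrected.getShape()));
            System.out.println("Max:   " + corrected.max());
        }
    }

And here is a simplified sketch of the Scanning project's level ordering: devices that share a level are moved concurrently through an ExecutorService, and the next level only starts when the previous one has finished. The IScannable interface below is a cut-down stand-in for the real one, not its actual signature:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    /** Cut-down stand-in for IScannable: a device with a level that can be positioned. */
    interface IScannable {
        int getLevel();
        void setPosition(double position) throws Exception;
    }

    public class LevelOrderedMover {

        /** Moves every device to its target position, one level at a time. */
        public static void move(Map<IScannable, Double> targets) throws Exception {
            // Group the requested moves by level; TreeMap keeps the levels in ascending order.
            Map<Integer, Map<IScannable, Double>> byLevel = new TreeMap<>();
            for (Map.Entry<IScannable, Double> e : targets.entrySet()) {
                byLevel.computeIfAbsent(e.getKey().getLevel(), l -> new HashMap<>())
                       .put(e.getKey(), e.getValue());
            }

            ExecutorService pool = Executors.newCachedThreadPool();
            try {
                for (Map<IScannable, Double> level : byLevel.values()) {
                    // Start every move in this level concurrently...
                    List<Future<?>> moves = new ArrayList<>();
                    for (Map.Entry<IScannable, Double> e : level.entrySet()) {
                        moves.add(pool.submit(() -> {
                            e.getKey().setPosition(e.getValue());
                            return null; // Callable, so setPosition may throw
                        }));
                    }
                    // ...then wait for all of them before the next level begins, so devices
                    // on different levels are never moved at the same time.
                    for (Future<?> f : moves) {
                        f.get();
                    }
                }
            } finally {
                pool.shutdown();
            }
        }
    }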

Conclusion

This article should give you some sense of the full stack of applications required for cutting-edge science today. For the most part, we use off-the-shelf technologies written in Java and Python. Where new software has been needed, we have attempted to encapsulate it into reusable projects so that the whole community can benefit.

These are early days for the software used in large experiments, and there are many areas for improvement. Lots of experiments are still configured by hand and only operated by machine. Software used for configuration, setup, and experimental automation will surely improve in the near future.

Additionally, the rest of industry has already adopted cloud technologies. Instead of data being copied to an onsite data center, in the future there will be one or more science clouds available, where researchers can share large data sets with ease. Several clouds are already becoming popular for this, and future software platforms will integrate better with them.

The artificial intelligence (AI) revolution is also just beginning. Future scientific experiments will benefit from machine learning on past data, which can find existing datasets or predict outcomes. This will mean that accurate estimated results will be possible before an experiment is completed, though of course duplication is still required for science to work properly. Perhaps an AI with low-level hardware control will be able to work in concert with researchers. In my mind, the question is not whether AI will find its place in experimental laboratories, but how soon that will happen.
