Machine learning for Java developers, Part 2: Deploying your machine learning model

How to build and deploy a machine learning data pipeline in a Java-based production environment


Calling the pipeline fit() method as shown above trains all of the included transformers and the final model. Typically, the required raw training dataset is provided by a data acquisition component. This component collects data from a variety of sources and prepares the data for ingestion into the machine learning pipeline. For instance, the Housedata Ingestion component shown below encapsulates data sourcing and produces raw house and price data records, which are fed into the estimation pipeline.


Figure 7. A flow diagram of the machine learning data pipeline

Internally, the Housedata Ingestion component may access a database of sales transactions as well as other sources, such as a database of geographical area data. Using an ingestion component decouples the machine learning pipeline from the data sources, so that changes in a data source do not impact the pipeline.
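The following is a minimal sketch of what such a component could look like. The HouseDataIngestion class and its two data source abstractions are hypothetical and serve only to illustrate the separation: the pipeline consumes the raw records this component produces and never touches the underlying databases.

import java.util.List;
import java.util.Map;

// Hypothetical ingestion component: it hides the concrete data sources behind
// methods that produce the raw records consumed by the estimation pipeline.
public class HouseDataIngestion {

   // Assumed abstractions over the sales-transaction and geographical-area databases.
   public interface SalesTransactionSource {
      List<Map<String, Object>> fetchHouseRecords();
      List<Double> fetchSalePrices();
   }

   public interface GeoAreaSource {
      Map<String, Object> fetchAreaFeatures(Object neighborhood);
   }

   private final SalesTransactionSource sales;
   private final GeoAreaSource geo;

   public HouseDataIngestion(SalesTransactionSource sales, GeoAreaSource geo) {
      this.sales = sales;
      this.geo = geo;
   }

   // Joins each sales record with its geographical area data and returns plain
   // records, so the pipeline never accesses the underlying databases directly.
   public List<Map<String, Object>> fetchRawHouseRecords() {
      List<Map<String, Object>> houses = sales.fetchHouseRecords();
      houses.forEach(house -> house.putAll(geo.fetchAreaFeatures(house.get("Neighborhood"))));
      return houses;
   }

   public List<Double> fetchRawSalePrices() {
      return sales.fetchSalePrices();
   }
}

Swapping the sales database for, say, a CSV export then only requires a new SalesTransactionSource implementation; the pipeline itself stays untouched.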

During the development process, different versions or variants of the pipeline may be trained and evaluated. For example, you might apply different thresholds to gradually weed outliers out of the data. Working with machine learning data pipelines is a highly iterative process; it is common to test many pipeline versions or variants during development, eventually selecting the most consistently accurate pipeline for production use.
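For instance, two variants might differ only in the outlier threshold passed to the DropNumericOutliners transformer. The sketch below assumes the Pipeline and transformer classes shown later in Listing 5; the thresholds 8 and 12 and the mean-absolute-error helper are illustrative choices, not part of the article's code.

import java.util.List;
import java.util.Map;

// Builds and compares two pipeline variants that differ only in the
// outlier threshold; the rest of the pipeline matches Listing 5.
public class PipelineVariantComparison {

   public Pipeline<Object, Double> newVariant(int lotAreaOutlierThreshold) {
      return Pipeline.add(new DropNumericOutliners("LotArea", lotAreaOutlierThreshold))
            .add(new AddMissingValuesTransformer())
            .add(new CategoryToNumberTransformer())
            .add(new AddComputedFeatureTransformer())
            .add(new DropUnnecessaryFeatureTransformer("YrSold", "YearRemodAdd"))
            .add(new HousePriceModel());
   }

   // Trains both variants on the same data and keeps the one with the
   // lower mean absolute error on a held-out test set.
   public Pipeline<Object, Double> selectBest(List<Map<String, Object>> trainHouses, List<Double> trainPrices,
                                              List<Map<String, Object>> testHouses, List<Double> testPrices) {
      var strict = newVariant(8);
      var lenient = newVariant(12);
      strict.fit(trainHouses, trainPrices);
      lenient.fit(trainHouses, trainPrices);
      return meanAbsoluteError(strict, testHouses, testPrices) <= meanAbsoluteError(lenient, testHouses, testPrices)
            ? strict : lenient;
   }

   private double meanAbsoluteError(Pipeline<Object, Double> pipeline,
                                    List<Map<String, Object>> testHouses, List<Double> testPrices) {
      var predictions = pipeline.predict(testHouses);
      double sum = 0;
      for (int i = 0; i < predictions.size(); i++) {
         sum += Math.abs(((Number) predictions.get(i)).doubleValue() - testPrices.get(i));
      }
      return sum / predictions.size();
   }
}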

Machine learning models in production

When you deploy the selected trained pipeline in production, you will be faced with new requirements. In order to meet production requirements such as reliability and maintainability, the packaging and deployment processes have to be reproducible. You should be able to re-package or re-deploy the pipeline with no change to its behavior, even if the training data changes. You also have to be able to test, or roll back to, older trained pipeline versions in case of erroneous system behavior in production.

Ensure your pipeline is reproducible

Ensuring that your machine learning pipeline is reproducible is easier said than done. Over time, your training dataset will change. It may increase in size as it gains more labeled data records, or it may decrease as data becomes unavailable due to external factors. Even if you use the exact same pipeline code, changes to your training dataset will produce different settings of the internal learnable pipeline parameters.

As an example, say you add a house record with a new MSZoning category, "A," which was not in the older dataset. In this case, although the transformation code is untouched, the internal CategoryToNumberTransformer map will include an additional entry for this new, unseen category:


{FV=1, RH=2, RM=3, C=5, …, RL=8, A=9, «default»=-1}

The newly trained pipeline's behavior therefore differs from that of the previously trained instance.
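To see why, consider a simplified sketch of how such a category map might be built during training. This is not the article's actual CategoryToNumberTransformer, only an illustration of the idea: because the map is learned from the data, its contents depend on which categories happen to appear in the training set.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of a category-to-number mapping that is learned
// during training (not the article's actual CategoryToNumberTransformer).
public class SimpleCategoryMapper {

   private final Map<String, Integer> categoryToNumber = new HashMap<>();

   // learning step: assign each distinct category seen in the training data a number
   public void fit(List<String> trainingCategories) {
      for (String category : trainingCategories) {
         categoryToNumber.putIfAbsent(category, categoryToNumber.size() + 1);
      }
   }

   // transform step: unseen categories fall back to the default value -1
   public int transform(String category) {
      return categoryToNumber.getOrDefault(category, -1);
   }
}

A training set that contains the new "A" zoning category thus yields a map with one more entry, and with it a pipeline instance that behaves differently from its predecessor.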

Use version control

To support reproducibility, the pipeline code as well as trained pipeline instances must be kept under strict version control. Following a traditional software development process, the data ingestion component should be versioned, released, and uploaded into a repository along with the untrained and trained pipeline components. Typically, you would use a build system such as Maven. In this case, the results of the build-and-release process, component binaries such as ingestion-housedata-2.2.3.jar and pipeline-estimate-houseprice-1.0.3.jar, could be stored in a repository like JFrog's Artifactory or Sonatype's Nexus.

CI/CD in the machine learning data pipeline

Machine learning data pipelines and CI/CD pipelines are not the same. A machine learning data pipeline controls the flow of data, transforming input data into output data or predictions. A CI/CD pipeline is used to build, integrate, deliver, and deploy software artifacts across different stages. The diagram below illustrates the difference between the two types of pipeline.


Figure 8. Data pipeline vs. CI/CD pipeline

If we wanted to integrate CI/CD into the machine learning data pipeline, we could build our JAR file artifacts during the CI/CD development stage. We could also extend the CI/CD pipeline to trigger the training process and provide the trained, serialized pipeline, which could then be deployed into the production environment.

As shown in Listing 4, the appropriate versions of the ingestion and pipeline components would be loaded from the repository to train a production-ready pipeline. In this example, the downloaded executable JAR files contain the compiled Java classes as well as a main class. When you execute ingestion.jar, the Ames Housing dataset is loaded internally and the raw house and price record files are produced.

Listing 4. A script to train and upload a machine learning data pipeline in a CI/CD context


#!/bin/bash

# define the pipeline version to train
groupId=eu.redzoo.ml
artifactId=pipeline-estimate-houseprice
version=1.0.3

echo task 1: copying ingestion jar to local dir
ingest_app_uri="https://github.com/grro/ml_deploy/raw/master/example-repo/lib-releases/eu/redzoo/ml/ingestion-housedata/2.2.3/ingestion-housedata-2.2.3-jar-with-dependencies.jar"
curl -s -L $ingest_app_uri --output ingestion.jar

echo task 2: copying pipeline jar to local dir
pipeline_app_uri="https://github.com/grro/ml_deploy/raw/master/example-repo/lib-releases/${groupId//.//}/${artifactId//.//}/$version/${artifactId//.//}-$version-jar-with-dependencies.jar"
curl -s -L $pipeline_app_uri --output pipeline.jar

echo task 3: executing ingestion.jar to produce houses.json and prices.json. Internally http://jse.amstat.org/v19n3/decock/AmesHousing.xls will be fetched
java -jar ingestion.jar train.csv houses.json prices.json

echo task 4: executing pipeline.jar to create and train a pipeline consuming houses.json and prices.json
version_with_timestamp=$version-$(date +%s)
pipeline_instance=$artifactId-$version_with_timestamp.ser
java -jar pipeline.jar houses.json prices.json $pipeline_instance

echo task 5: uploading trained pipeline
# the upload command is only echoed here, for illustration
echo curl -X PUT --data-binary "@$pipeline_instance" "https://github.com/grro/ml_deploy/blob/master/example-repo/model-releases/${groupId//.//}/${artifactId//.//}/$version_with_timestamp/$pipeline_instance"

Note that most shops use a platform like GitLab CI, Travis CI, CircleCI, Jenkins, or GoCD for CI/CD. All of these tools use a custom DSL (domain-specific language) to define CI/CD tasks. To keep the code examples simple, I've used bash scripts instead of tool-specific CI/CD task definitions for the code in Listing 4. When using a CI/CD platform, you would typically embed a stripped-down version of the example code within the CI/CD task definitions.

After the ingestion step of Listing 4 (task 3) has run, the executable pipeline.jar uses the raw dataset files to produce a trained pipeline instance. Internally, the pipeline's HousePricePipelineBuilder main class creates a new instance of the estimation pipeline. The newly created instance is trained and serialized into an output file such as pipeline-estimate-houseprice-1.0.3-1568611516.ser. This file contains the serialized state of the pipeline instance as a byte sequence, along with the names of the Java classes used.

To support reproducibility, the output filename includes the component version ID and a training timestamp. A new timestamp is generated for each training run. As a last step, the serialized trained pipeline file will be uploaded into a model repository.

Listing 5. Helper class to train a house price prediction pipeline


public class HousePricePipelineBuilder {

   public static void main(String[] args) throws IOException {
      new HousePricePipelineBuilder().train(args[0], args[1], args[2]);
   }

   public void train(String housesFilename, String pricesFilename, String instanceFilename) throws IOException {
      // read the raw house records and sale prices produced by the ingestion step
      var houses = List.of(new ObjectMapper().readValue(new File(housesFilename), Map[].class));
      var prices = List.of(new ObjectMapper().readValue(new File(pricesFilename), Double[].class));

      // train the pipeline and serialize the trained instance to a file
      var pipeline = newPipeline();
      pipeline.fit(houses, prices);
      pipeline.save(new File(instanceFilename));
   }

   public Pipeline<Object, Double> newPipeline() {
      return Pipeline.add(new DropNumericOutliners("LotArea", 10))
            .add(new AddMissingValuesTransformer())
            .add(new CategoryToNumberTransformer())
            .add(new AddComputedFeatureTransformer())
            .add(new DropUnnecessaryFeatureTransformer("YrSold", "YearRemodAdd"))
            .add(new HousePriceModel());
   }
}

Deployment: REST and Docker in the machine learning data pipeline

In order to make your newly trained pipeline instance available to end users and other systems, you will have to deploy it into a production environment. How you integrate the trained pipeline into the production environment will strongly depend on your target infrastructure, which could be a datacenter, an IoT device, a mobile device, and so on.

As one example, integrating the pipeline into a classic batch-oriented big data production environment requires providing a batch interface to train machine learning models and perform predictions. In a batch-oriented approach you would process your data in bulk using shared databases or filesystems like Hadoop.

In most cases, a pipeline can be trained offline, so a batch-oriented approach is often used for this purpose. For example, I used the batch-oriented approach for the HousePricePipelineBuilder, where input files are read from the filesystem. The downside of this approach is the time delay. In batch processing, data records are collected over a period of time and then processed together, all at once.

In contrast to training, running a trained pipeline in production often requires a more real-time approach. Processing incoming data as it arrives means that predictions are available immediately, without delay. To support real-time requirements, you could extend a big data infrastructure like Hadoop with a messaging or streaming platform like Apache Kafka. In this case, the pipeline would have to be connected to the streaming system and listen for incoming records.
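The following sketch shows what such a streaming front end could look like, assuming a hypothetical Kafka topic named house-records that carries JSON-encoded house records, and a Pipeline.load() overload that reads the serialized pipeline from an input stream (as used later in Listing 6). The topic name, broker address, and surrounding setup are assumptions rather than part of the article's example code.

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.io.FileInputStream;
import java.io.IOException;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Sketch of a streaming front end for the trained pipeline: it consumes
// JSON house records from a (hypothetical) Kafka topic and predicts prices
// as the records arrive.
public class StreamingEstimator {

   public static void main(String[] args) throws IOException {
      var pipeline = Pipeline.load(new FileInputStream(args[0]));   // trained pipeline instance (.ser file)
      var mapper = new ObjectMapper();

      var props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");
      props.put("group.id", "houseprice-estimator");
      props.put("key.deserializer", StringDeserializer.class.getName());
      props.put("value.deserializer", StringDeserializer.class.getName());

      try (var consumer = new KafkaConsumer<String, String>(props)) {
         consumer.subscribe(List.of("house-records"));
         while (true) {
            for (var record : consumer.poll(Duration.ofSeconds(1))) {
               // deserialize the incoming JSON record and feed it to the pipeline
               Map<String, Object> house = mapper.readValue(record.value(), Map.class);
               System.out.println("estimated price: " + pipeline.predict(List.of(house)));
            }
         }
      }
   }
}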

Machine learning with REST

An alternative to streaming or messaging would be to use an RPC-based infrastructure. Instead of consuming incoming records from a stream, in this case the pipeline listens for incoming remote calls such as HTTP requests. The machine learning pipeline will be accessed via a REST interface, as shown in the example below. Here, a minimal REST service handles incoming HTTP requests and uses the trained pipeline instance to perform predictions and send the HTTP response message. The trained pipeline instance will be loaded during the REST service's initialization procedure. To be able to deserialize the pipeline, its classes have to be available in the REST service's classpath.

Listing 6. A REST interface for the machine learning pipeline


@SpringBootApplication
@RestController
public class RestfulEstimator {
   private final Estimator estimator;

   RestfulEstimator(@Value("${filename}") String pipelineInstanceFilename) throws IOException  {
      this.estimator = Pipeline.load(new ClassPathResource(pipelineInstanceFilename).getInputStream());
   }

   @RequestMapping(value = "/predictions", method = RequestMethod.POST)
   public List<Object> batchPredict(@RequestBody ArrayList<HashMap<String, Object>> records) {
      return estimator.predict(records);
   }

   public static void main(String[] args) {
      SpringApplication.run(RestfulEstimator.class, args);
   }
}
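Once the service is running, a client can POST a JSON array of house records to the /predictions endpoint. The example below uses Java's built-in HTTP client; the host, port, and abbreviated payload are placeholders, since a real request would carry the full set of house features.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Example client call against the REST service shown in Listing 6.
// Host, port, and the request payload are illustrative placeholders.
public class EstimationClient {

   public static void main(String[] args) throws Exception {
      var json = "[{\"LotArea\": 9600, \"MSZoning\": \"RL\", \"YearBuilt\": 1976}]";

      var request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8080/predictions"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();

      var response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

      System.out.println(response.body());   // e.g. a JSON array containing the predicted price(s)
   }
}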

Typically, all artifacts required to run the server are packaged within a server JAR file. A server JAR file such as server-pipeline-estimate-houseprice-1.0.3-1568611516.jar could include the pipeline-estimate-houseprice-1.0.3.jar, the serialized pipeline pipeline-estimate-houseprice-1.0.3-1568611516.ser, and all required third-party libraries.

To build such an executable server JAR file, you could use a CI/CD pipeline as shown in Listing 7. The simplified bash script clones the source code of the generic REST service and adds the dependencies of the house price pipeline, as well as the serialized, trained pipeline file. In this case, the Maven build tool is used to compile and package the code. Maven resolves and merges the third-party library dependencies of the generic REST server and the house price pipeline during the build, making it easier to detect and avoid version conflicts between the generic REST server code and the pipeline code.

Note that the bash script below includes an additional step after building the executable server JAR file: in task 6, a Docker container image is built. The script thus provides both an executable server JAR file and a Docker container image.
