Machine learning with Python: An introduction

Find out why developers are choosing Python for machine learning--with and without Java

1 2 Page 2
Page 2 of 2

Listing 3. Estimator.py


import pickle
from sklearn.linear_model import LinearRegression
class HousePriceEstimator:
    def __init__(self):
        self.model = LinearRegression()
    def extract_core_features_list(self, house_list):
        features_list = []
        for house in house_list:
            features_list.append(self.extract_core_features(house))
        return features_list
    def extract_core_features(self, house):
        # returns the most relevant features as numeric values
        return [house['MSSubClass'],
                house['LotArea'],
                house['OverallQual'],
                house['OverallCond'],
                int(house['YrSold'])]
    def train(self, house_list, price_list):
        features_list = self.extract_core_features_list(house_list)
        self.model.fit(features_list, price_list)
    def predict_prices(self, house_list):
        features_list = self.extract_core_features_list(house_list)
        return self.model.predict(features_list)  # returns the predicted price_list
    def save(self, model_filename):
        pickle.dump(self.model, open(model_filename, 'wb'))
    @staticmethod
    def load(model_filename):
        estimator = HousePriceEstimator()
        estimator.model = pickle.load(open(model_filename, 'rb'))
        return estimator

The extract_core_features() method is part of the HousePriceEstimator class. This class encapsulates the machine learning model by providing methods to support model training and prediction, as well as saving and restoring the model. Internally, the model uses the Scikit-learn library, which is a collection of advanced machine learning algorithms for Python.

Instantiating a new HousePriceEstimator creates a new LinearRegression model instance. The model will then be trained by executing the train() method  along with a list of houses and their associated prices. The internal model parameter values of the model instance will be adjusted based on the training data. After the training process, HousePriceEstimator is able to estimate house prices by executing the predict_price() method.

Testing models with Jupyter Notebook

You can use Jupyter notebook to explore data sets, and you can also use it to develop and test machine learning models. In Figure 4, a scatter plot gives a visual overview of the relationship between real and predicted housing prices. The closer you are to the diagonal line the more accurate the predicted price will be. In practice, such visualizations will be enriched by numeric metrics such as accuracy or precision values.

mlpython fig4 Gregor Roth

Figure 4. Validating the quality of a model

Machine learning in production

Once you've finalized and trained a model, you will want to deploy it into a production environment. We often use different tools to meet different requirements in the development and production environments. Production environments are driven by requirements such as reliability and scalability, whereas tools and systems in a development environment are more focused on facilitating thinking and experimentation. Jupyter Notebook is mostly used in the development stage of the machine learning lifecycle. A runtime environment is required to process the model in production.

Different lifecycle stages also have different runtime performance requirements. As one example, the process of training a production-ready model typically requires considerable computation power. It isn't uncommon for complex algorithms, such as deep neural networks, to include millions of model parameters, all of which have to be trained. Python's execution speed is significantly slower than C/C++ or Java's, but this does not mean using Python in production is inefficient. Python-based machine learning libraries like Scikit-learn get around performance issues by relying on lower level packages written in C/C++ or Fortran, which are capable of more efficient computations.

Other popular machine learning frameworks and libraries such as TensorFlow are mostly written in highly-optimized C++, but provide Python language bindings.

Cloud solutions for machine learning

Cloud integrations with TensorFlow or Scikit-learn make it possible to run trained models in a cloud environment. For instance, Google's Cloud Machine Learning (ML) Engine offers a web service for handling HTTP prediction requests, which can then be forwarded to a trained machine learning model.

Figure 5 shows a web-based approach to performing an online prediction.

mlpython fig5 Gregor Roth

Figure 5. Performing online predictions

Built-in support from Google's Cloud ML Engine is currently restricted to a few popular machine learning libraries. The platform is also connected with other Google cloud services.

Mixing Python and Java for machine learning

Both Java and Python have strengths, so you might consider the advantages of a mixed environment for machine learning.

Jython, a Python interpreter written in Java, is one option for bridging the two programming language ecosystems. Jython enables you to call Python code in Java-based web services within the same JVM. Unfortunately, Jython is severely limited for machine learning: First, the current Jython implementation does not support Python 3.x, and currently goes only to Python 2.7. Much worse, Jython doesn't support running libraries like NumPy or SciPy. That's a real problem because many Python-based tools for machine learning depend on such libraries.

Instead of running Python code on the JVM, you might consider using Java's ProcessBuilder to execute Python code from Java. Essentially, ProcessBuilder lets you build and interact with Python processes from a Java-based web service, making it possible to run a trained machine learning model inside the Python process. Implementing this approach in a reliable and performant way requires considerable experience, however. You will need to reduce the overhead of creating a new Python processes for each prediction request, and you will need a way to synchronize your Java-code updates with Python-code updates, especially when the interaction protocol between the Java and Python processes changes.

An alternative to ProcessBuilder is using JNI/JNA to call Python from Java, in this case calling the Python code via a C++ bridge. A popular library implementing this approach is JPy. Like ProcessBuilder, a solution based on JNI/JNA also requires special attention to performance and reliability.

Interoperable solutions for Python and Java

Some interoperable models allow you to use Python to develop your machine learning models and process them using Java-based libraries like Eclipse's Deeplearning4j. Using Deeplearning4j, you would run Keras-based models, which are written in Python.

Another approach is to use an interoperability standard such as PMML (Predictive Model Markup Language) or ONNX (Open Neural Network Exchange Format). These provide a standard format to represent machine learning models, so that you can share models between machine learning libraries. An advantage of the standards-based approach is that it allows you to switch to another machine learning stack within the production environment.

Support for interoperability is limited for feature extraction, however, where one of the first steps is to pre-process the raw input data. In most cases, large parts of the pre-processing code are not covered by interoperability standards or models, so you are forced to reimplement feature extraction code whenever the language environment is switched. Reimplementing that code is not trivial, especially because it is often the major portion of the application code for a machine learning project.

Using Java by itself

Switching between machine learning libraries and programming languages to train and process models poses a variety of restrictions and difficulties, depending on the approach you choose. The simplest approach, by far, is to stay with a single machine learning library and programming language throughout the machine learning process.

As I demonstrated in my previous article, it is entirely possible to use Java for all phases of machine learning. Weka is a strong, Java-based framework that provides graphical interfaces and execution environments, as well as library functions to explore data and build models. Eclipse Deeplearning4j (DL4J) is built for Java or Scala, and it integrates with Hadoop and Apache Spark.

A Java-only approach likely makes the most sense for shops that are strongly committed to the Java stack and able to employ Java-skilled data scientists. Another option is to use Python for both the development and production phases of the machine learning lifecycle.

Building a Python-based web service for machine learning

In this section we'll quickly explore a pure, Python-based solution for machine learning. To start, we build a Python-based web service:

Listing 4. estimator_service.py


from datetime import datetime
from flask import Flask, request, jsonify
from flask_restful import Resource, Api
from estimator import HousePriceEstimator
class HousePriceEstimatorResource(Resource): 
    def __init__(self, estimator):
        self.estimator = estimator
    def post(self):
        house = request.get_json()  # parses the request payload as json
        predicted_price = self.estimator.predict_prices([house])[0]
        return jsonify(price=predicted_price, date=datetime.now())
estimator = HousePriceEstimator.load('model.p')
estimator_app = Flask(__name__)  # creates the application object used by the WSGI server
api = Api(estimator_app)
api.add_resource(HousePriceEstimatorResource, '/predict', resource_class_kwargs={'price_estimator': estimator})

In this example, we've used the Python microframework Flask to implement a RESTful service, which accepts house price prediction requests and forwards them to the estimator. The RESTful extension of Flask provides a Resource base class which is used to define the handling of HTTP methods for a given resource URL. The HousePriceEstimatorResource class extends the Resource base class by implementing the post() method of the /predict endpoint. By executing the post() method, the received house data record is used to call the HousePriceEstimator's predict_prices() method. The estimate price will then be returned by adding a prediction timestamp, as shown below.

Listing 5. Curl-based example call


root@2ec2dc33c182:/datascience# curl -H'Content-Type: application/json' -d'{"MSSubClass": 60, "LotArea": 8450, "MSZoning": "RL", "LotShape":"Reg", "Neighborhood":"CollgCr", "OverallQual": 67, "OverallCond": 55, "YrSold": 56}' -XPOST http://127.0.0.1:8080/predict
{
  "date": "Thu, 20 Dec 2018 06:40:50 GMT",
  "price": 3606324.4901064923
}

The Python code after the HousePriceEstimatorResource class definition loads the serialized machine-learning model and uses it to provide a Flask application instance called estimator_app. We can then run the estimator_service.py service using the the WSGI Gunicorn server, as shown below. The server will be started with the Python module name (which is in this case the Python source filename without the ".py" extension) and the variable name containing the app.

Listing 6. Running the Python web service


root@2ec2dc33c182:/datascience# gunicorn --bind 0.0.0.0:8099 estimator_service:estimator_app
[2018-11-11 13:18:05 +0000] [6] [INFO] Starting gunicorn 19.9.0
[2018-11-11 13:18:05 +0000] [6] [INFO] Listening at: http://0.0.0.0:8099 (6)
[2018-11-11 13:18:05 +0000] [6] [INFO] Using worker: sync
[2018-11-11 13:18:05 +0000] [9] [INFO] Booting worker with pid: 9

Similar to the Java Servlet specification, Python's WSGI standard defines an interface between web servers and Python web applications. In contrast to the Java Servlet API, the programming model of a Python web application is still framework specific. As an example, the naming of the post() method is defined by the Flask framework and is not part of the WSGI spec.

The WSGI interface gives you the flexibility to choose the best-fitting WSGI web server for your environment. One option is to use the lightweight Python-based Gunicorn server only for development activities. In production, you could use something more flexible, like Nginx/uWSGI. To deploy the code on production nodes you could build a Docker container image, which would include the WSGI web server, the application code, and the required libraries and serialized model file. The Docker image would provide a web-based microservice providing house price predictions.

mlpython fig6 Gregor Roth

Figure 6. Estimator docker container image

In conclusion

As I demonstrated in my previous article, it is entirely possible to use Java for all phases of machine learning: exploration, data preparation, modeling, evaluation, and deployment. Weka is a rich, Java-based framework that provides graphical interfaces and execution environments, and for deep, neural network machine learning models you could use Deeplearning4j. There is even an integration with Weka that allows you to use models developed with Deeplearning4j in the Weka workbench.

For shops that can be more flexible, there are many advantages to choosing Python. Data scientists are more likely to be familiar with Python and its tools ecosystem, and developers taking on the role of a data scientist will benefit from the many useful and efficient Python-based machine learning tools. Tools and libraries such as Jupyter, pandas, or Scikit-learn will move you through the process of analytics and model development, reducing iteration time from data analysis, data preparation, and model building to deploying your model in production.

Once your model is in production, you have a variety of options for deploying it: you may experiment with a mixed environment that supports a Java backend, move your Python-based application to the cloud, or develop your own RESTful Python-based web service or microservice.

1 2 Page 2
Page 2 of 2