For instance, the real price of the house with a size of 1330 is 6,500,000 €. In contrast, the price predicted by the trained target function is 7,032,478 €: a gap (or error) of 532,478 €. You can also find this gap in the chart above. The gap (or error) is shown as a vertical dotted red line for each training price-size pair.

To compute the *cost* of the trained target function, you must sum the squared errors of all houses in the example and calculate the mean value. The smaller the cost value of J(θ), the more precise the target function's predictions will be.

In Listing 3, the simple Java implementation of the cost function takes as input the target function, the list of training records, and their associated labels. The predicted value will be computed in a loop, and the error will be calculated by subtracting the real label value.

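Afterward, the squared errors are summed and averaged. Note that the listing divides the sum by 2m rather than m, so the returned double value is half the mean squared error; this constant factor does not change which thetas produce the lowest cost: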

```
public static double cost(Function<Double[], Double> targetFunction,
                          List<Double[]> dataset,
                          List<Double> labels) {
    int m = dataset.size();
    double sumSquaredErrors = 0;
    // calculate the squared error ("gap") for each training example and add it to the total sum
    for (int i = 0; i < m; i++) {
        // get the feature vector of the current example
        Double[] featureVector = dataset.get(i);
        // predict the value and compute the error based on the real value (label)
        double predicted = targetFunction.apply(featureVector);
        double label = labels.get(i);
        double gap = predicted - label;
        sumSquaredErrors += Math.pow(gap, 2);
    }
    // calculate and return half the mean of the squared errors (the smaller the better)
    return (1.0 / (2 * m)) * sumSquaredErrors;
}
```
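To make the calculation concrete, the following minimal sketch applies the same cost computation to three size/price pairs. The class name `CostExample` and the target function `price = 3 * size` are assumptions chosen only for illustration; they are not part of the example project.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class CostExample {

    // same computation as Listing 3: half of the mean squared error
    static double cost(Function<Double[], Double> targetFunction,
                       List<Double[]> dataset,
                       List<Double> labels) {
        int m = dataset.size();
        double sumSquaredErrors = 0;
        for (int i = 0; i < m; i++) {
            double gap = targetFunction.apply(dataset.get(i)) - labels.get(i);
            sumSquaredErrors += gap * gap;
        }
        return (1.0 / (2 * m)) * sumSquaredErrors;
    }

    public static void main(String[] args) {
        // hypothetical target function: x[0] is the constant 1.0, x[1] is the size
        Function<Double[], Double> targetFunction = x -> 3.0 * x[1];

        List<Double[]> dataset = Arrays.asList(
                new Double[] { 1.0, 90.0 },
                new Double[] { 1.0, 101.0 },
                new Double[] { 1.0, 103.0 });
        List<Double> labels = Arrays.asList(249.0, 338.0, 304.0);

        // gaps are 21, -35 and 5, so the cost is (441 + 1225 + 25) / 6
        System.out.println(cost(targetFunction, dataset, labels));
    }
}
```

The three gaps are squared and summed to 1691, and dividing by 2m = 6 gives a cost of roughly 281.83.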

## Training the target function

Although the cost function helps to evaluate the quality of the target function and its theta parameters, you still need to compute the best-fitting theta parameters. You can use the *gradient descent* algorithm for this calculation.

### Gradient descent

Gradient descent minimizes the cost function, meaning that it's used to find the combination of thetas that produces the *lowest cost* (J(θ)) based on the training data.

Here is a simplified algorithm to compute new, better fitting thetas:
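Consistent with the update that Listing 4 performs, one way to write the rule is shown below. The symbols are assumptions chosen to mirror the code: $h_\theta$ is the current target function, $m$ the number of training examples, and $x^{(i)}$ and $y^{(i)}$ the feature vector and label of example $i$.

$$
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
$$

All $\theta_j$ are updated from the same old theta vector, which is why the listing writes the results into a separate array.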

Within each iteration a new, better value will be computed for each individual θ parameter of the theta vector. The learning rate α controls the size of the computing step within each iteration. This computation will be repeated until you reach a combination of theta values that fits well. As an example, the linear regression function below has three theta parameters:

Within each iteration a new value will be computed for each theta parameter: θ_{0}, θ_{1}, and θ_{2} in parallel. After each iteration, you will be able to create a new, better-fitting instance of the `LinearRegressionFunction` by using the new theta vector of {θ_{0}, θ_{1}, θ_{2}}.

Listing 4 shows Java code for the gradient descent algorithm. The thetas of the regression function will be trained using the training data, data labels, and the learning rate (α). The output of the function is an improved target function using the new theta parameters. The `train()` method will be called again and again, and fed the new target function and the new thetas from the previous calculation. These calls will be repeated until the tuned target function's cost reaches a minimal plateau:

```
public static LinearRegressionFunction train(LinearRegressionFunction targetFunction,
                                             List<Double[]> dataset,
                                             List<Double> labels,
                                             double alpha) {
    int m = dataset.size();
    double[] thetaVector = targetFunction.getThetas();
    double[] newThetaVector = new double[thetaVector.length];
    // compute the new theta of each element of the theta array
    for (int j = 0; j < thetaVector.length; j++) {
        // sum up the error gap * feature
        double sumErrors = 0;
        for (int i = 0; i < m; i++) {
            Double[] featureVector = dataset.get(i);
            double error = targetFunction.apply(featureVector) - labels.get(i);
            sumErrors += error * featureVector[j];
        }
        // compute the new theta value
        double gradient = (1.0 / m) * sumErrors;
        newThetaVector[j] = thetaVector[j] - alpha * gradient;
    }
    return new LinearRegressionFunction(newThetaVector);
}
```

To validate that the cost decreases continuously, you can execute the cost function J(θ) after each training step. With each iteration, the cost must decrease. If it doesn't, then the value of the learning rate parameter is too large, and the algorithm will shoot past the minimum value. In this case the gradient descent algorithm fails.
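The check described above can be sketched as a training loop that recomputes the cost after every step. This is a minimal sketch rather than the example project's code: the class `TrainingMonitor`, its method names, and the `minImprovement` threshold are assumptions, and a plain `double[]` of thetas stands in for `LinearRegressionFunction`.

```java
import java.util.Arrays;
import java.util.List;

class TrainingMonitor {

    // hypothetical stand-in for the target function: h(x) = sum of thetas[j] * x[j]
    static double predict(double[] thetas, Double[] x) {
        double sum = 0;
        for (int j = 0; j < thetas.length; j++) {
            sum += thetas[j] * x[j];
        }
        return sum;
    }

    // same cost as Listing 3: half of the mean squared error
    static double cost(double[] thetas, List<Double[]> dataset, List<Double> labels) {
        double sumSquaredErrors = 0;
        for (int i = 0; i < dataset.size(); i++) {
            double gap = predict(thetas, dataset.get(i)) - labels.get(i);
            sumSquaredErrors += gap * gap;
        }
        return sumSquaredErrors / (2 * dataset.size());
    }

    // one gradient descent step, following the update in Listing 4
    static double[] step(double[] thetas, List<Double[]> dataset, List<Double> labels, double alpha) {
        int m = dataset.size();
        double[] newThetas = new double[thetas.length];
        for (int j = 0; j < thetas.length; j++) {
            double sumErrors = 0;
            for (int i = 0; i < m; i++) {
                Double[] x = dataset.get(i);
                sumErrors += (predict(thetas, x) - labels.get(i)) * x[j];
            }
            newThetas[j] = thetas[j] - alpha * sumErrors / m;
        }
        return newThetas;
    }

    // trains until the cost improves by less than minImprovement; fails fast if the cost rises
    static double[] trainUntilPlateau(double[] thetas, List<Double[]> dataset, List<Double> labels,
                                      double alpha, int maxIterations, double minImprovement) {
        double previousCost = cost(thetas, dataset, labels);
        for (int i = 0; i < maxIterations; i++) {
            double[] next = step(thetas, dataset, labels, alpha);
            double currentCost = cost(next, dataset, labels);
            if (currentCost > previousCost) {
                throw new IllegalStateException(
                        "cost increased at iteration " + i + "; try a smaller learning rate");
            }
            thetas = next;
            if (previousCost - currentCost < minImprovement) {
                break; // plateau reached
            }
            previousCost = currentCost;
        }
        return thetas;
    }

    public static void main(String[] args) {
        // made-up data following y = 2 * x; x[0] is the constant 1.0
        List<Double[]> dataset = Arrays.asList(
                new Double[] { 1.0, 1.0 }, new Double[] { 1.0, 2.0 }, new Double[] { 1.0, 3.0 });
        List<Double> labels = Arrays.asList(2.0, 4.0, 6.0);
        double[] thetas = trainUntilPlateau(new double[] { 1.0, 1.0 }, dataset, labels, 0.1, 100000, 1e-12);
        System.out.println(Arrays.toString(thetas));
    }
}
```

With a learning rate of 0.1 the loop converges on this data. With a learning rate of 1.0 the very first step raises the cost, and the loop stops with an exception instead of silently diverging.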

The diagram below shows the target function using the newly computed theta parameters, starting with an initial theta vector of `{ 1.0, 1.0 }`. The left-side column shows the prediction graph after 50 iterations; the middle column after 200 iterations; and the right column after 1,000 iterations. As you see, the cost decreases after each iteration, as the new target function fits better and better. After 500 to 600 iterations the theta parameters no longer change significantly and the cost reaches a stable plateau. The accuracy of the target function will no longer significantly improve from this point.

In this case, although the cost will no longer decrease significantly after 500 to 600 iterations, the target function is still not optimal; it seems to *underfit*. In machine learning, the term *underfitting* is used to indicate that the learning algorithm does not capture the underlying trend of the data.

Based on real-world experience, it is expected that the price per square metre will *decrease* for larger properties. From this we conclude that the model used for the training process, the target function, does not fit the data well enough. Underfitting is often due to an excessively simple model. In this case, it's the result of our simple target function using a single house-size feature only. That data alone is not enough to accurately predict the price of a house.

## Adding features and feature scaling

If you discover that your target function doesn't fit the problem you are trying to solve, you can adjust it. A common way to correct underfitting is to add more features into the feature vector.

In the housing-price example, you could add other house characteristics such as the number of rooms or the age of the house. Rather than using the single domain-specific feature vector of `{ size }` to describe a house instance, you could use a multi-valued feature vector such as `{ size, number-of-rooms, age }`.

In some cases, there aren't enough features in the available training data set. In this case, you can try adding polynomial features, which are computed from existing features. For instance, you could extend the house-price target function to include a computed squared-size feature (x_{2}):
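Assuming the feature layout used in the example code of this section (a constant 1.0, followed by the size and the squared size), the extended target function can be written as follows:

$$
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2, \qquad x_0 = 1,\quad x_1 = \text{size},\quad x_2 = \text{size}^2
$$

The function is still linear in the thetas, so the same gradient descent training applies unchanged.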

Using multiple features requires *feature scaling*, which is used to standardize the range of different features. For instance, the value range of the `size`^{2} feature is orders of magnitude larger than the range of the `size` feature. Without feature scaling, the `size`^{2} feature will dominate the cost function. The error value produced by the `size`^{2} feature will be much higher than the error value produced by the `size` feature. A simple algorithm for feature scaling is:
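Assuming the scaling meant here is mean normalization, which matches the average, minimum, and maximum constants that `FeaturesScaling` is described as computing, each feature value $x$ is replaced by:

$$
x' = \frac{x - \operatorname{avg}(x)}{\max(x) - \min(x)}
$$

After this step every scaled feature value lies between −1 and 1, regardless of its original range.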

This algorithm is implemented by the `FeaturesScaling` class in the example code below. The `FeaturesScaling` class provides a factory method to create a scaling function adjusted on the training data. Internally, instances of the training data are used to compute the average, minimum, and maximum constants. The resulting function consumes a feature vector and produces a new one with scaled features. The feature scaling is required for the training process, as well as for the prediction call, as shown below:

```
// create the dataset
List<Double[]> dataset = new ArrayList<>();
dataset.add(new Double[] { 1.0, 90.0, 8100.0 }); // feature vector of house#1
dataset.add(new Double[] { 1.0, 101.0, 10201.0 }); // feature vector of house#2
dataset.add(new Double[] { 1.0, 103.0, 10609.0 }); // ...
//...
// create the labels
List<Double> labels = new ArrayList<>();
labels.add(249.0); // price label of house#1
labels.add(338.0); // price label of house#2
labels.add(304.0); // ...
//...
// scale the extended feature list
Function<Double[], Double[]> scalingFunc = FeaturesScaling.createFunction(dataset);
List<Double[]> scaledDataset = dataset.stream().map(scalingFunc).collect(Collectors.toList());
// create hypothesis function with initial thetas and train it with learning rate 0.1
LinearRegressionFunction targetFunction = new LinearRegressionFunction(new double[] { 1.0, 1.0, 1.0 });
for (int i = 0; i < 10000; i++) {
    targetFunction = Learner.train(targetFunction, scaledDataset, labels, 0.1);
}
// make a prediction of a house with size of 600 m2
Double[] scaledFeatureVector = scalingFunc.apply(new Double[] { 1.0, 600.0, 360000.0 });
double predictedPrice = targetFunction.apply(scaledFeatureVector);
```
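The `FeaturesScaling` class itself is not listed here. The following is a minimal sketch of how such a factory method could look, assuming mean normalization; the class name `FeaturesScalingSketch` and the handling of constant features are assumptions, not the example project's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class FeaturesScalingSketch {

    // hypothetical counterpart of FeaturesScaling.createFunction(dataset)
    static Function<Double[], Double[]> createFunction(List<Double[]> dataset) {
        int n = dataset.get(0).length;
        double[] avg = new double[n];
        double[] min = new double[n];
        double[] max = new double[n];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // compute the average, minimum, and maximum of every feature column
        for (Double[] x : dataset) {
            for (int j = 0; j < n; j++) {
                avg[j] += x[j] / dataset.size();
                min[j] = Math.min(min[j], x[j]);
                max[j] = Math.max(max[j], x[j]);
            }
        }

        // the returned function scales any feature vector with the constants computed above
        return x -> {
            Double[] scaled = new Double[n];
            for (int j = 0; j < n; j++) {
                double range = max[j] - min[j];
                // assumption: a constant feature (such as the leading 1.0) is left unchanged
                scaled[j] = (range == 0) ? x[j] : (x[j] - avg[j]) / range;
            }
            return scaled;
        };
    }
}
```

Because the constants come from the training data only, the same function has to be reused for prediction input. A vector far outside the training range, such as the 600 m2 house, is therefore scaled to values well beyond the −1 to 1 interval.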

As you add more and more features, you may find that the target function fits better and better--but beware! If you go too far, and add too many features, you could end up with a target function that is *overfitting*.

### Overfitting and cross-validation

Overfitting occurs when the target function or model fits the training data *too well*, by capturing noise or random fluctuations in the training data. A pattern of overfitting behavior is shown in the graph on the far-right side below:

Although an overfitting model matches very well on the training data, it will perform badly when asked to solve for unknown, unseen data. There are a few ways to avoid overfitting.

- Use a larger set of training data.
- Use an improved machine learning algorithm by considering regularization.
- Use fewer features, as shown in the middle diagram above.
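Regularization, mentioned in the list above, can be sketched as an extra penalty term in the cost function. The sketch below assumes L2 (ridge) regularization, which is one common choice; the class name `RegularizedCost` and the `lambda` parameter are assumptions and do not appear in the example code.

```java
import java.util.List;

class RegularizedCost {

    // cost of Listing 3 plus an L2 penalty on all thetas except thetas[0]
    static double cost(double[] thetas, List<Double[]> dataset, List<Double> labels, double lambda) {
        int m = dataset.size();
        double sumSquaredErrors = 0;
        for (int i = 0; i < m; i++) {
            Double[] x = dataset.get(i);
            double predicted = 0;
            for (int j = 0; j < thetas.length; j++) {
                predicted += thetas[j] * x[j];
            }
            double gap = predicted - labels.get(i);
            sumSquaredErrors += gap * gap;
        }
        // large theta values are penalized; thetas[0] (the intercept) is conventionally excluded
        double penalty = 0;
        for (int j = 1; j < thetas.length; j++) {
            penalty += thetas[j] * thetas[j];
        }
        return (1.0 / (2 * m)) * (sumSquaredErrors + lambda * penalty);
    }
}
```

With `lambda` set to 0 this is the unregularized cost. Larger values push the training toward smaller thetas, which flattens the target function and makes it less able to follow noise in the training data.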

If your predictive model overfits, you should remove any features that do not contribute to its accuracy. The challenge here is to find the features that contribute most meaningfully to your prediction output.

As shown in the diagrams, overfitting can be identified by visualizing graphs. Even though this works well with two-dimensional or three-dimensional graphs, it becomes difficult if you use more than two domain-specific features. This is why cross-validation is often used to detect overfitting.

In a cross-validation, you evaluate the trained models using an unseen validation data set after the learning process has completed. The available, labeled data set will be split into three parts:

- The *training data set*.
- The *validation data set*.
- The *test data set*.

In this case, 60 percent of the house example records may be used to train different variants of the target algorithm. After the learning process, half of the remaining, untouched example records will be used to validate that the trained target algorithms work well for unseen data.

Typically, the best-fitting target algorithms will then be selected. The other half of the untouched example data will be used to calculate error metrics for the final, selected model. While I won't introduce them here, there are other variations of this technique, such as *k-fold cross-validation*.
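The 60/20/20 split described above can be sketched as follows. This is a minimal sketch under assumptions: the class name `DatasetSplit` and the fixed random seed are illustrative, and the method returns record indices so that feature vectors and labels can be picked from their lists consistently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class DatasetSplit {

    // returns three index lists: training (60 percent), validation and test (half of the rest each)
    static List<List<Integer>> split(int recordCount, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < recordCount; i++) {
            indices.add(i);
        }
        // shuffle so that each part is a random sample of the records
        Collections.shuffle(indices, new Random(seed));

        int trainEnd = recordCount * 6 / 10;
        int validationEnd = trainEnd + (recordCount - trainEnd) / 2;

        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, trainEnd)));
        parts.add(new ArrayList<>(indices.subList(trainEnd, validationEnd)));
        parts.add(new ArrayList<>(indices.subList(validationEnd, recordCount)));
        return parts;
    }
}
```

Only the first index list is used for training. The second is used to compare the trained variants, and the third is touched once, to compute the error metrics of the selected model.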

## Machine learning tools and frameworks: Weka

As you've seen, developing and testing a target function requires well-tuned configuration parameters, such as the proper learning rate or iteration count. The example code I've shown reflects a very small set of the possible configuration parameters, and the examples have been simplified to keep the code readable. In practice, you will likely rely on machine learning frameworks, libraries, and tools.

Most frameworks or libraries implement an extensive collection of machine learning algorithms. Additionally, they provide convenient high-level APIs to train, validate, and process data models. Weka is one of the most popular frameworks for the JVM.

Weka provides a Java library for programmatic usage, as well as a graphical workbench to train and validate data models. In the code below, the Weka library is used to create a training data set, which includes features and a label. The `setClassIndex()` method is used to mark the label column. In Weka, the label is defined as a *class*:

```
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

// define the feature and label attributes
ArrayList<Attribute> attributes = new ArrayList<>();
Attribute sizeAttribute = new Attribute("sizeFeature");
attributes.add(sizeAttribute);
Attribute squaredSizeAttribute = new Attribute("squaredSizeFeature");
attributes.add(squaredSizeAttribute);
Attribute priceAttribute = new Attribute("priceLabel");
attributes.add(priceAttribute);
// create the data set with an initial capacity of 5000 examples and fill it
Instances trainingDataset = new Instances("trainData", attributes, 5000);
// the last attribute (the price) is the label, called "class" in Weka
trainingDataset.setClassIndex(trainingDataset.numAttributes() - 1);
Instance instance = new DenseInstance(3);
instance.setValue(sizeAttribute, 90.0);
instance.setValue(squaredSizeAttribute, Math.pow(90.0, 2));
instance.setValue(priceAttribute, 249.0);
trainingDataset.add(instance);
// reuse the variable for the next example
instance = new DenseInstance(3);
instance.setValue(sizeAttribute, 101.0);
// ...
```

The data set (the `Instances` object) can also be stored and loaded as a file. Weka uses ARFF (Attribute-Relation File Format), which is supported by the graphical Weka workbench. This data set is used to train the target function, known as a *classifier* in Weka.