## Artificial intelligence detects mileage patterns

#### Staying Normal

The regression only works if the training data was previously normalized to a constrained value range. If the script feeds the optimizer the unmodified Unix seconds as the mileage date, the algorithm goes haywire and produces increasingly nonsensical values, until it finally breaks the boundaries of the hardware's floating-point math and sets all parameters to `nan` (Not a Number).

Lines 31 to 37 in Listing 1 therefore normalize the training data by using pandas' `min()` and `max()` methods to find the minimum and maximum timestamps, then subtracting the minimum from all training values as an offset, and finally dividing by the min-max difference. This process normally results in training values between `0` and `1` (but beware the case `min` = `max`, which would mean dividing by zero), which the optimizer can process more efficiently.
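The normalization described above can be sketched in a few lines of pandas; the column name and the sample timestamps below are invented for illustration:

```python
import pandas as pd

# Hypothetical epoch-second timestamps; the column name is an assumption
data = pd.DataFrame({"epoch": [1486972800, 1490000000, 1493000000]})

# Min-max normalization: subtract the minimum as an offset,
# then divide by the (max - min) range
norm_off = data["epoch"].min()
norm_mult = data["epoch"].max() - data["epoch"].min()
assert norm_mult != 0          # guard against the min == max case
data["X"] = (data["epoch"] - norm_off) / norm_mult

# All values now fall between 0 and 1
print(data["X"].tolist())
```

The first value maps to `0.0` and the last to `1.0`, with everything else falling in between.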

With the learned parameters, it is now possible to reproduce historical values within the model's framework or predict the future. What mileage will the car have on June 2, 2019? The date has an epoch value of `1559516400`, which the model has to normalize just as in the training case. The offset of `1486972800`, found as `norm_off` in Figure 3, gets subtracted, and the result is then divided by the scaling factor `norm_mult` of `7686000`.

This results in an `X` value of `9.43`, which is substituted into the formula `Y = X * W + b` to predict a mileage of 94,115 for June 2, 2019 – all assuming, of course, that the model is accurate (i.e., that the increase is indeed linear) and that the three months of training data are sufficient to determine the slope of the curve more or less accurately.
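Plugging in the numbers from Figure 3, the prediction step looks roughly like this; note that the slope `W` and intercept `b` below are hypothetical placeholders, not the values the article's training run produced:

```python
# Normalization constants from Figure 3
norm_off = 1486972800    # offset subtracted from all timestamps
norm_mult = 7686000      # min-max range used as scaling factor

epoch = 1559516400       # June 2, 2019
X = (epoch - norm_off) / norm_mult
# X works out to roughly 9.44; the article truncates it to 9.43

# W and b are placeholders standing in for the trained parameters
W, b = 10000.0, 0.0
Y = X * W + b            # linear model: predicted mileage
```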

#### Keeping Back Data

To ensure that the model not only simulates the training data but also predicts the real future, AI specialists often break down the available data into a training and a test set. They train the model only with data from the training set; otherwise, the risk is that it will mimic the training data perfectly, including replicating any temporary outliers that do not occur later in production, causing the system to predict artifacts that are out of touch with reality.

If the test set remains untouched up to the end of the training runs and the model later also correctly predicts the test data, the AI system will most likely behave as expected later in a production environment.
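A minimal sketch of such a split using scikit-learn's `train_test_split()` function; the toy data standing in for the mileage readings is invented:

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for the mileage readings (hypothetical values)
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Hold back 25% of the data; the model never sees the test set during
# training, so it serves as a proxy for future production data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(len(X_train), len(X_test))
```

With ten samples and a 25 percent test fraction, seven samples end up in the training set and three in the test set.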

Now, my 30-year-old HP-41CV pocket calculator was already able to determine the parameters `W` and `b` from a collection of `X`/`Y` values by assuming a linear relationship and running a linear regression. However, TensorFlow can do much more, because it also understands neural networks and decision trees, as well as more complex regression techniques.
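What the pocket calculator did can be reproduced with NumPy's least-squares fit; the data points below are fabricated to follow an exactly linear relationship:

```python
import numpy as np

# Hypothetical X/Y pairs with a perfectly linear relationship
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 9500.0 * X + 1000.0   # slope W = 9500, intercept b = 1000

# Least-squares fit of a degree-1 polynomial, i.e., a linear regression
W, b = np.polyfit(X, Y, 1)
print(f"W={W:.0f} b={b:.0f}")   # recovers the slope and intercept
```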

#### No Simple Pattern

If you look closely at the daily mileage numbers, you will note that the increase is by no means precisely linear over time. Figure 4 shows the higher resolution mileage growth per day and illustrates that the rise is subject to huge fluctuations. For example, the car travels between 16 and 50 miles on most days, interrupted every so often by a pause of two consecutive days with no increase in mileage at all.

A person simply looking at the graph in Figure 4 will immediately see that the car is driven less on weekends than on workdays. For an AI system to offer the same kind of intuitive performance, the programmer needs to take it by the hand and guide it in the right direction.

If the dates are, for example, stated in epoch seconds, as is common on Unix, the AI system will never in its lifetime find out that the weekend happens every seven days, with less driving as a result. A linear regression would only stretch the last few data points into the future; a polynomial regression would produce completely insane patterns in a mad bout of overfitting.

The learning algorithms are also bad at handling incomplete data. If there are no measured values for certain `X` values (for example, on days when the car was only parked in the garage), the conscientious teacher needs to fill them with meaningful values (e.g., with zeros). Also, you need to add what is known as "expert knowledge" in the discipline of machine learning: Because the weekday of each date is known and will hopefully help the algorithm, a new CSV file (`miles-per-day-wday.csv`) simply provides the sequence number of the weekday (neural networks do not like strings, only numbers) alongside the daily mileage reading (Figure 5).
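One possible way to derive such a weekday column with pandas is sketched below; the epoch values and mileage numbers are invented, and the article's actual script may do this differently:

```python
import pandas as pd

# Hypothetical daily readings keyed by epoch seconds
df = pd.DataFrame({
    "epoch": [1494806400, 1494892800, 1494979200],  # May 15-17, 2017
    "miles": [42, 0, 38],
})

# Expert knowledge: derive the weekday as a number (Monday=0 ... Sunday=6),
# since neural networks want numbers, not strings like "Mon"
df["wday"] = pd.to_datetime(df["epoch"], unit="s").dt.dayofweek
df.to_csv("miles-per-day-wday.csv", index=False)
```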

Listing 2 then uses the `sklearn` framework to construct a neural network that it teaches to guess the associated day of the week based on the mileage. To do so, it first reads the CSV file and forms the data frame `X` with the mileage numbers from it, along with `y` as a vector containing the associated weekday numbers.

Listing 2: `neuro.py`

The `train_test_split()` function splits the existing data into a training set and a test set, which the `StandardScaler` normalizes in lines 19 to 22, because neural networks are extremely meticulous as far as the value range of the input values is concerned.

The multilayer perceptron of type `MLPClassifier` generated in lines 24 and 25 creates a neural network with two layers and stipulates that the training phase will run for 1,000 steps at most. Calling the `fit()` method then triggers the teach-in, during which the optimizer tries to adjust the internal receptor weights in a bout of supervised learning, evaluating the input until the error between the predicted value calculated from the training parameters and the anticipated value in `y_train` is minimized.
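The steps described above can be sketched as follows; the synthetic random data, the layer sizes, and the variable names are assumptions for illustration, not taken from Listing 2:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: daily mileage (X) and weekday number (y)
rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(100, 1))
y = rng.integers(0, 7, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Normalize the inputs: fit the scaler on the training data only
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Two hidden layers, at most 1,000 training steps
clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000)
clf.fit(X_train, y_train)           # the supervised teach-in
print(clf.score(X_test, y_test))    # accuracy on the held-back test set
```

Since the stand-in data is random, the accuracy printed here hovers around chance level; with real mileage data, the score reflects how well the network has generalized beyond the training set.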

The results were not all that exciting in the experiment, partly because the predicted values varied greatly from call to call and the precision left something to be desired; still, the neural network predicted the weekday from a given mileage correctly in most cases. A richer set of input parameters would likely lead to better results.

With TensorFlow and SciKits, curious users have two sophisticated frameworks for experimentation with AI applications at their disposal. Getting started is anything but child's play because the literature [3] [4] on the latest features is still fairly recent and not very mature; also, a number of works are still in the development stage. However, it is worth exploring the matter, because this area of computer science undoubtedly has a bright future ahead of it.

Infos

1. "Programming Snapshot – Driving Data" by Mike Schilli, *Linux Pro Magazine*, issue 202, September 2017, p. 50, http://www.linuxpromagazine.com/Issues/2017/202/Programming-Snapshot-Driving-Data
2. Listings for this article: ftp://ftp.linux-magazine.com/pub/listings/linux-magazine.com/<issue no.>/
3. Guido, Sarah, and Andreas C. Müller. *Introduction to Machine Learning with Python*. O'Reilly Media, 2016
4. Géron, Aurélien. *Hands-On Machine Learning with Scikit-Learn and TensorFlow*. O'Reilly Media, 2017

