Artificial intelligence detects mileage patterns
Staying Normal
The regression only works if the training data was previously normalized to a constrained value range. If the script feeds the optimizer the unmodified Unix seconds as the mileage date, the algorithm goes haywire and produces increasingly nonsensical values, until it finally breaks the boundaries of the hardware's floating-point math and sets all parameters to nan (Not a Number).
Lines 31 to 37 in Listing 1 therefore normalize the training data: they use pandas' min() and max() methods to find the minimum and maximum timestamps, subtract the minimum from all training values as an offset, and finally divide by the min-max difference. This normally results in training values between 0 and 1 (caution: if min equals max, this divides by zero), which the optimizer can process far more efficiently.
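The normalization step can be sketched like this; the data frame and its values are made up for illustration, while the variable names norm_off and norm_mult mirror those in Figure 3:

```python
import pandas as pd

# Made-up training data: Unix timestamps and odometer readings
df = pd.DataFrame({
    "epoch": [1486972800, 1489392000, 1491811200, 1494230400],
    "miles": [60000, 61200, 62500, 63800],
})

# Min-max normalization: subtract the minimum as an offset,
# then divide by the min-max difference
norm_off = df["epoch"].min()
norm_mult = df["epoch"].max() - norm_off
df["x"] = (df["epoch"] - norm_off) / norm_mult

print(df["x"].tolist())  # first value is 0.0, last is 1.0
```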
With the learned parameters, it is now possible to reproduce historical values within the model's framework or to predict the future. What mileage will the car have on June 2, 2019? The date has an epoch value of 1559516400, which the model has to normalize just as in the training case: the offset of 1486972800, found as norm_off in Figure 3, is subtracted, and the result is then divided by the scaling factor norm_mult of 7686000. This yields an X value of 9.43, which substituted into the formula Y = X * W + b predicts a mileage of 94,115 for June 2, 2019 – all assuming, of course, that the model is accurate (i.e., that the increase is indeed linear) and that the three months of training data are sufficient to determine the slope of the curve more or less accurately.
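The arithmetic above can be retraced in a few lines; note that W and b are placeholders here, since the real values come out of the training run:

```python
norm_off = 1486972800   # minimum timestamp of the training data
norm_mult = 7686000     # max-min spread of the timestamps

epoch = 1559516400      # June 2, 2019
x = (epoch - norm_off) / norm_mult
print(round(x, 2))      # roughly 9.4

# With the learned slope W and intercept b (made-up placeholder
# values here!), the prediction follows the formula y = x * W + b
W, b = 3600.0, 60000.0
y = x * W + b
```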
Keeping Back Data
To ensure that the model not only simulates the training data but also predicts the real future, AI specialists often break down the available data into a training and a test set. They train the model only with data from the training set; otherwise, the risk is that it will mimic the training data perfectly, including replicating any temporary outliers that do not occur later in production, causing the system to predict artifacts that are out of touch with reality.
If the test set remains untouched up to the end of the training runs and the model later also correctly predicts the test data, the AI system will most likely behave as expected later in a production environment.
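Holding back a test set is typically a one-liner with scikit-learn's train_test_split(); a sketch with toy data:

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 feature rows and matching labels
X = [[i] for i in range(100)]
y = [i % 7 for i in range(100)]

# Hold back 25% as the test set; the model trains only on X_train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(len(X_train), len(X_test))  # 75 25
```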
Now, my 30-year-old HP-41CV pocket calculator was already able to determine the parameters W and b from a collection of X/Y values by assuming a linear relationship and running a linear regression. However, TensorFlow can do much more, because it also understands neural networks and decision trees, as well as more complex regression techniques.
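The pocket-calculator variant – a plain least-squares fit of W and b – takes a single line with NumPy (toy data below):

```python
import numpy as np

# Toy data lying exactly on the straight line y = 10000*x + 60000
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = 10000.0 * x + 60000.0

# A degree-1 polynomial fit returns slope W and intercept b
W, b = np.polyfit(x, y, 1)
print(W, b)  # close to 10000 and 60000
```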
No Simple Pattern
If you look at the daily mileage numbers closely, you will note that the increase is by no means precisely linear over time. Figure 4 shows the mileage growth per day at higher resolution and illustrates that the rise is subject to huge fluctuations. For example, the car travels between 16 and 50 miles on most days, interrupted every so often by a pause of two consecutive days with no increase in mileage at all.
A person simply looking at the graph in Figure 4 will immediately see that the car is driven less on weekends than on workdays. For an AI system to offer the same kind of intuitive performance, the programmer needs to take it by the hand and guide it in the right direction.
If the dates are, for example, stated in epoch seconds, as is common on Unix, the AI system will never in its lifetime find out that the weekend happens every seven days, with less driving as a result. A linear regression would only stretch the last few data points into the future; a polynomial regression would produce completely insane patterns in a mad bout of overfitting.
The learning algorithms are also bad at handling incomplete data. If there are no measured values for certain X values – for example, on days when the car was only parked in the garage – the conscientious teacher needs to fill them with meaningful values (e.g., with zeros). Also, you need to add what is known as "expert knowledge" in the discipline of machine learning: Because the weekday of each date is known and will hopefully help the algorithm, a new CSV file (miles-per-day-wday.csv) simply provides the sequence number of the weekday (neural networks do not like strings, only numbers) alongside the daily mileage reading (Figure 5).
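Deriving the weekday number from an epoch timestamp is straightforward with pandas; a small sketch with made-up timestamps:

```python
import pandas as pd

# Three consecutive days as Unix timestamps (Feb 13-15, 2017)
epochs = pd.Series([1486972800, 1487059200, 1487145600])

# dayofweek encodes Monday as 0 through Sunday as 6
wdays = pd.to_datetime(epochs, unit="s").dt.dayofweek
print(wdays.tolist())  # [0, 1, 2] - Monday, Tuesday, Wednesday
```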
Listing 2 then uses the sklearn framework to construct a neural network that it teaches to guess the associated day of the week based on the mileage. To do so, it first reads the CSV file and forms the data frame X with the mileage numbers from it, along with a vector y containing the associated weekday numbers.
Listing 2: neuro.py
The train_test_split() function splits the existing data into a training set and a test set, which the StandardScaler normalizes in lines 19 to 22, because neural networks are extremely meticulous about the value range of their input values.
The multilayer perceptron of type MLPClassifier, generated in lines 24 and 25, creates a neural network with two layers and stipulates that the training phase will run for 1,000 steps at most. Calling the fit() method then triggers the teach-in, during which the optimizer adjusts the internal receptor weights in a bout of supervised learning until the error between the value predicted from the training parameters and the anticipated value in y_train is minimized.
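Since Listing 2 is not reproduced here, the following is a rough reconstruction of its core steps from the description above; the column names, the hidden layer sizes, and the synthetic stand-in data (used instead of reading miles-per-day-wday.csv) are all assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for miles-per-day-wday.csv: weekdays see
# 16-50 miles of driving, weekends hardly any
rng = np.random.default_rng(0)
wday = rng.integers(0, 7, size=300)
miles = np.where(wday < 5,
                 rng.uniform(16, 50, 300),
                 rng.uniform(0, 5, 300))
df = pd.DataFrame({"miles": miles, "wday": wday})

X = df[["miles"]]  # feature: daily mileage
y = df["wday"]     # label: weekday number

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1)

# Normalize, because neural networks are picky about input ranges
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Two hidden layers, at most 1,000 training iterations
clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)  # fraction of correctly guessed weekdays
```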
The results were not all that exciting in the experiment: the predicted values varied greatly from call to call, and the precision left something to be desired; yet, the neural network predicted the weekday from a given mileage in most cases. Experimenting with different input parameters would likely lead to better results.
With TensorFlow and scikit-learn, curious users have two sophisticated frameworks for experimenting with AI applications at their disposal. Getting started is anything but child's play, because the literature [3] [4] on the latest features is still fairly recent and not very mature; a number of works are also still in the development stage. However, it is worth exploring the matter, because this area of computer science undoubtedly has a bright future ahead of it.
Infos
- [1] "Programming Snapshot – Driving Data" by Mike Schilli, Linux Pro Magazine, issue 202, September 2017, p. 50, http://www.linuxpromagazine.com/Issues/2017/202/Programming-Snapshot-Driving-Data
- [2] Listings for this article: ftp://ftp.linux-magazine.com/pub/listings/linux-magazine.com/<issue no.>/
- [3] Müller, Andreas C., and Sarah Guido. Introduction to Machine Learning with Python. O'Reilly Media, 2016
- [4] Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly Media, 2017