Indoor navigation with machine learning

Data Cleanup

If there is a clear relationship between the properties and the target variable, the data contains redundancy. Unsupervised learning reveals where that redundancy is broken (e.g., by a write error when acquiring the data).

Line 1 in Listing 15 transfers the input data to a pandas DataFrame, and line 2 assigns the data to one of the four clusters. The anonymous values 0 through 3 correspond to the rooms encountered in the supervised learning example. But which four rooms are identified?

Listing 15

Data Cleanup

01 dfu = pd.DataFrame(Xu, columns = [0, 1, 2, 3, 4, 5, 6])
02 dfu['Target'] = kmeans.predict(Xu)
03 kList = classifier.predict(clusterCenters)
04 transD = {i: el for i, el in enumerate(kList)}
05 dfu['Target'] = dfu['Target'].map(transD)

Line 3 in Listing 15 uses the Random Forest classifier (trained in the supervised learning example) to translate the four K-Means cluster centers back into the target values from supervised learning: the identifiers for the four rooms. Line 4 prepares a dictionary for this translation, and map in line 5 uses it to replace the cluster numbers with room names.
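Listing 15 assumes that a fitted K-Means model and its cluster centers are already available. The following sketch shows how kmeans and clusterCenters might have been prepared from the unlabeled signal data Xu; the hyperparameters are my assumptions, mirroring those used later in Listing 19:

from sklearn.cluster import KMeans

# Assumed setup: fit K-Means with four clusters on the unlabeled signal data Xu
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0)
kmeans.fit(Xu)

# Cluster centers that line 3 of Listing 15 passes to the Random Forest classifier
clusterCenters = kmeans.cluster_centers_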

Now I can compare the values: Does the room assignment from the source data match the clusters that K-Means found? To do this, in Listing 16 I add an additional column, Targetu, to the DataFrame object. The new DataFrame object dfgroup keeps only the rows where the two target columns differ, and the last line counts the differences for each room pairing.

Listing 16

Room Assignments Source Data

dfDu = df.copy()
dfDu['Targetu'] = dfu['Target']
dfDu[dfDu['Target'] != dfDu['Targetu']].iloc[:,-2:]
dfgroup = dfDu[dfDu['Target'] != dfDu['Targetu']].iloc[:,-2:]
dfgroup.groupby(['Target', 'Targetu'])['Targetu'].count()

Listing 17 shows the output from Listing 16. In 75 cases, K-Means considers the living room a better fit than the hallway assigned in the source data, and in four cases a better fit than the kitchen. Supervised learning had already shown that the hallway was interpreted as the living room eight times.

Listing 17

Room Assignments Output

Target        Targetu
Hallway       Living_room    75
Kitchen       Living_room     4
Patio         Kitchen         2
              Living_room     2
Living_room   Kitchen         2
              Patio           6

Further suggestions for cleaning up the data are only hinted at here. In my example, misassignments only occur between neighboring rooms. It is particularly difficult to distinguish the hallway from the living room and, to a lesser extent, the living room from the patio. There is little evidence of errors caused by carelessness (i.e., completely misassigned rooms).
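To see the full pattern of these neighboring mix-ups at a glance, a cross-tabulation of the two target columns serves as a simple confusion matrix. A minimal sketch, assuming the dfDu DataFrame from Listing 16:

import pandas as pd

# Cross-tabulate the original room labels against the K-Means assignments
confusion = pd.crosstab(dfDu['Target'], dfDu['Targetu'])
print(confusion)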

In addition, you could consider the classifier's decision statistics from supervised learning and remove the ambiguous samples, which improves the classifier's ability to learn. For the evaluation, you could define a threshold and introduce additional categories: The data is then classified as "probably living room or hallway" rather than with a supposedly unambiguous but actually uncertain statement.
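What that could look like with the Random Forest classifier from the supervised learning example is sketched below; predict_proba supplies the per-room probabilities, and the threshold value and combined labels are my assumptions:

import numpy as np

# Class probabilities for each sample (one column per room)
proba = classifier.predict_proba(Xu)
rooms = classifier.classes_

# Accept a prediction only if the most likely room is clearly above the threshold
threshold = 0.6  # assumed cut-off
labels = []
for p in proba:
    order = np.argsort(p)[::-1]  # rooms sorted by descending probability
    if p[order[0]] >= threshold:
        labels.append(rooms[order[0]])
    else:
        # Otherwise report the two most likely rooms instead of a shaky single answer
        labels.append(f'probably {rooms[order[0]]} or {rooms[order[1]]}')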

Reducing the Dimensions

Two-dimensional diagrams show the dependence of two parameters, and three parameters span a three-dimensional space. The fourth dimension is often illustrated by a time stamp on consecutive diagrams. In looking for Tom, I am dealing with seven component values. In order to be able to show the dependency on a target value, I picked out two component values earlier.

I did this advisedly: From supervised learning, I know the components with the highest prioritization. PCA (principal component analysis) condenses the information in the features and reduces the number of dimensions without knowing the target values. It is a powerful approach that also detects outliers; here, I limit myself to a few use cases.

Listing 18 turns out to be largely self-explanatory. After importing the PCA class, the code reduces the number of components, in this case from seven to seven, so nothing is gained initially. However, the fitted model provides the explained_variance_ratio_ attribute, and the cumsum function computes its cumulative sum. Figure 14 shows that a single component already explains 65 percent of the variance, and two components explain as much as 85 percent.

Listing 18

Principal Component Analysis

from sklearn.decomposition import PCA
pca_7 = PCA(n_components=7)
pca_7.fit(Xu)
x = list(range(1,8))
plt.grid()
plt.plot(x, np.cumsum(pca_7.explained_variance_ratio_ * 100))
plt.xlabel('Number of components')
plt.ylabel('Explained variance')
plt.show()
Figure 14: Component meanings.
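Because two components already cover 85 percent of the variance, the data can be projected onto just those two axes. Listing 19 expects this projection under the name pca_2_reduced; a minimal sketch of how it might be computed with the PCA class already imported in Listing 18 (the object name pca_2 is my assumption):

# Project the seven signal strengths onto the two strongest principal components
pca_2 = PCA(n_components=2)
pca_2_reduced = pca_2.fit_transform(Xu)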

With clusters = 4 (i.e., four clusters), Listing 19 outputs the Voronoi diagram in Figure 15, which reproduces a cluster distribution similar to Figure 10 without any knowledge of the prioritized components. The axes have lost their old meaning: In my example, the seven signal strengths give rise to two new scales with new metrics, and the slightly distorted point clouds are mirror images of each other.

Listing 19

Voronoi Diagram

from scipy.spatial import Voronoi, voronoi_plot_2d
from matplotlib import cm
x1, x2 = 1, 0
clusters = 4
# clusters = 18
Xur = pca_2_reduced
kmeansp = KMeans(n_clusters=clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeansp.fit(Xur)
y_pred = kmeansp.predict(Xur)
ccp = kmeansp.cluster_centers_[:,[x1, x2]]
fig, ax1 = plt.subplots(figsize=(4,5), dpi=120)
vor = Voronoi(ccp)
voronoi_plot_2d(vor, ax=ax1, line_width=1)
plt.scatter(Xur[:,x1], Xur[:,x2], s=3, c=y_pred, cmap=plt.get_cmap('viridis'))
plt.scatter(ccp[:, 0], ccp[:, 1], s=150, c='red', marker='X')
for i, p in enumerate(ccp):
  plt.annotate(f'$\\bf{i}$', (p[0]+1, p[1]+2))
plt.show()
Figure 15: Voronoi diagram with four clusters.

While decision trees do not require a metric, K-Means and PCA compare distances between data points. Typically, preprocessing scales the attributes to a comparable level. The error caused by omitting this step remains relatively small here because the signal strengths of all attributes are of a similar magnitude.
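For data with widely differing scales, scikit-learn's StandardScaler is a typical choice for this preprocessing step. A minimal sketch, not part of the original workflow, assuming the raw signal matrix Xu:

from sklearn.preprocessing import StandardScaler

# Bring every attribute to zero mean and unit variance before K-Means or PCA
scaler = StandardScaler()
Xu_scaled = scaler.fit_transform(Xu)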

Figure 16 illustrates that estimating the number of clusters plays an important role in the preparation. Changing just one variable, clusters = 18, paints a whole new picture. Thanks to the Voronoi cells [7] and the coloring, the diagram looks quite convincing, but it doesn't tell me where Tom is located.

Figure 16: Voronoi plot with 18 clusters.
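A common sanity check for the number of clusters, which this article does not carry out, is the silhouette score. A minimal sketch, assuming the reduced data Xur from Listing 19:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare candidate cluster counts by their average silhouette score (higher is better)
for k in (2, 4, 8, 18):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xur)
    print(k, silhouette_score(Xur, model.labels_))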

Conclusions

It is impossible to calculate Tom's location analytically, mainly because indoor obstacles weaken the signal and cause the measurements to vary. To find Tom, I instead relied on machine learning methods. The methods discussed in this article belong to weak artificial intelligence; so far, I have not seen any approaches from strong artificial intelligence (i.e., self-reflecting systems).

With supervised machine learning, I used the Random Forest classifier to categorize new data. K-Means, as an example of unsupervised learning, let me look at the data without a target variable, find interconnections, and evaluate the quality of the data. Combining the Random Forest classifier and K-Means, I cleaned up the data using semi-supervised learning.

In addition, Python's scikit-learn library provides easy access to machine learning programming, which gives users more time to explore the constraints and understand the dependencies of the results.

In the end, I think Tom is probably in the living room – or the hallway. Happy hunting!
