Setting up a data analytics environment in Linux with Python

Down in the Mine

© Photo by Luca Maffeis on Unsplash

Article from Issue 259/2022
Author(s):

The Knowledge Discovery in Databases (KDD) method breaks the business of data analytics into easy-to-understand steps. We'll show you how to get started with KDD and Python.

Data analytics is a major force in the current zeitgeist. Analytics serve as eyes and ears in a very wide variety of domains (society, climate, health, etc.) and support an even wider variety of tasks (such as understanding commercial trends, tracking the spread of COVID-19, and finding exoplanets). In this article, I will discuss some fundamentals of data analytics and show how to get started with analytics in Python. Finally, I will show the whole process at work on a simple data analytics problem.

A Primer on Data Analytics

Data analytics uses tools from statistics and computer science (CS), such as artificial intelligence (AI) and machine learning (ML), to extract information from collected data. The collected data is usually very complex and voluminous, and it cannot be interpreted easily (or at all) by humans. Therefore, the data on its own is useless. Information lies hidden within the data, and it takes many forms: repeating patterns, trends, classifications, or even predictive models. You can use this information to uncover insights and build knowledge of the problem you are studying. For example, suppose you wish to measure the traffic in a parking lot that belongs to a city-wide network of IoT occupancy sensors. The reading of a single occupancy sensor doesn't say anything about the traffic on its own. Neither do the readings of all the parking sensors of the city without any more context. But the timestamped percentage of occupied spaces within the monitored parking lot does tell you something, and you can use this information to derive insights, such as the times of day with maximum traffic.

Learning the mathematical background and analytics tools is only half the journey. Field expertise (experience on the problem that is being studied) is equally important. Some data scientists come from a statistics background, others are computer scientists who pick up the statistics as they go, and many are people starting from a field of expertise who need to learn both the statistics and the computing tools.

One important approach to data analytics is to use the Knowledge Discovery in Databases (KDD) model [1], which is also referred to as Knowledge Discovery and Data Mining. The KDD process (see Figure 1) takes as input raw data from diverse sources (sensors, databases, logs, polls, etc.) and outputs information in the form of graphics, reports, and tables. The process has five steps:

  • Selection – normally, data sources are far more comprehensive than needed. Sensors might collect data from time periods or spatial locations that are outside the range of interest, or variables that are of no interest to the problem. This step narrows the data down to the part that you know contains the information of interest. In the case of the parking example, you may only want to select data referring to the parking lot you want to monitor, as opposed to other parking lots throughout the city.
  • Preprocessing – data is dirty; in other words, it may contain wrong or missing values caused by measurement errors or system failures. These errors can cause problems down the line, such as failures (in the best case) or hidden biases in the extracted information (in the worst case). Detecting wrong and missing data and filling it or dropping samples is part of the preprocessing stage. In the parking example, you might find null values when parking sensors fail to send a reading, or inconsistent values such as negative numbers.
  • Transformation – once a clean dataset is in place, you might need to change its format to fit the requirements of the next stage. Tasks such as binning, converting from strings to numbers, obtaining parameters from images, etc. are just some of the thousands of possible actions in this stage. For the parking lot example, you might wish to do some binning, transforming the timestamp into a label representing an hour of a specific day.
  • Data mining – this is the core of the whole KDD process. Data mining uses algorithms that extract the information from the clean, appropriately formatted data. Mining could consist of simple statistical computations (averages, standard deviations, and percentiles) or complex ML/AI processes (such as deep learning or unsupervised classification). In the parking lot example, you might wish to extract a time profile that represents the occupancy of the parking lot per hour. The process might be something like calculating the percentage of occupied spaces per hour and day, and then averaging over several days at the same hour.
  • Interpretation/evaluation – after you extract the necessary information, you still need one more step in which the results are validated before using the data to generate insights. Especially for complex outputs, such as predictive models, you need to evaluate the accuracy (normally with a separate validation dataset that was not used for the data mining process). Finally, you can use the information to generate insights or predictions. In the parking example, the resulting model could be used in a report that highlights the need for expanding the lot due to saturation at peak hours.
Figure 1: KDD is a systematic process for analyzing data.
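To make the five steps more concrete, here is a minimal sketch of the parking lot example that uses only the Python standard library. The readings, the lot names (lot_a, lot_b), and the function names are all made up for illustration; a real project would more likely read the data from a database and rely on libraries such as Pandas.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical raw readings: (timestamp, lot ID, occupied spaces, capacity)
RAW_READINGS = [
    ("2022-03-01 08:10", "lot_a", 40, 100),
    ("2022-03-01 08:40", "lot_a", 60, 100),
    ("2022-03-01 09:15", "lot_a", 90, 100),
    ("2022-03-01 09:20", "lot_b", 10, 50),     # other lot: dropped by selection
    ("2022-03-02 08:30", "lot_a", None, 100),  # missing value: dropped by preprocessing
    ("2022-03-02 08:45", "lot_a", 70, 100),
    ("2022-03-02 09:05", "lot_a", -5, 100),    # negative value: dropped by preprocessing
    ("2022-03-02 09:30", "lot_a", 80, 100),
]

def select(readings, lot_id):
    """Selection: keep only the readings of the lot of interest."""
    return [r for r in readings if r[1] == lot_id]

def preprocess(readings):
    """Preprocessing: drop missing or physically impossible values."""
    return [r for r in readings if r[2] is not None and 0 <= r[2] <= r[3]]

def transform(readings):
    """Transformation: bin timestamps by (day, hour); convert counts to percentages."""
    rows = []
    for stamp, _lot, occupied, capacity in readings:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        rows.append((t.date(), t.hour, 100 * occupied / capacity))
    return rows

def mine(rows):
    """Data mining: mean occupancy per day and hour, then averaged over the days."""
    per_day_hour = defaultdict(list)
    for day, hour, pct in rows:
        per_day_hour[(day, hour)].append(pct)
    per_hour = defaultdict(list)
    for (_day, hour), values in per_day_hour.items():
        per_hour[hour].append(mean(values))
    return {hour: mean(values) for hour, values in sorted(per_hour.items())}

def occupancy_profile(readings, lot_id):
    """Chain the first four KDD steps for one parking lot."""
    return mine(transform(preprocess(select(readings, lot_id))))

profile = occupancy_profile(RAW_READINGS, "lot_a")
# Interpretation: the hour with the highest average occupancy
peak_hour = max(profile, key=profile.get)
print(profile)    # {8: 60.0, 9: 85.0}
print(peak_hour)  # 9
```

Each function corresponds to one step, so any of them can be swapped out (for instance, filling missing values instead of dropping them) without touching the rest of the chain.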

All of the tasks that form part of the KDD process must be supported by a computing platform. You'll need appropriate network connections to the sources, data storage, and a rich toolset to preprocess, transform, and mine the data. Linux is an ideal platform for all of this, thanks to its great network capabilities, the availability of databases (both small ones like SQLite and large, unstructured ones like MongoDB), and the great variety of FOSS tools for data processing and representation (such as LaTeX, gnuplot, or web servers for interactive and real-time reports). Among data scientists, Python stands out as an easy and capable programming language with a very comprehensive set of libraries for processing data from very diverse fields. And all of this comes with the advantages of FOSS.

Updating with Pip

It may seem redundant to have a package manager for Python when there are so many wonderful distro-specific package managers in Linux. But pip has some unique functionality that makes it especially useful. Pip has a browsable repository [2], where the latest versions of libraries are promptly available. You can download and upgrade packages throughout the life cycle of a project to integrate new functions or fixes. The first package you should upgrade is pip itself, with pip install pip --upgrade. This is normally done right after creating a new virtual environment. In general, to upgrade an installed package, enter pip install package --upgrade, replacing package with the name of the package.
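If you prefer to check versions from within Python, the following sketch may help. It reads the installed version of a package through importlib.metadata and builds, without running it, the upgrade command for the interpreter that is currently in use. The function names installed_version and upgrade_command are my own and are not part of pip.

```python
import sys
from importlib import metadata

def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def upgrade_command(package):
    """Build (but do not run) the pip command that upgrades a package.

    sys.executable points to the running interpreter, so the command uses
    the pip that belongs to it (for example, in an active virtual environment).
    """
    return [sys.executable, "-m", "pip", "install", "--upgrade", package]

print("pip version:", installed_version("pip"))
print("upgrade with:", " ".join(upgrade_command("pip")))
```

To actually perform the upgrade, the list can be passed to subprocess.run(); I have left that out so the sketch does not modify your environment.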

The Python Programming Language

The principal benefits of Python are its ease of use, code clarity, and extensibility. Another important advantage of Python is the very wide ecosystem of libraries for many different fields. See the box entitled "Python Elements" for more on the basic components of the Python environment. No matter the topic (astronomy, macroeconomics, personal accounting, computer vision …), you will find a library in Python tailored to it. Coincidentally, data mining is also applicable to many different fields. The availability of both field-specific libraries and data analytics libraries makes Python quite appealing to data scientists. The box entitled "Python Data Science Libraries" highlights some of the important libraries used with data analytics applications.

Python Data Science Libraries

One supreme advantage of Python over other platforms is the rich ecosystem of libraries for data analytics, along with a myriad of smaller, field-specific libraries. Important libraries include:

  • NumPy – the NumPy library provides a set of tools that make Python an efficient language for numerical computation. NumPy consists of the following elements: the ndarray object (which implements n-dimensional vectors and arrays), the operators and functions needed to perform mathematical calculations with ndarray efficiently, functions to read and write data from disk and memory, and various mathematical algorithms (e.g., generation of series, random numbers, transforms, etc.). NumPy is at the core of most mathematical libraries in Python.
  • Pandas – like NumPy, Pandas provides new data types (mainly, the Series and the DataFrame objects), and a whole ecosystem of functions around them.
  • Scikit-learn – provides a set of algorithms that can be used with medium-sized datasets to train classifiers and regressors with machine learning, as well as predictive models. In other words, the Scikit-learn library is mainly for the data mining stage of the analytics process. It offers algorithms such as random forests, neural networks, and ensemble methods. Scikit-learn also provides functions for preprocessing the data before the data mining stage.
  • Keras/TensorFlow – for projects with very large datasets, TensorFlow provides a deep learning platform, which trains multi-layer neural networks for classification and regression. Keras is a high-level interface for TensorFlow, which simplifies its usage.
  • Matplotlib – provides a very rich set of functions for graphical representations. From simple line plots to animated 3D graphs, Matplotlib allows almost any kind of graphic representation. It also includes functions for annotating graphs and manipulating axes. Matplotlib is the equivalent of a Swiss army knife for graphic representations, although the learning curve for complex graphs might be steep. There are many other libraries for graphic representations in Python, tailored to specific uses and based on Matplotlib. The Matplotlib library is especially useful in the final stage of the data analytics process, for representing information in an understandable manner.
  • Seaborn – based on Matplotlib, Seaborn provides quick data visualizations that include empirical probability density functions, linear regression plots, grid views, and more. The Seaborn library is very useful for exploratory investigation of the data, before deciding, for instance, which algorithms will be applied in the data mining stage.
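A quick way to find out which of these libraries are already available to your interpreter is to ask Python whether it can locate them. The sketch below uses importlib.util.find_spec from the standard library, which looks up a module without importing it. The dictionary maps each library to its import name (note that Scikit-learn is imported as sklearn); the function name available_libraries is just an illustrative choice.

```python
from importlib.util import find_spec

# Library name -> name used in the import statement
LIBRARIES = {
    "NumPy": "numpy",
    "Pandas": "pandas",
    "Scikit-learn": "sklearn",
    "TensorFlow": "tensorflow",
    "Keras": "keras",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
}

def available_libraries(libraries=LIBRARIES):
    """Return a dict that tells, for each library, whether it can be imported."""
    return {name: find_spec(module) is not None
            for name, module in libraries.items()}

for name, found in available_libraries().items():
    print(f"{name:13} {'installed' if found else 'missing'}")
```

The result depends on the environment in which the script runs, so running it inside and outside a virtual environment may give different answers.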

Python Elements

The Python environment consists of several important elements, including the Python interpreter, the package manager, the shell, and the virtual environment manager.

The interpreter is the virtual machine that reads and executes the code. A Python interpreter is present by default in most Linux distributions. Although most users rely on the vanilla interpreter (CPython), there are several alternative, specialized interpreters, such as Jython (integrated with the Java VM), PyPy (more performant than vanilla Python), or MicroPython (geared towards microcontrollers). Because of its compatibility with libraries, vanilla Python is recommended for most tasks.
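If you are not sure which interpreter you are running, the standard library can tell you. The following sketch reports the implementation, the version, and the location of the running interpreter; the function name interpreter_info is my own.

```python
import platform
import sys

def interpreter_info():
    """Describe the interpreter that is executing this code."""
    return {
        "implementation": platform.python_implementation(),  # e.g., "CPython" or "PyPy"
        "version": platform.python_version(),
        "executable": sys.executable,
    }

for key, value in interpreter_info().items():
    print(f"{key}: {value}")
```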

The package manager downloads and installs libraries that can be used to extend the basic functionality of Python. The default package manager is called pip, and it has an online repository of more than 300,000 libraries called the Python Package Index (PyPI) [2]. An alternative to pip is Conda, which is related to the Anaconda Python Distribution. Anaconda packages the basic data science tools of Python, and it is especially useful for Windows and macOS users, where Python is not that well integrated into the system. Anaconda is also available for Linux, but it may add one layer of complexity in exchange for providing a sane collection of preinstalled packages. See Table 1 for a comparison between Anaconda Python and vanilla Python in Linux.

Both Conda and pip can coexist in an install, but it is better to not mix them up if possible. Some package developers also distribute their libraries without integrating them into any repository. In that case, there are several common installation methods (easy_install or the setup.py script). Although the repositories see some kind of curation (albeit far from secure), downloading packages from the Internet without checking the code is always a bad idea.

Another important element is the shell, which is the interactive interface to the interpreter, not unlike a command-line shell such as Bash or Zsh. The basic shell is offered by the interactive mode of the vanilla Python interpreter. It can be invoked by calling python in a terminal (see Figure 2), and it can be used for small tasks and testing out simple code. IPython offers an improved interactive experience, with functions such as syntax highlighting and code completion. You can also use IDEs, such as PyCharm (which has a FOSS community edition) or Spyder (which will be familiar to users coming from MATLAB).

Figure 2: The vanilla Python interpreter.

A final important component is the virtual environment manager, which is a utility that creates isolated setups of the Python interpreter and packages for different projects. There is a base environment, associated with the system Python interpreter, that is rooted within the basic OS filesystem. Each virtual environment is rooted in a specific directory (normally within the home directory of the user). By default, an environment created with venv is isolated from the packages of the base environment; it can be given access to them with the --system-site-packages option. All the previously described elements can "live" either in the base environment or within a virtual environment. The basic manager included with Python is venv. While there are many other compatible alternatives (mainly to support older versions of Python), it is good practice to use venv. Another alternative available to Anaconda users is Conda, which also has the capability of creating virtual environments. Conda cannot be mixed with venv.
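From within Python, you can check whether the running interpreter belongs to a virtual environment by comparing sys.prefix (the root of the current environment) with sys.base_prefix (the root of the interpreter it was created from). The sketch below wraps that comparison in a function whose name is my own choice. Note that it only recognizes venv-style environments; as far as I know, it does not detect Conda environments.

```python
import sys

def in_virtual_environment():
    """Return True if the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix

print("Environment root: ", sys.prefix)
print("Base interpreter: ", sys.base_prefix)
print("Virtual environment active:", in_virtual_environment())
```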

Table 1

Should You Use Anaconda?

| Anaconda | Vanilla Python |
|---|---|
| Pro: All data science packages in one package. | Con: Need to install packages one by one. |
| Pro: Includes Anaconda Navigator, a GUI for managing the environment. | Con: Need to manually manage everything (not a con for many people). |
| Con: Uses the Conda package manager, which is not as complete as pip and interferes with it. | Pro: Has fewer "moving parts." |
| Con: Anaconda is not that well integrated into Linux package managers. | Pro: Python is very well integrated into most Linux distributions, unlike in Windows or macOS. |

Most Linux variants already have Python installed by default, or as a dependency of another package. But normally, only the interpreter is installed by default, so you need to manually install the rest of the elements. Table 2 shows the names of the Python interpreter, the pip package manager, and the venv library for some of the most popular distributions. Note that in most distributions, you must explicitly indicate that you are installing Python 3, to avoid confusion with Python 2 (which reached its end of life in 2020, but is still a dependency of some software packages). When calling the interpreter, you must make sure that Python 3 is invoked, not Python 2. For instance, in Ubuntu, Debian, and openSUSE, you need to explicitly use the python3 command when invoking the interpreter. You can fix this in Ubuntu and Debian by installing the package python-is-python3. Another way to fix this problem is to use a virtual environment, where the default interpreter of the system is overridden.

Table 2

Package Names for Python Components

| Component | Ubuntu/Debian | Fedora | Arch | openSUSE |
|---|---|---|---|---|
| Basic environment | python3, python3-pip, python3-venv, python-is-python3 | python3, python3-pip | python, python-pip | python3, python3-pip |
| Jupyter | python3-jupyter (no JupyterLab) | python3-notebook (no JupyterLab) | jupyterlab | python3-jupyterlab |
| NumPy | python3-numpy | python3-numpy | python-numpy | python3-numpy |
| Pandas | python3-pandas | python3-pandas | python-pandas | python3-pandas |
| Scikit-learn | python3-sklearn | python3-scikit-learn | python-scikit-learn | python3-sklearn |
| Matplotlib | python3-matplotlib | python3-matplotlib | python-matplotlib | python3-matplotlib |
| Seaborn | python3-seaborn | python3-seaborn | python-seaborn | python3-seaborn |
| Keras | python3-keras | Not in default repositories | python-keras | python3-keras |
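Because the python and python3 commands may point to different interpreters (or to none at all), it can be useful to check what they resolve to before installing anything. The following sketch uses shutil.which from the standard library, which searches the PATH the same way the shell does. The function name resolve_commands and the list of command names are illustrative.

```python
import shutil

def resolve_commands(names=("python", "python3", "pip", "pip3")):
    """Map each command name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}

for name, path in resolve_commands().items():
    print(f"{name:8} -> {path or 'not found'}")
```

If python is reported as not found on Ubuntu or Debian, that is the situation the python-is-python3 package is meant to address.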

Once the basic packages are installed, you can proceed to create a virtual environment. Although this step is optional, it is highly recommended when working with many different projects in parallel, which is normal in the life of a data scientist. A virtual environment will have a root folder with its own Python interpreter and its own package selection. By default, the new Python interpreter does not see the packages installed for the system interpreter; to make them available, create the environment with the --system-site-packages option. Note that, if Python 2 and Python 3 coexist in the system, a virtual environment created with Python 3 can in that case only access the packages of the Python 3 system interpreter. Any package you install in the virtual environment will only be accessible within it. Also, you do not need to call python3 or pip3 explicitly, because within the virtual environment, only Python 3 is available.

To create a new virtual environment, open a terminal, cd into the directory where you want to create it, and run the following command:

# python3 -m venv new_environment_root

where new_environment_root can be any name. This command will only create the environment; you will not be able to use it until you activate it. For that, run the following command without changing the directory:

# source new_environment_root/bin/activate

This command will modify the terminal session to use the virtual environment's interpreter, along with the packages it can access. It will also change the behavior of the pip package manager, so packages are installed in the virtual environment. If you install a package that is also installed in the system, it will be overridden only within the virtual environment. This is ideal when you need a specific version of a package. When you are done working with the environment, change the terminal session back with:

# deactivate
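Virtual environments can also be created from Python itself through the venv module of the standard library, which is handy for automating project setup. The sketch below creates an environment in a temporary directory and reads back its pyvenv.cfg file, where venv records the settings of the environment. The function name create_environment is my own, and pip is deliberately not bootstrapped (with_pip=False) to keep the example fast.

```python
import tempfile
import venv
from pathlib import Path

def create_environment(root, system_packages=False):
    """Create a virtual environment in root and return the settings in pyvenv.cfg."""
    venv.create(root, system_site_packages=system_packages, with_pip=False)
    settings = {}
    for line in (Path(root) / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings

# Create a throwaway environment; it is deleted when the block ends
with tempfile.TemporaryDirectory() as tmp:
    settings = create_environment(Path(tmp) / "new_environment_root")

print(settings["include-system-site-packages"])  # false (isolated by default)
print(settings["version"])
```

Passing system_packages=True corresponds to the --system-site-packages option of the command-line tool and gives the environment access to the packages of the system interpreter.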

Setting Up Jupyter

JupyterLab and Jupyter Notebooks are very important components in the Python data analytics environment. Jupyter offers a completely different way of using Python, by providing an experimentation + coding + documentation workflow. In JupyterLab, you can create notebooks (Figure 3), which contain both code cells (that can be run in an interactive way and in no particular order) and documentation cells (which can contain Markdown, HTML, and LaTeX code). Because Jupyter lets you run different cells in a nonlinear fashion, it is a particularly useful environment for experimenting and doing exploratory data analytics. The output of code cells (in the form of text or graphics) is also embedded in the document. You can therefore produce comprehensive documents with interactive code and even graphics, which you can use to document the process of data analytics.

Figure 3: Example of a Jupyter notebook: Yes, this is a program, not a textbook!
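Under the hood, a notebook is a JSON document that lists its cells in order. The following sketch builds a tiny notebook with one documentation cell and one code cell using only the json module. It follows the version 4 notebook layout as I understand it (the official nbformat library is the authoritative way to create notebooks), and the helper names are my own.

```python
import json

def markdown_cell(text):
    """A documentation cell containing Markdown."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}

def code_cell(code):
    """A code cell that has not been run yet."""
    return {
        "cell_type": "code",
        "execution_count": None,  # set by Jupyter when the cell is run
        "metadata": {},
        "outputs": [],            # filled in by Jupyter with text or graphics
        "source": code,
    }

def build_notebook(cells):
    """Wrap a list of cells in a minimal version 4 notebook structure."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}

notebook = build_notebook([
    markdown_cell("# Parking lot occupancy\nA first exploratory look at the data."),
    code_cell("print('Hello, Jupyter')"),
])
notebook_text = json.dumps(notebook, indent=1)
print(notebook_text)
```

Saving notebook_text to a file with the .ipynb extension should give a document that JupyterLab can open, with the output of the code cell being stored in the same file once the cell is run.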

Once you have Python installed in the system and a virtual environment to work in, you can start an interactive session with the interpreter or run scripts right away; but in order to use a powerful data science environment, you must install some tools and packages. The first one is JupyterLab. You have two installation options: using pip or using the OS package manager. Each method has its own advantages and disadvantages.

To install Jupyter with pip, run:

# pip install jupyterlab

This command will download the latest version of JupyterLab and make it available to the current virtual environment (if none is activated, it will be installed in the system's Python installation). Once Jupyter is installed, you will have to upgrade manually from time to time (see the "Updating with Pip" box).

The other way of installing JupyterLab is using the OS package manager, which will make it available system-wide (including to virtual environments created with access to the system packages) and will subject it to the update cycle of the OS, but will not install the latest version. Especially for distributions that ship older packages (Debian Stable, Ubuntu LTS …), the installed version might be significantly outdated. Table 2 shows the name of the package for Jupyter in the main distributions.

Regardless of how you install it, to run JupyterLab, execute the following in a terminal while the virtual environment is active:

# jupyter-lab

This command will launch a server on port 8888 by default (Figure 4). The server shows a URL that can be opened in a browser to access the JupyterLab interface. The server will also attempt to launch the default browser. Note that the terminal window running the server must not be closed while working with Jupyter.

Figure 4: The Jupyter server terminal.
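If the browser does not open on its own, you can check whether anything is accepting connections on the server port. The sketch below attempts a TCP connection to the local machine with the socket module; the function name server_is_listening is my own. Keep in mind that a positive result only means that some program is listening on that port, not necessarily Jupyter.

```python
import socket

def server_is_listening(host="127.0.0.1", port=8888, timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0

print("Something is listening on port 8888:", server_is_listening())
```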

The web interface (Figure 5) lets you create new notebooks (the default document type for Jupyter) in the directory you select in the file browser. On the left side, you will find the file browser, where you can manage directories. On the right side is the main working area, where you will find a tabbed interface for the different notebooks. To create a new notebook, press the + button above the file browser, which will open a new tab that offers the possibility of creating several new objects. Within the notebook, you will see documentation cells, where you can write Markdown, LaTeX, or HTML expressions, and code cells, where you write Python code. The output of the code, whether text or graphics, appears right below the code cell.

Figure 5: JupyterLab user interface.
