Setting up a data analytics environment in Linux with Python
Setting Up the Data Science Libraries
The final step is to install the main libraries. Again, just like with JupyterLab, you have the option of installing them with pip:

# pip install numpy pandas matplotlib scikit-learn
or with the OS package manager (Table 2).
This setup leaves you with an environment that is ready both for exploratory data analytics, using JupyterLab, and for large batch processing, using the Python interpreter in script mode. Note that JupyterLab allows you to export a notebook to a Python script (see the command below). You can also distribute results and documentation as Jupyter notebooks, to report data analytics work to clients. There is one more step that some users might want to take, depending on the specific data analytics project, and that is to install additional Python libraries. PyPI [2] lists all the libraries available in pip. It is good practice to explore the package index before a big project and assess the available field-specific libraries, as well as their maturity and compliance with project requirements.
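For example, the export to a script can also be done from the command line with the nbconvert tool that ships with Jupyter (assuming a hypothetical notebook named analysis.ipynb):

jupyter nbconvert --to script analysis.ipynb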
Example
Suppose I want to understand the behavior of the traffic in a parking lot. I will obtain a profile that shows the hourly average occupancy of the parking lot based on data collected in several measurement campaigns, on different days, at different points of the city. First, I need to retrieve the raw data. For this example, I will use the Birmingham Parking dataset [3], which was used in research work on Smart Cities [4]. Download the full dataset using wget:

wget https://archive.ics.uci.edu/ml/machine-learning-databases/00482/dataset.zip
You can enter this command in a terminal window, or you can use the special character ! within Jupyter to run a command in an embedded terminal. Next, unzip the data with unzip.
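Entered in a notebook, the two steps might look like this minimal sketch (run from the directory where you want the data to land):

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00482/dataset.zip
!unzip dataset.zip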
Given the great variety of formats, processes, and policies of data collection, dataset retrieval will look different each time; sometimes you need to download a ZIP file, sometimes you might just go to a database, or other times you might need to retrieve an SD card from an embedded system. That's the beauty of data science: Each project starts and develops in a different way.
For this example, I will include the Pandas data analytics library. Pandas completely changes the data workflow in Python, making it much more intuitive and easier to use. Internally, Pandas uses the mechanisms provided by NumPy, thus inheriting its efficiency. One common scenario is to load the data into a Pandas object and perform the preliminary data analysis tasks on it (especially the selection, preprocessing, and transformation stages).
The first step is to read the contents of the file into a Pandas DataFrame, using the function read_csv() (Figure 6), to which you pass the mandatory filename parameter and an optional parse_dates parameter to force it to interpret one column as a date-time field. You can then visualize the contents loaded from the file with display().
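The loading step of Figure 6 can be sketched as follows (assuming the archive unpacks to a file named dataset.csv; adjust the name if your copy differs):

import pandas as pd

# Read the CSV file, parsing LastUpdated as a date-time column
df = pd.read_csv("dataset.csv", parse_dates=["LastUpdated"])

# Show the loaded contents (display() is built into Jupyter)
display(df)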
As you can see in Figure 6, the data appears in columns. The first column is SystemCodeNumber, an identifier of the parking lot. The second column (Capacity) shows the total capacity of the lot, and the third one (Occupancy) shows the current number of occupied parking spaces. Finally, LastUpdated shows the time and date of the last sensor reading.
The next step is to apply a selection process to take only the samples of the NIA North parking lot. For this step, use the .loc property of the Pandas DataFrame object, which allows you to filter the rows. The code shown in Figure 7 filters all the entries in df where the parking lot name is 'NIA North'.
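The filter of Figure 7 amounts to a one-liner (a sketch, continuing with the df loaded above):

# Keep only the rows belonging to the 'NIA North' parking lot
df = df.loc[df["SystemCodeNumber"] == "NIA North"]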
The .loc property is very powerful, allowing filtering with a great variety of conditions. More information can be found in the Pandas documentation [5].
You now have the data of interest in df. Nevertheless, data in the real world normally comes with errors and/or outliers. This dataset is no exception, as you can see in the Matplotlib plot shown in Figure 8.
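A quick look like the one in Figure 8 takes only a couple of Matplotlib calls (a minimal sketch, plotting the raw readings over time):

import matplotlib.pyplot as plt

# Plot raw occupancy against the timestamps to spot gaps and outliers
plt.plot(df["LastUpdated"], df["Occupancy"], ".")
plt.show()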
In Figure 8, the readings only come from isolated days when measurements were taken. Also, some occupancy values are lower than 0 (which is impossible), so I need to remove these wrong values. These errors will be different in each project, so you will normally have to spend some time in this phase thinking of possible errors and chasing them. It takes some experience to do this quickly, and you might still miss some errors and only detect them further down the road. When you do, you need to come back to this part of the study and add the appropriate mechanisms to detect them. Thanks to Jupyter's nonlinear workflow, you can do this easily by adding or editing cells in the appropriate places. Again, the .loc property will come in handy. In this case, I will replace the wrong values with None. If I knew a method to directly correct them, I could have used that instead. Next, I will fill in the missing values with some generic value. Pandas offers the .fillna() method for filling missing data. You can fill in a constant value, or use the last known value. I will use the last known value in this case, because a good estimate of a parking lot's occupancy is the occupancy it had previously. The code in Figure 9 shows the command for cleanup, and Figure 10 shows the corrected data.
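The cleanup of Figure 9 might be sketched like this (using .ffill(), the modern spelling of forward fill; older Pandas versions write it as .fillna(method="ffill")):

# Mark impossible readings (negative occupancy) as missing
df.loc[df["Occupancy"] < 0, "Occupancy"] = None

# Fill the gaps with the last known value (forward fill)
df["Occupancy"] = df["Occupancy"].ffill()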
Next is the transformation step. Start by thinking about what the modeling process (the next step) requires. Because you want to do an hourly average of the occupancy expressed as a proportion, you'll need two transformations. First, you need to extract the hour from the date-time field, as shown in Figure 11. With this, you can create a new column that only contains the hour. Next, you need to compute a new column that expresses the occupancy as a proportion, instead of an absolute value (Figure 12). Figure 12 also shows the dataset with the new columns.
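Both transformations can be sketched as follows (the Hour column name is used later in the article; OccupancyRatio is a hypothetical name for the new proportion column):

# Extract the hour from the date-time field (Figure 11)
df["Hour"] = df["LastUpdated"].dt.hour

# Express occupancy as a proportion of the lot's capacity (Figure 12)
df["OccupancyRatio"] = df["Occupancy"] / df["Capacity"]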
To build the model in the data mining step, you actually only need the last two columns. Start by taking all the samples for each hour, and then calculate the average of the occupancy. In other words, group by the Hour column and calculate the mean. Grouping is such a common task that Pandas offers the groupby shorthand (Figure 13).
groupby will result in a new data frame, model, indexed with the unique values of Hour and containing the average value of all the other numerical fields grouped by Hour.
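The grouping of Figure 13 boils down to a single line (numeric_only restricts the mean to numerical columns):

# Average all numerical columns for each hour of the day
model = df.groupby("Hour").mean(numeric_only=True)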
In this simple example, the data mining process was intentionally trivial. In some cases, the grouping and averaging operation can even be considered part of the transformation step. Data mining can be very complex, including ML/AI processes, different kinds of numerical methods, and other advanced techniques. But there is one secret that all data analysts learn sooner or later: Most of the hard work of the data analytics process is done before the data mining step. You can now use the model to plot a chart of the parking lot's occupancy as a percentage at different hours of the day (Figure 14). More complex projects might involve live charts or detailed reports that are sent automatically by email to interested parties.
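A chart along the lines of Figure 14 can be drawn from the model (a sketch, reusing the hypothetical OccupancyRatio column from above):

import matplotlib.pyplot as plt

# Plot the hourly occupancy profile as a percentage
plt.bar(model.index, model["OccupancyRatio"] * 100)
plt.xlabel("Hour of day")
plt.ylabel("Average occupancy (%)")
plt.show()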
Conclusions
This article has been a primer on data science. I described how to take the KDD model as the outline for a typical workflow in a data analytics project. You also learned about the main Python libraries used with data science projects. Finally, I reviewed how to get the environment up and running, and I presented a simple example showing how to use it. This brief introduction is just the beginning. I'll leave it to you to discover how to apply the rich Python data analytics ecosystem to the problems you encounter in your own field of expertise.
Infos
[1] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Communications of the ACM, 39(11), 1996, pp. 27-34
[2] PyPI: https://pypi.org/
[3] UCI Machine Learning Repository, Parking Birmingham dataset: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham
[4] Stolfi, Daniel H., Enrique Alba, and Xin Yao, "Predicting Car Park Occupancy Rates in Smart Cities." In: Smart Cities: Second International Conference, Smart-CT 2017, Málaga, Spain, June 14-16, 2017, pp. 107-117
[5] Pandas DataFrame.loc property: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html