A Bash DIY data extraction tool

Data Collector

© Lead Image © Mike Espenhain, 123RF.com

Article from Issue 234/2020

Author(s): Razvan T. Coloja

With some simple Bash commands, you can gather, parse, and filter text data into CSV files ready for your favorite statistical application.

If your research involves pulling large amounts of text data from the Internet, you can gather and process that data from the command line with a few simple Bash commands and turn it into a CSV file for your favorite statistical application, such as SPSS, R, or a MySQL table. In this article, I will show how to accomplish this with a project that examines the Romanian university dropout rate.

The data I need comes from 97 universities. For confidentiality reasons, chances are slim that I can get access to each university's database, but I can obtain that information legally from their website. (However keep in mind that many websites have licenses that prohibit web scraping. This article does not attempt to address copyright and other legal issues related to this practice. See the site's permission page and consult the applicable laws for your jurisdiction.) To gather my data, I could search for the word abandon (Romanian for dropout) on each of the 97 websites, but that would be tedious. Furthermore, each website may use a different content management system (CMS), so my search might not return the desired results. Instead, an easier option is to download all 97 websites in their entirety and recursively search their text content on my local hard drive. Linux lets you do this with the command shown in Listing 1.

Retrieving Data

In Listing 1, wget is a command-line utility in Linux and other POSIX-compliant operating systems used to download files from servers. It can be used as a mass downloader, and you can specify exactly which type of files you want downloaded and which type of files wget should disregard.

[...]

Use Express-Checkout link below to read the full article (PDF).

Buy this article as PDF

Express-Checkout as PDF

Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES

Print Issues

Digital Issues

SUBSCRIPTIONS

Print Subs

Digisubs

TABLET & SMARTPHONE APPS

US / Canada

UK / Australia

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News

MX Linux 25.1 Features Dual Init System ISO

Desktop , MX Linux , Systemd

The latest release of MX Linux caters to lovers of two different init systems and even offers instructions on how to transition.
Photoshop on Linux?

graphics , Linux , Software

A developer has patched Wine so that it'll run specific versions of Photoshop that depend on Adobe Creative Cloud.
Linux Mint 22.3 Now Available with New Tools

Administration , Linux mint , Tools

Linux Mint 22.3 has been released with a pair of new tools for system admins and some pretty cool new features.
New Linux Malware Targets Cloud-Based Linux Installations

Cloud , Linux , malware

VoidLink, a new Linux malware, should be of real concern because of its stealth and customization.
Say Goodbye to Middle-Mouse Paste

Firefox , Gnome , Security

Both Gnome and Firefox have proposed getting rid of a long-time favorite Linux feature.
Manjaro 26.0 Primary Desktop Environments Default to Wayland

Desktop , Manjaro Linux , Wayland

If you want to stick with X.Org, you'll be limited to the desktop environments you can choose.
Mozilla Plans to AI-ify Firefox

Artificial Inte... , Firefox , privacy

With a new CEO in control, Mozilla is doubling down on a strategy of trust, all the while leaning into AI.
Gnome Says No to AI-Generated Extensions

Artificial Inte... , Gnome , LLM

If you're a developer wanting to create a new Gnome extension, you'd best set aside that AI code generator, because the extension team will have none of that.
Parrot OS Switches to KDE Plasma Desktop

Linux , Parrot OS , Plasma

Yet another distro is making the move to the KDE Plasma desktop.
TUXEDO Announces Gemini 17

Hardware , laptop , Linux

TUXEDO Computers has released the fourth generation of its Gemini laptop with plenty of updates.

A Bash DIY data extraction tool

Data Collector

Retrieving Data

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

News

MX Linux 25.1 Features Dual Init System ISO

Photoshop on Linux?

Linux Mint 22.3 Now Available with New Tools

New Linux Malware Targets Cloud-Based Linux Installations

Say Goodbye to Middle-Mouse Paste

Manjaro 26.0 Primary Desktop Environments Default to Wayland

Mozilla Plans to AI-ify Firefox

Gnome Says No to AI-Generated Extensions

Parrot OS Switches to KDE Plasma Desktop

TUXEDO Announces Gemini 17

A Bash DIY data extraction tool

Data Collector

Retrieving Data

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters Find Linux and Open Source Jobs Subscribe to our ADMIN Newsletters

Support Our Work

News

Tag Cloud

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters