A Bash DIY data extraction tool
Putting It All Together
You now have the data you need for your analysis. Rather than typing each command individually, you can combine the commands above in a single Bash script, as shown in Listing 5.
Listing 5
Complete Bash Script
#!/bin/bash
# download the websites specified in addresses.txt one by one
wget -cv --progress=bar --connect-timeout=30 --force-directories --ignore-length -r -l 7 --convert-links --waitretry=61 -R gif,jpg,png,svg,pdf $(<addresses.txt)
# recursively look for the word "abandon" and its variations and print the line before and after each match so we can take a quick look at the context
grep -r -A1 -B1 "abandon" * > results.txt
# manual step: find every line that starts with the "--" delimiter and replace it with "12345678" using your favorite text editor, then save the result as results2.txt
# list the first line after "12345678"
grep -A 1 -F 12345678 results2.txt > 1stline.txt
# delete everything from the first "<" character onward
sed 's/<.*//' 1stline.txt > 1stlinefiltered.txt
# keep only the lines that occur exactly once (uniq -u drops any line that has duplicates)
sort 1stlinefiltered.txt | uniq -u > address_filtered.txt
# remove the last character from each line (.html-)
sed 's/.$//' address_filtered.txt > list_final_address.txt
# create a CSV file containing the web addresses
cat list_final_address.txt > address.csv
# replace "12345678" with "--" in address.csv because "--" might appear in the URL
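One detail of the sort | uniq -u step in Listing 5 is worth checking against your own data: uniq -u keeps only lines that occur exactly once, whereas sort -u keeps one copy of every distinct line. The following is a minimal sketch of the difference, not part of the original script; sample_lines.txt and its contents are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sample data: one path appears twice, as it would
# if the same page produced two separate matches.
cat > sample_lines.txt <<'EOF'
site/a.html-
site/b.html-
site/a.html-
EOF

# uniq -u drops every line that has a duplicate, so a.html- disappears entirely.
only_unique=$(sort sample_lines.txt | uniq -u)

# sort -u keeps exactly one copy of each distinct line.
deduplicated=$(sort -u sample_lines.txt)

echo "uniq -u:"
echo "$only_unique"
echo "sort -u:"
echo "$deduplicated"
```

If a page with several matches should still appear once in the final list, sort -u may be the better fit. Bear in mind that it also keeps one copy of any repeated marker lines, which would then need to be filtered out separately.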
Then add the addresses to addresses.txt, one per line, and save the file in the same folder as the Bash script in Listing 5. Make the script executable with chmod +x scriptname.sh, then launch it with ./scriptname.sh.
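As written, Listing 5 relies on a text editor to turn the "--" delimiter lines in results.txt into the "12345678" marker before grep reads results2.txt. If you would rather run the script unattended, that step could be handled with sed. The sketch below assumes the delimiter lines consist of exactly "--", which is what GNU grep prints between context groups by default. The function name mark_delimiters and the demo file names are hypothetical.

```shell
#!/bin/bash
# Sketch: replace grep's "--" context separator lines with the marker 12345678.
# mark_delimiters is a hypothetical helper name, not part of the original script.
mark_delimiters() {
    # Only lines consisting of exactly "--" are rewritten, so a "--"
    # inside a URL or inside page text is left untouched.
    sed 's/^--$/12345678/' "$1" > "$2"
}

# Small made-up stand-in for results.txt so the sketch runs on its own.
cat > demo_results.txt <<'EOF'
a.html-<p>one</p>
a.html:<p>abandon</p>
--
b.html-<p>x--y</p>
EOF

mark_delimiters demo_results.txt demo_results2.txt
cat demo_results2.txt
```

In the full script, the call would be mark_delimiters results.txt results2.txt, placed between the first grep command and the second.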
With a few simple Bash commands, you have a DIY text data collection tool that delivers a CSV file for use in your favorite statistical application.