A Bash DIY data extraction tool
Putting It All Together
You now have the data you need for your analysis. To save typing each command individually, you can put the above commands into a single Bash script, as shown in Listing 5.
Listing 5: Complete Bash Script
#!/bin/bash
# download the websites specified in addresses.txt one by one
wget -cv --progress=bar --connect-timeout=30 --force-directories --ignore-length -r -l 7 --convert-links --waitretry=61 -R gif,jpg,png,svg,pdf $(<addresses.txt)
# recursively look for the word "abandon" and its variations and print the line before and after each match so we can take a quick look at the context
grep -r -A1 -B1 "abandon" * > results.txt
# manual step: find every line that starts with the "--" delimiter, replace it with "12345678" in your favorite text editor, and save the result as results2.txt
# list the first line after "12345678"
grep -A 1 -F 12345678 results2.txt > 1stline.txt
# delete everything after the "<" character
sed 's/<.*//' 1stline.txt > 1stlinefiltered.txt
# list every line only once, without its duplicates
sort 1stlinefiltered.txt | uniq -u > address_filtered.txt
# remove the last character from each line (the trailing "-" after ".html")
sed 's/.$//' address_filtered.txt > list_final_address.txt
# create a CSV file containing the web addresses
cat list_final_address.txt > address.csv
# manual step: replace "12345678" with "--" in address.csv because "--" might appear in a URL
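The two replacement steps in the listing are done by hand in a text editor. If you would rather keep everything inside the script, a minimal sketch of the same idea with sed could look like the two lines below. This assumes the delimiter lines written by grep contain nothing but "--", and address_final.csv is just a made-up name for the restored output; treat it as an optional shortcut, not part of the original workflow.

# swap the "--" delimiter lines for the placeholder (replaces the first manual step)
sed 's/^--$/12345678/' results.txt > results2.txt
# after the remaining steps have run, restore "--" in the final CSV (replaces the second manual step)
sed 's/12345678/--/g' address.csv > address_final.csv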
To use the script, add the addresses to addresses.txt, each on its own line, and save the file in the same folder as the Bash script in Listing 5. Make the script executable with chmod +x scriptname.sh and launch it with ./scriptname.sh.
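The addresses.txt file itself is just a plain list of start URLs, for example (these URLs are placeholders, not taken from the article):

https://www.example.com/
https://www.example.org/reports/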
With a few simple Bash commands, you have a DIY text data collection tool that delivers a CSV file for use in your favorite statistical application.