A Bash DIY data extraction tool

Putting It All Together

You now have the data you need to do your desired analysis. To save typing each command individually, you can put the above commands into a single Bash script as shown in Listing 5.

Listing 5

Complete Bash Script

#!/bin/bash
# download the websites specified in addresses.txt one by one
wget -cv --progress=bar --connect-timeout=30 --force-directories --ignore-length -r -l 7 --convert-links --waitretry=61 -R gif,jpg,png,svg,pdf $(<addresses.txt)
# recursively search for the word "abandon" and its variations, printing the line before and after each match so we can take a quick look at the context
grep -r -A1 -B1 "abandon" * > results.txt
# replace every line that starts with the "--" delimiter with "12345678" (sed stands in here for the manual text editor step)
sed 's/^--/12345678/' results.txt > results2.txt
# list each "12345678" line together with the first line after it
grep -A 1 -F 12345678 results2.txt > 1stline.txt
# delete everything from the "<" character onward
sed 's/<.*//' 1stline.txt > 1stlinefiltered.txt
# keep only the lines that occur exactly once (uniq -u drops every line that has a duplicate)
sort 1stlinefiltered.txt | uniq -u > address_filtered.txt
# remove the last character from each line (.html-)
sed 's/.$//' address_filtered.txt > list_final_address.txt
# create a CSV file containing the web addresses
cat list_final_address.txt > address.csv
# replace "12345678" with "--" in address.csv because "--" might appear in the URL
sed -i 's/12345678/--/g' address.csv
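
The filtering steps in the listing are easier to follow on a small sample. The sketch below is not part of the original script: the three sample lines are made up and merely imitate the "file-content" form that grep prints for context lines. It runs the same sed filters and then compares sort | uniq -u, as used in the listing, with sort -u.

```shell
#!/bin/bash
# Made-up sample of what 1stline.txt might contain: grep context lines
# ("file-content"), where a.html shows up twice and b.html once.
sample='site/a.html-<p>first</p>
site/a.html-<p>second</p>
site/b.html-<p>third</p>'

# Same filter as the listing: drop everything from the first "<" onward.
names=$(printf '%s\n' "$sample" | sed 's/<.*//')

# uniq -u keeps only lines that occur exactly once, so a.html disappears.
# The final sed strips the trailing "-" that grep put after the file name.
only_once=$(printf '%s\n' "$names" | sort | uniq -u | sed 's/.$//')

# sort -u keeps one copy of every line, so a.html survives.
one_each=$(printf '%s\n' "$names" | sort -u | sed 's/.$//')

printf 'uniq -u:\n%s\n' "$only_once"
printf 'sort -u:\n%s\n' "$one_each"
```

With this sample, the uniq -u variant lists only site/b.html, whereas sort -u lists both site/a.html and site/b.html. Which behavior you want depends on whether pages with several matches should stay in the final list.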

Add the addresses to addresses.txt, one per line, and save the file in the same folder as the Bash script in Listing 5. Make the script executable with

chmod +x scriptname.sh

Then launch it with ./scriptname.sh.
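
As a minimal sketch of the setup step, addresses.txt could be created from the command line as shown here. The two URLs are placeholders chosen for illustration, not addresses from the article, and the line count is only a quick sanity check before you start the download.

```shell
#!/bin/bash
# The two URLs are placeholders; replace them with the sites you want to download.
cat > addresses.txt <<'EOF'
https://www.example.com
https://www.example.org
EOF

# Count the non-empty lines, i.e. the addresses that $(<addresses.txt) hands to wget.
count=$(grep -c . addresses.txt)
echo "addresses.txt lists $count addresses"
```

Run it in the folder that holds the script from Listing 5 so that wget finds the file.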

With a few simple Bash commands, you have a DIY text data collection tool that delivers a CSV file for use in your favorite statistical application.

The Author

Razvan T. Coloja is a psychologist currently finishing his Bachelor's degree and PhD candidacy in sociology. He has been a passionate Linux user and OSS supporter since 1998 and has an interest in SBC clusters, CircuitPython, and machine learning.
