A Bash DIY data extraction tool
Filtering the Data
Now that I have gathered the data, I need to recursively search the downloaded files for occurrences of the word abandon. For this, I will use another Linux command-line utility, grep. In the directory containing all the folders with the downloaded websites, launch:
grep -r -A1 -B1 "abandon" * > results.txt
The -r flag tells grep to search every file and subfolder below the directory from which the command was launched. Because grep matches substrings, it will find not only abandon itself, but also its variations formed with prefixes or suffixes.
However, I am only interested in school dropout and how many times it is mentioned on each university's website. For this, I need to see the word's context and eliminate the unrelated instances. The -A1 and -B1 flags tell grep to also print the line after and the line before each occurrence of the word. With that much output, the results scroll past too quickly to read in the terminal, so, because I want to take a closer look at them, I redirect everything to a text file named results.txt. As a side note, the * character specifies that grep should search all the downloaded files, whether they are in a human-readable format or not.
After a bit of processing, the result is a text file (results.txt) that contains the complete path to each file containing the word abandon or one of its variations, plus the surrounding context. The individual matches are delimited by lines consisting of the characters --. I need to eliminate these and replace them with something else, because (as you will see later) the subsequent commands tend to interpret -- as the start of command-line options. You can either open the results.txt file in a text editor and do a search and replace, or use a Linux command to do it automatically. Keep in mind that in doing so you are also replacing any -- characters that might be present in legitimate URLs in the file; I will fix this later.
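If you go the command-line route, one way to do the replacement automatically is a sed in-place edit – just a sketch, using the same marker string I settle on in the next step:
sed -i 's/--/12345678/g' results.txt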
I chose to replace -- with 12345678 to get rid of all the delimiting lines in the file. I did this because the next command lists the first line after each replaced delimiter – the line containing the local address of the matching file:
grep -A 1 -F 12345678 results.txt > 1stline.txt
The above command prints the line immediately following each line that contains 12345678 and writes the result to another text file called 1stline.txt. Now I have an almost clean file containing just the addresses of the files that contain the word abandon and its variations, ready to be inserted into a statistical program. However, part of the accompanying context text is still stuck to the end of some of the addresses, and I need to get rid of it using the sed editor:
sed 's/<.*//' 1stline.txt > 1stlinefiltered.txt
This command deletes everything from the first < character onward, leaving only the addresses, without the beginning of the context text that appeared on the same line. The result is written to a new file called 1stlinefiltered.txt.
The word abandon and its variations can appear several times in the same file, so some of the address lines are duplicates. I don't need these, since I only want to work with distinct files containing the word. To remove duplicate lines from 1stlinefiltered.txt, I will use sort and uniq:
sort 1stlinefiltered.txt | uniq -u > address_filtered.txt
The command-line utility sort sorts the file's lines, and uniq -u keeps only the lines that occur exactly once. Everything is output to a new file called address_filtered.txt. Because some of the addresses still carry a stray - character at the end (left over from the grep output), I must clean address_filtered.txt a little more. To do this, I will use sed again to delete the last character of each line in the file:
sed 's/.$//' address_filtered.txt > list_final_address.txt
This time, everything is output to list_final_address.txt – a list of addresses pointing to pages that contain the word abandon. Since I previously replaced -- with 12345678 in order to correctly display the first line below each delimiter, any web addresses that contained -- in their URLs also had their file paths changed. All I need to do now is take the list_final_address.txt file and do a search and replace of 12345678 back to --.
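As before, you can do this in a text editor or with a quick sed one-liner – a sketch of the same in-place approach:
sed -i 's/12345678/--/g' list_final_address.txt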
Now list_final_address.txt is clean, and all I have to do is convert it to CSV format for easy import into a statistical application:
cat list_final_address.txt > address.csv
This final file, address.csv, gives you a column of web addresses in a statistical application such as SPSS, each one corresponding to a file that contains the word abandon. In addition to the address column, I also need a column with the corresponding context for each occurrence. The file results.txt still contains this information, and I can use it to extract what I need, with either Bash or a Python script.
Bash Extraction
With Bash, you use grep again (Listing 3). The first command in Listing 3 takes results.txt, searches it for the word abandon and all its variations, and uses -o to print only the match itself plus up to 50 characters on each side of the word. One hundred characters with the keyword in the middle should be enough to tell whether the context is relevant. Everything is output to a new text file called abandon50.txt. The second command in Listing 3 then trims the duplicate lines from this file.
Listing 3
Bash Extraction
grep -E -o ".{0,50}abandon.{0,50}" results.txt > abandon50.txt
sort abandon50.txt | uniq -u > abandon50_filtered.txt
Success: The resulting file, abandon50_filtered.txt, contains a column of text corresponding to each address in address.csv. The problem with this approach is that each server I initially mirrored with wget uses a different CMS. Some university servers might run open source solutions, such as WordPress or Joomla, while others use custom solutions. Consequently, no two sites are structured the same way.
In addition, most sites contain special characters in their URLs (e.g., $ and %), and grep has trouble with these characters in its output. A manual search and replace for sequences such as %20 should fix the problem. Alternatively, you can use another Linux command-line utility, html2text, which strips the HTML tags from the files, leaving only clean, human-readable text behind. Once this is done, grep should have no problem performing correctly.
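For instance, the mirrored pages could be converted in bulk before running grep – a rough sketch that assumes the downloaded files end in .html:
find . -name "*.html" -exec sh -c 'html2text "$1" > "${1%.html}.txt"' _ {} \;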
Python Extraction
If you want to use Python to extract the occurrences of the word abandon surrounded by 50 characters on each side, you can use the Python script in Listing 4. This script will also filter out special characters from URL addresses.
Put the script from Listing 4 in a text file and name it needleinhaystack.py. Make it executable in Linux with the following command:
chmod +x needleinhaystack.py
Listing 4
Python Extraction Script
#!/usr/bin/python

"""Custom work for Razvan T. Coloja, placed in the public domain by the author - Radu-Eosif Mihailescu.
"""

import sys

MAGIC_WORD = 'abandon'

def main(argv):
    with open(argv[1], 'r') as faddr:
        addresses = set(l.rstrip() for l in faddr)
    with open(argv[2], 'r') as fres:
        the_text = set(l.rstrip() for l in fres)

    for address in addresses:
        for line in the_text:
            if line.startswith(address):
                where_found = line.find(MAGIC_WORD)
                if where_found != -1:
                    if where_found > 50:
                        start_excerpt = where_found - 50
                    else:
                        start_excerpt = 0
                    print '"%s","%s"' % (
                        address,
                        line[start_excerpt:where_found + len(MAGIC_WORD) + 50])

if __name__ == '__main__':
    main(sys.argv)
You need to have Python installed to make this script work. The script will compare the file containing the addresses with the one containing both the addresses and the associated context, trim context to about 100 characters with abandon in the middle, and structure everything into two columns that are ready to be imported into a statistics application.
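Note that the listing uses Python 2 print syntax, so the shebang must resolve to a Python 2 interpreter. A possible invocation – the addresses file first, then results.txt, with the output name being only an example – looks like this:
./needleinhaystack.py list_final_address.txt results.txt > abandon_context.csv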