A Bash DIY data extraction tool
Filtering the Data
Now that I have gathered the data, I need to recursively search the downloaded files for occurrences of the word abandon. For this, I will use another Linux command-line utility, grep. In the directory containing all the folders with the downloaded websites, launch:
grep -r -A1 -B1 "abandon" * > results.txt
The -r flag tells grep to search every file and subfolder below the directory from which the command was launched. Because grep matches the string anywhere in a line, it will find not only abandon itself but also its variations, including forms with suffixes or prefixes.
However, I am only interested in school dropout and how many times it is mentioned on each university's website. For this, I need to see the word's context and eliminate the unrelated instances. The -A1 and -B1 flags make grep also print the line after and the line before each occurrence of the word. All of this context scrolls past far too quickly in the terminal, and since I want to take a closer look at the results, I redirect everything to a text file named results.txt. As a side note, the * character tells grep to search all of the downloaded files, whether they are in human-readable format or not.
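To give you an idea of what to expect, the raw grep output looks roughly like the following (the paths shown here are hypothetical). Matching lines carry a colon after the file path, the -A1/-B1 context lines carry a dash, and the groups of matches are separated by --:
www.example-university.edu/about.html-<div class="content">
www.example-university.edu/about.html:support for students who abandon their studies
www.example-university.edu/about.html-<p>Contact the dean's office</p>
--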
After a bit of processing, the result is a text file (results.txt) that contains the complete path to each file containing the word abandon and its variations, plus the surrounding context. The groups of matches are delimited by the characters --. I need to eliminate these and replace them with something else, because (as you will see later) the next commands tend to interpret -- as the start of command options. You can either open the results.txt file in a text editor and do a search and replace, or use a Linux command to do it automatically, as shown below. Keep in mind that in doing so you are also replacing the -- characters that might be present in legitimate URLs in the file; I will fix this later.
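If you go the command-line route, a minimal sketch using GNU sed's -i option to edit results.txt in place might look like this (the placeholder value is explained next):
# GNU sed: replace every -- delimiter with the placeholder 12345678, in place
sed -i 's/--/12345678/g' results.txt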
I chose to replace -- with 12345678 to get rid of all the delimiting lines in the file. I did this because the next command lists the first line after this replaced delimiter – the line containing the local HTTP address of the file:
grep -A 1 -F 12345678 results.txt > 1stline.txt
The above command prints the line immediately following each line that contains 12345678 and outputs the result to another text file called 1stline.txt. Now I have an almost clean file containing just the addresses of the files that contain the word abandon and all its variations, ready to be inserted into a statistical program. However, grep appends the beginning of the accompanying context text to the end of some of the addresses, and I need to get rid of that using the sed editor:
sed 's/<.*//' 1stline.txt > 1stlinefiltered.txt
This command deletes everything from the first < character onward, leaving only the addresses without the beginning of the context text that appears on the same line. The resulting output goes into a new file called 1stlinefiltered.txt.
Because abandon and its variations can appear several times in the same file, some of the addresses are duplicates. I don't need these, since I only want to work with distinct instances of the word located in distinct files. To delete duplicate lines from 1stlinefiltered.txt, I will use sort and uniq:
sort 1stlinefiltered.txt | uniq -u > address_filtered.txt
The command-line utility sort sorts the file's lines, and uniq -u prints only the lines that are not repeated, so every duplicated address is dropped entirely (if you would rather keep one copy of each repeated address, use sort -u instead). Everything is output to a new file called address_filtered.txt. Because grep separates each file path from its context text with a - character, the addresses sometimes end with a stray -, and I must further clean address_filtered.txt. To do this, I will use sed again to delete the last character of each line in the file:
sed 's/.$//' address_filtered.txt > list_final_address.txt
This time, everything is output to list_final_address.txt – a list of URLs pointing to pages that contain the word abandon. Since I previously replaced -- with 12345678 in order to correctly display the first line below each delimiter, some of the web addresses that contained -- in their URL also had their file path changed. All I need to do is take the list_final_address.txt file and do a search and replace of 12345678 back to --.
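As before, this does not have to be done by hand; a minimal GNU sed sketch that edits list_final_address.txt in place would be:
# GNU sed: restore the original -- sequences inside the URLs, in place
sed -i 's/12345678/--/g' list_final_address.txt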
Now list_final_address.txt is clean, and all I have to do is convert it to CSV format for easy importing into a statistical application:
cat list_final_address.txt > address.csv
This final file, address.csv, will give you a column of web addresses in a statistical application like SPSS, each address representing the location of a file containing the word abandon. In addition to the web address column, I also need a column with abandon's corresponding context. The file results.txt still contains this information, and I can use it to extract what I need. You can do this with Bash or a Python script.
Bash Extraction
With Bash, you use grep again (Listing 3). The first command in Listing 3 takes results.txt, searches it for the word abandon and all its variations, and, thanks to the -o flag, prints only the matching text plus up to 50 characters on each side of the word. One hundred characters with the keyword in the middle should suffice to tell whether the context is relevant. Everything is output to a new text file called abandon50.txt. The second command in Listing 3 then trims the duplicate lines out of this file.
Listing 3
Bash Extraction
grep -E -o ".{0,50}abandon.{0,50}" results.txt > abandon50.txt
sort abandon50.txt | uniq -u > abandon50_filtered.txt
Success: The resulting file, abandon50_filtered.txt, contains a column of text corresponding to each address in address.csv. The problem with this approach is that each server I initially crawled with wget uses a different CMS. Some university servers might use open source solutions, such as WordPress or Joomla, while others use custom solutions. Consequently, no two sites are the same.
In addition, most sites contain special characters in their URLs (e.g., $ and %), and grep output has difficulties with these characters. A manual search and replace for special characters such as %20 should fix the problem. Alternatively, you can use another Linux command-line utility, html2text, which strips the HTML tags from the files, leaving only clean, human-readable text behind (see the example below). Once this is done, grep should have no problem performing correctly.
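As a rough sketch, assuming the mirrored pages were saved with an .html extension and that you want to keep the converted text files next to the originals, you could run html2text over every downloaded page like this:
# assumes the wget mirror saved pages as *.html; writes page.html.txt next to each page
find . -name "*.html" -exec sh -c 'html2text "$1" > "$1.txt"' _ {} \;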
Python Extraction
If you want to use Python to extract the occurrences of the word abandon surrounded by 50 characters of context on each side, you can use the Python script in Listing 4. This approach also sidesteps the problems that special characters in the URL addresses cause for grep.
Put the script from Listing 4 in a text file and name it needleinhaystack.py. Make it executable in Linux with the following command:
chmod +x needleinhaystack.py
Listing 4
Python Extraction Script
#!/usr/bin/python

"""Custom work for Razvan T. Coloja, placed in the public domain by the author - Radu-Eosif Mihailescu.
"""

import sys

MAGIC_WORD = 'abandon'

def main(argv):
    # First argument: the cleaned list of addresses, one per line.
    with open(argv[1], 'r') as faddr:
        addresses = set(l.rstrip() for l in faddr)
    # Second argument: the grep results containing the addresses plus context.
    with open(argv[2], 'r') as fres:
        the_text = set(l.rstrip() for l in fres)

    for address in addresses:
        for line in the_text:
            if line.startswith(address):
                where_found = line.find(MAGIC_WORD)
                if where_found != -1:
                    # Keep roughly 50 characters on each side of the keyword.
                    if where_found > 50:
                        start_excerpt = where_found - 50
                    else:
                        start_excerpt = 0
                    print('"%s","%s"' % (
                        address,
                        line[start_excerpt:where_found + len(MAGIC_WORD) + 50]))

if __name__ == '__main__':
    main(sys.argv)
You need to have Python installed to make this script work. The script will compare the file containing the addresses with the one containing both the addresses and the associated context, trim context to about 100 characters with abandon in the middle, and structure everything into two columns that are ready to be imported into a statistics application.
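Judging by how the script reads its arguments (the cleaned address list first, then the grep results), a typical invocation could look like the following; the output file name context.csv is only an example:
./needleinhaystack.py list_final_address.txt results.txt > context.csv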