Simple web scraping with Bash

Ski Report

© Photo by Nicolai Berntsen on Unsplash

Article from Issue 262/2022
Author(s): Pete Metcalfe

With one line of Bash code, Pete scrapes the web and builds a desktop notification app to get the daily snow report.

While recently doing a small project, I was amazed by how much web scraping I could do with just one line of Bash. I used the text-based Lynx browser [1] and then piped the output to a grep search. Figure 1 shows the one-line Bash example that scrapes the current snow depth from the Sunshine Village Snow Forecast web page.

Figure 1: One line of Bash code finds the web text for the current snow depth.
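For readers without access to the figure, the one-liner is presumably of the following shape. This is a sketch inferred from the URL and the "Top Lift:" search string used in the notification script later in this article, not a transcription of Figure 1. The function name snow_base is made up for illustration.

```shell
# Assumed form of the one-line scrape, wrapped in a function for reuse.
# The URL and the search string are taken from the skitrip.sh script.
snow_base() {
    lynx -dump "https://www.snow-forecast.com/resorts/Sunshine/6day/mid" | grep "Top Lift:"
}

# Run it with: snow_base   (needs lynx and network access)
```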

In this article, I will introduce some techniques to easily scrape web pages, and then I will create a desktop notification script that provides the daily snow forecast.

The Lynx Text Browser

For my Bash web scraping, I started out by looking at using command-line tools such as curl [2] with the html2text [3] utility. This technique definitely works, but I found that using the Lynx browser offers a one-step solution with a slightly cleaner text output.

To install Lynx on Raspbian/Debian/Ubuntu, use:

sudo apt install lynx

The Lynx -dump option will output a web page to text with HTML tags, HTML encoding, and JavaScript removed. Figure 2 shows that a Lynx dump can greatly clean up the original web page and make searching considerably easier.

Figure 2: Lynx output removes HTML tags, encoding, and JavaScript, making it easier to search.
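The two approaches mentioned above could be wrapped in one helper, roughly as follows. This is a minimal sketch: the function name fetch_text is hypothetical, and the -nolist option (which omits the list of link URLs that Lynx normally appends to a dump) is an extra that the article's own commands do not use.

```shell
# fetch_text URL: print a plain-text rendering of a web page.
# Hypothetical helper; prefers lynx and falls back to curl + html2text.
fetch_text() {
    if command -v lynx >/dev/null 2>&1; then
        # -dump renders the page to stdout; -nolist drops the trailing link list
        lynx -dump -nolist "$1"
    else
        # Two-step alternative: download quietly (following redirects),
        # then strip the markup
        curl -sL "$1" | html2text
    fi
}

# Example (needs network access):
#   fetch_text "https://www.snow-forecast.com/resorts/Sunshine/6day/mid" | grep "Top Lift:"
```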

Sometimes a simple grep search is all that you need. However, many cases require some text manipulation. The good news is that Bash and the standard command-line utilities offer a nice selection of line and string manipulation tools.

The example shown in Figure 3 uses line manipulation to find the current weather in Key West, Florida. grep searches for the string "As of", and the -A 3 option returns the matching line plus the three lines that follow it. If required, you can remove the "As of" line by piping the result through the tail command.

Figure 3: Using Bash line manipulation to extract web data.
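The line manipulation described above can be tried without a network connection. In the sketch below, the page variable holds made-up sample text standing in for what lynx -dump might return. The real weather page will look different.

```shell
# Made-up sample standing in for the text that `lynx -dump` might return
page='   Key West, FL Weather
   As of 9:53 am EST
   75 F
   Partly Cloudy
   Day 78 F / Night 70 F
   Radar'

# The matching line plus the three lines after it
block=$(printf '%s\n' "$page" | grep -A 3 "As of")

# Drop the "As of" line itself by keeping only the last three lines
current=$(printf '%s\n' "$block" | tail -n 3)
printf '%s\n' "$current"
```

With a live page, the first printf would be replaced by the lynx -dump call.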

It's important to note that what you see on a web page may not match the text that Lynx outputs, so some trial-and-error testing might be required.

Figure 4 uses string manipulation to find the new snow at Sunshine Ski Resort. The resort's web page uses JavaScript to show the new snow in either centimeters or inches, but the Lynx text output displays both values and their units.

Figure 4: Here, Bash string manipulation extracts the desired web data.

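To remove parts of a string variable, you can use Bash parameter expansion: %% removes the longest match of a pattern from the end of the string, which leaves the first part, and # removes the shortest match from the start of the string, which leaves the last part (as shown in Listing 1).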

Listing 1

Extracting Parts of a String

$ newsnow="5.2cm2.0"
$ # get the part before 'cm'
$ echo "${newsnow%%cm*}"
5.2
$ # get the part after 'cm'
$ echo "${newsnow#*cm}"
2.0
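Bash has four related trimming operators, and the difference between the shortest-match and longest-match forms only shows up when the pattern occurs more than once. The sketch below uses a made-up string (not real page output) that contains "cm" and "=" twice each.

```shell
s="new=5.2cm,base=150cm"   # made-up string for illustration

echo "${s%cm*}"    # shortest match removed from the end:   new=5.2cm,base=150
echo "${s%%cm*}"   # longest match removed from the end:    new=5.2
echo "${s#*=}"     # shortest match removed from the start: 5.2cm,base=150cm
echo "${s##*=}"    # longest match removed from the start:  150cm
```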

A Bash Web Scraping Project

To get excited before a family ski trip, I wanted to create a morning notification script that would show the new morning snow and the base snow.

To create the notification script (Listing 2), I used two passes with the Lynx utility. The first pass scrapes for new snow (shown in Figure 4) and then a second pass gets the snow base (shown in Figure 1). The snow results are then passed as a string ($msg) to the notify-send utility [4], which posts the message to the workstation desktop (Figure 5). You can schedule this Bash script to run every morning using either cron or the at utility.

Listing 2

Bash Web Scraping Notification Script

#!/bin/bash
#
# skitrip.sh - show the Sunshine ski conditions in a notification
#
theurl="https://www.snow-forecast.com/resorts/Sunshine/6day/mid"

# Get the new snow depth (keep the text up to the centimeter value)
thestr="New snow in Sunshine Village:"
result=$(lynx -dump "$theurl" | grep "$thestr")
newsnow="${result%%cm*} cm"

# Get the base
thestr="Top Lift:"
base=$(lynx -dump "$theurl" | grep "$thestr")

# Show the results in a desktop notification that expires after
# 120 seconds (the -t option takes milliseconds)
msg="$newsnow\n$base (base)"
icon="$HOME/Downloads/mountain.png"
notify-send -t 120000 -i "$icon" "Sunshine Ski Resort" "$msg"

Figure 5: Using Bash web scraping, the notification script displays the daily snow report.
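For the cron option, one thing to watch is that cron jobs usually run without the desktop session's environment, so notify-send may not find the display or the session bus. The sketch below builds a possible crontab entry for 7:00 every morning. The script location, DISPLAY=:0, and the /run/user/UID/bus socket path are all assumptions that are typical of a single-user, systemd-based desktop and may need adjusting on your system.

```shell
script="$HOME/skitrip.sh"   # assumed location of the notification script
uid=$(id -u)                # numeric user ID, used in the assumed bus path

# Fields: minute hour day-of-month month day-of-week command
entry="0 7 * * * DISPLAY=:0 DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$uid/bus $script"

# Print the entry; to install it, run `crontab -e` and paste the line
printf '%s\n' "$entry"
```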

Summary

Scraping web pages can be tricky, and the pages can change at any time. For this reason, it is always best to check if an API is available before looking at web scraping.
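Because a layout change can make a search string silently disappear, a script can at least check whether grep found anything before using the result. The following is a minimal sketch: the helper name extract_reading is hypothetical, and the page variable holds made-up sample text rather than real Lynx output.

```shell
# extract_reading TEXT LABEL: print the first line of TEXT containing LABEL.
# Returns non-zero if LABEL does not occur (hypothetical helper).
extract_reading() {
    printf '%s\n' "$1" | grep -m 1 -F -- "$2"
}

# Made-up sample standing in for `lynx -dump` output
page='   New snow in Sunshine Village: 5.2cm2.0in
   Top Lift: 150 cm'

if base=$(extract_reading "$page" "Top Lift:"); then
    # Keep only the text after the label
    echo "Base: ${base#*: }"
else
    echo "Label not found; the page layout may have changed" >&2
fi
```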

Python with the Beautiful Soup library has been my go-to approach for web scraping, but it's nice to know that a simple Bash alternative is also available.

The Author

You can investigate more neat projects by Pete Metcalfe and his daughters at https://funprojects.blog.
