Scraping the web for data
Data Harvesting
Web scraping lets you automatically download and extract data from websites to build your own database. With a simple scraping script, you can harvest information from the web.
If you are looking to collect data from the Internet for a personal database, your first stop is often a Google search. However, a search for mortgage rates can (in theory) return dozens of pages full of relevant images and data, as well as a lot of irrelevant content. You could visit every web page pulled up by your search and cut and paste the relevant data into your database. Or you could use a web scraper to automatically download and extract raw data from the web pages and reformat it into a table, graph, or spreadsheet on your computer.
Not just big data professionals, but also small business owners, teachers, students, shoppers, or just curious people can use web scraping to do all manner of tasks from researching a PhD thesis to creating a database of local doctors to comparing prices for online shopping. Unless you need to do really complicated stuff with super-optimized performance, web scraping is relatively easy. In this article, I'll show you how web scraping works with some practical examples that use the open source tool Beautiful Soup.
Caveats
Web scraping does have its limits. First, you have to start off with a well-crafted search engine query; web scraping can't replace the initial search. To protect their business and observe legal constraints, search engines deploy anti-scraping features, and overcoming them is not worth the occasional web scraper's time. Instead, web scraping shines (and is irreplaceable) after you have completed your web search.
Web page complexity, which has increased over the past 10 to 15 years, also affects web scraping. Due to this complexity, determining the relevant parts of a web page's source code and how to process it has become more time consuming despite the great progress made by web scraping tools.
Dynamic content poses another problem for web scraping. Today, most pages refresh continuously, change layout from one moment to the next, and are customized for each visitor. This makes the scraping code more complicated and means the scraper will not always extract exactly what you would see in your browser. This is particularly problematic for online shopping. Not only does the price change frequently, but it also depends on many independent factors, from shipping costs to your buyer profile, shopping history, or preferred payment method – all of which are outside web scraping's radar. Consequently, the best deal found by an ordinary scraping script may be quite different from what you would be offered when clicking on the corresponding link. (Unless you spent so much time tweaking the web scraper that you had no time left to enjoy your purchases!)
Additionally, many web pages only display properly after some JavaScript code has been downloaded and run or after some interaction with the user. Others, like Tumblr blogs, run an "infinite scroll" function. In all of these cases, a scraper may start parsing the code before it is ready for viewing, thus failing to deliver what you would see in your browser.
Changing HTML tags is yet another issue. Scraping works by recognizing certain HTML tags, with certain labels, in the web page you want to scrape. The label names may change after a software upgrade or adoption of a different graphic theme, resulting in your scraper script failing until you update its code accordingly.
Scraping can consume a lot of bandwidth, which may create real problems for the website you are scraping. To remedy this, make your scrapers as slow as possible and scrape only when there is no alternative (i.e., when webmasters don't provide direct access to the data you want), and everybody will be happy.
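One simple way to keep a scraper slow and polite is to pause between requests. Here is a minimal sketch of that pattern; the `scrape_all` helper, its `fetch` callable, and the two-second default are my own invention for illustration, not part of any library:

```python
import time

def scrape_all(urls, fetch, delay=2.0):
    """Download each URL via fetch(), pausing between requests
    so the target server is never flooded."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request after the first
        results.append(fetch(url))
    return results
```

With the requests library installed, you could pass `fetch=requests.get` and a list of page URLs; raising `delay` makes the scraper gentler on the remote site.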
Finally, a website's business needs, as well as copyright and other legal issues, stand in the way of web scraping. Many webmasters try their best to block automated scraping of their content. This article does not attempt to address the copyright issues related to screen scraping, which can vary with the site requirements and jurisdiction.
Web Scraping Steps
In spite of these caveats, web scraping remains an immensely useful activity (and if you ask me, a really fun one) in many real world cases. In practice, every web scraping project goes through the same five steps: discovery, page analysis, automatic download, data extraction, and data archival.
In the discovery phase, you search for the pages you want to scrape via a search engine or by simply looking at a website's home page.
During page analysis, you study the HTML code of your selected pages to determine the location of the desired elements and how a scraping script might recognize them. Most of the time, HTML tags' id and class attributes are the easiest to detect and use for web scraping, but that is not always the case. Only visual inspection can confirm this and give you the names of those attributes. You can get a web page's HTML code by saving the complete web page on your computer or by right-clicking in your browser and selecting View Page Source (or View Selection Source for a selected paragraph).
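To illustrate what page analysis is hunting for, here is a sketch that picks data out of a hypothetical HTML fragment by its id and class attributes with Beautiful Soup (the tag names and attribute values are invented for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a mortgage-rate page
html = """
<div id="rates">
  <span class="rate">6.9%</span>
  <span class="rate">7.1%</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="rates")  # locate the block by its id
rates = [s.get_text() for s in container.find_all("span", class_="rate")]
print(rates)  # ['6.9%', '7.1%']
```

Note the trailing underscore in `class_`: Beautiful Soup uses it because `class` is a reserved word in Python.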
The final three steps (automatic download, data extraction, and data archival) involve writing a script that will actually download the page(s), find the right data inside them, and write that data to an external file for later processing. The most common format, which I use in two of my examples, is CSV (comma-separated values), a plain text format with one record per line and fields separated by commas or other predefined characters (I prefer pipes). JSON (JavaScript Object Notation) is another popular choice that is more efficient for certain applications.
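Writing the extracted records with pipe separators takes only a few lines with Python's standard csv module; the filename, field names, and rows below are made up for illustration:

```python
import csv

# Example records a scraper might have extracted
rows = [
    ("Dr. Smith", "Cardiology", "555-0101"),
    ("Dr. Jones", "Pediatrics", "555-0102"),
]

with open("doctors.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")  # pipe-separated fields
    writer.writerow(["name", "specialty", "phone"])  # header line
    writer.writerows(rows)
```

The same `delimiter` argument works for `csv.reader`, so the file can be read back into your analysis scripts just as easily.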
Remember to keep both your code and your output data as simple as possible. For most web scraping activities, you will only grab the data once, but may then spend days or months analyzing it. Consequently, it doesn't make sense to optimize the scraping code. For instance, a program that will run once while you sleep (even if it takes a whole night to finish) isn't worth spending two hours of your time to optimize. In terms of your output data, it's difficult to know in advance all the possible ways you may want to process it. Therefore, just make sure that the extracted data is correct and laid out in a structured way when you save it. You can always reformat the data later, if necessary.
Beautiful Soup
Currently, the most popular open source web scraping tool is Beautiful Soup [1], a Python library for extracting data out of HTML and XML files. It has a large user community and very good documentation. If Beautiful Soup 4 is not available as a binary package for your Linux distribution, you can install it with the Python package manager, pip. On Ubuntu and other Apt-based distributions, you can install pip and Beautiful Soup with these two commands:
sudo apt-get install python3-pip
pip install requests beautifulsoup4
With Beautiful Soup, you can create scraping scripts for simple text and image scraping or for more complex projects.
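As a starting point, here is a minimal end-to-end sketch. The parsing step is factored into its own function so it can be tested on saved HTML; the URL and the `h2`/`title` selector are placeholders you would replace with whatever your own page analysis turns up:

```python
from bs4 import BeautifulSoup

def extract_titles(html):
    """Return the text of every <h2 class="title"> element."""
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True)
            for h in soup.find_all("h2", class_="title")]

# To scrape a live page (the URL is a placeholder), you would do:
#   import requests
#   page = requests.get("https://example.com/listings", timeout=10)
#   titles = extract_titles(page.text)

sample = "<h2 class='title'> First </h2><h2 class='title'>Second</h2>"
print(extract_titles(sample))  # ['First', 'Second']
```

Keeping the download and the extraction separate also means you can re-run the parser on locally saved pages while you debug it, without hitting the website again.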