Scraping the web for data
Data Harvesting

© Lead Image © faithie, 123RF.com
Web scraping lets you automatically download and extract data from websites to build your own database. With a simple scraping script, you can harvest information from the web.
If you are looking to collect data from the Internet for a personal database, your first stop is often a Google search. However, a search for mortgage rates can (in theory) return dozens of pages full of relevant images and data, as well as a lot of irrelevant content. You could visit every web page pulled up by your search and cut and paste the relevant data into your database. Or you could use a web scraper to automatically download and extract raw data from the web pages and reformat it into a table, graph, or spreadsheet on your computer.
Not just big data professionals, but also small business owners, teachers, students, shoppers, or just curious people can use web scraping to do all manner of tasks from researching a PhD thesis to creating a database of local doctors to comparing prices for online shopping. Unless you need to do really complicated stuff with super-optimized performance, web scraping is relatively easy. In this article, I'll show you how web scraping works with some practical examples that use the open source tool Beautiful Soup.
Caveats
Web scraping does have its limits. First, you have to start off with a well-crafted search engine query; web scraping can't replace the initial search. To protect their business and observe legal constraints, search engines deploy anti-scraping features; overcoming them is not worth the time of the occasional web scraper. Instead, Web scraping shines (and is irreplaceable) after you have completed your web search.
[...]
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News
-
Linux Mint 22.2 Beta Available for Testing
Some interesting new additions and improvements are coming to Linux Mint. Check out the Linux Mint 22.2 Beta to give it a test run.
-
Debian 13.0 Officially Released
After two years of development, the latest iteration of Debian is now available with plenty of under-the-hood improvements.
-
Upcoming Changes for MXLinux
MXLinux 25 has plenty in store to please all types of users.
-
A New Linux AI Assistant in Town
Newelle, a Linux AI assistant, works with different LLMs and includes document parsing and profiles.
-
Linux Kernel 6.16 Released with Minor Fixes
The latest Linux kernel doesn't really include any big-ticket features, just a lot of lines of code.
-
EU Sovereign Tech Fund Gains Traction
OpenForum Europe recently released a report regarding a sovereign tech fund with backing from several significant entities.
-
FreeBSD Promises a Full Desktop Installer
FreeBSD has lacked an option to include a full desktop environment during installation.
-
Linux Hits an Important Milestone
If you pay attention to the news in the Linux-sphere, you've probably heard that the open source operating system recently crashed through a ceiling no one thought possible.
-
Plasma Bigscreen Returns
A developer discovered that the Plasma Bigscreen feature had been sitting untouched, so he decided to do something about it.
-
CachyOS Now Lets Users Choose Their Shell
Imagine getting the opportunity to select which shell you want during the installation of your favorite Linux distribution. That's now a thing.