Build your own crawlers
Spider, Spider
© Lead Image © mtkang, 123RF.com
Scrapy is an open source framework written in Python that lets you build your own crawlers with minimal effort for professional results.
A crawler demonstrates the capabilities of version 1.0 of the Scrapy framework [1] running under Python 2.7 [2]. Scrapy is an open source framework for extracting data from websites. It recursively crawls through HTML documents and follows all the links it finds.
In the spirit of HTML5, the test created in this article is designed to reveal non-semantic markup on websites. The crawler counts the number of words used per page, as well as the number of characteristic tag groups (Table 1), saving the results along with the URL in a database.
To install the required packages, I used the Debian 8 Apt package manager:
[...]
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Linux Kernel Project Releases Project Continuity Document
What happens to Linux when there's no Linus? It's a question many of us have asked over the years, and it seems it's also on the minds of the Linux kernel project.
-
Mecha Systems Introduces Linux Handheld
Mecha Systems has revealed its Mecha Comet, a new handheld computer powered by – you guessed it – Linux.
-
MX Linux 25.1 Features Dual Init System ISO
The latest release of MX Linux caters to lovers of two different init systems and even offers instructions on how to transition.
-
Photoshop on Linux?
A developer has patched Wine so that it'll run specific versions of Photoshop that depend on Adobe Creative Cloud.
-
Linux Mint 22.3 Now Available with New Tools
Linux Mint 22.3 has been released with a pair of new tools for system admins and some pretty cool new features.
-
New Linux Malware Targets Cloud-Based Linux Installations
VoidLink, a new Linux malware, should be of real concern because of its stealth and customization.
-
Say Goodbye to Middle-Mouse Paste
Both Gnome and Firefox have proposed getting rid of a long-time favorite Linux feature.
-
Manjaro 26.0 Primary Desktop Environments Default to Wayland
If you want to stick with X.Org, you'll be limited to the desktop environments you can choose.
-
Mozilla Plans to AI-ify Firefox
With a new CEO in control, Mozilla is doubling down on a strategy of trust, all the while leaning into AI.
-
Gnome Says No to AI-Generated Extensions
If you're a developer wanting to create a new Gnome extension, you'd best set aside that AI code generator, because the extension team will have none of that.

