Build your own crawlers

Spider, Spider

Article from Issue 191/2016

Author(s): Andreas Moller

Scrapy is an open source framework written in Python that lets you build your own crawlers with minimal effort for professional results.

A crawler demonstrates the capabilities of version 1.0 of the Scrapy framework [1] running under Python 2.7 [2]. Scrapy is an open source framework for extracting data from websites. It recursively crawls through HTML documents and follows all the links it finds.

In the spirit of HTML5, the test created in this article is designed to reveal non-semantic markup on websites. The crawler counts the number of words used per page, as well as the number of characteristic tag groups (Table 1), saving the results along with the URL in a database.

To install the required packages, I used the Debian 8 Apt package manager:

[...]

Use Express-Checkout link below to read the full article (PDF).

Buy this article as PDF

Download Article PDF now with Express Checkout

Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES

Print Issues

Digital Issues

SUBSCRIPTIONS

Print Subscriptions

Digital Subscriptions

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News

Framework Laptop 13 Pro Competes with the Best

Hardware , laptop , Linux

Framework has released what might be considered the MacBook of Linux devices.
The Latest CachyOS Features Supercharged Kernel

Arch Linux , CachyOS , Operating Systems

The latest release of CachyOS brings with it an enhanced version of the latest Linux kernel.
Kernel 7.0 Is a Bit More Rusty

Kernel , Performance , Rust

Linux kernel 7.0 has been released for general availability, with Rust finally getting its due.
France Says "Au Revoir" to Microsoft

Digital Soverei... , Linux , open source

In a move that should surprise no one, France announced plans to reduce its reliance on US technology, and Microsoft Windows is the first to get the boot.
CIQ Releases Compatibility Catalog for Rocky Linux

Enterprise Linux , Linux , Rocky Linux

The company behind Rocky Linux is making an open catalog available to developers, hobbyists, and other contributors, so they can verify and publish compatibility with the CIQ lineup.
KDE Gets Some Resuscitation

KDE , Linux , Plasma

KDE is bringing back two themes that vanished a few years ago, putting a bit more air under its wings.
Ubuntu 26.04 Beta Arrives with Some Surprises

Games , graphics , Ubuntu

Ubuntu 26.04 is almost here, but the beta version has been released, and it might surprise some people.
Ubuntu MATE Dev Leaving After 12 years

projects , Ubuntu , Ubuntu MATE

Martin Wimpress, the maintainer of Ubuntu MATE, is now searching for his successor. Are you the next in line?
Kali Linux Waxes Nostalgic with BackTrack Mode

Kali Linux , Operating Systems , penetration tes...

For those who've used Kali Linux since its inception, the changes with the new release are sure to put a smile on your face.
Gnome 50 Smooths Out NVIDIA GPU Issues

Desktop , Games , Gnome

Gamers rejoice, your favorite pastime just got better with Gnome 50 and NVIDIA GPUs.

Build your own crawlers

Spider, Spider

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

News

Framework Laptop 13 Pro Competes with the Best

The Latest CachyOS Features Supercharged Kernel

Kernel 7.0 Is a Bit More Rusty

France Says "Au Revoir" to Microsoft

CIQ Releases Compatibility Catalog for Rocky Linux

KDE Gets Some Resuscitation

Ubuntu 26.04 Beta Arrives with Some Surprises

Ubuntu MATE Dev Leaving After 12 years

Kali Linux Waxes Nostalgic with BackTrack Mode

Gnome 50 Smooths Out NVIDIA GPU Issues

Build your own crawlers

Spider, Spider

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters Find Linux and Open Source Jobs Subscribe to our ADMIN Newsletters

Support Our Work

News

Tag Cloud

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters