Build your own crawlers
Spider, Spider

Lead Image © mtkang, 123RF.com
Scrapy is an open source framework written in Python that lets you build your own crawlers with minimal effort for professional results.
This article uses a sample crawler to demonstrate the capabilities of version 1.0 of the Scrapy framework [1] running under Python 2.7 [2]. Scrapy is an open source framework for extracting data from websites; it recursively crawls HTML documents and follows all the links it finds.
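The idea of following every link can be illustrated without Scrapy. The sketch below is not the article's code, and it makes no claim about how Scrapy works internally. It assumes Python 3 and uses only the standard library. The `PAGES` dictionary, the `example.test` URLs, and the names `LinkParser` and `crawl` are hypothetical stand-ins for a real site and real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML body.
# It stands in for real HTTP requests so the sketch runs offline.
PAGES = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="/">Home</a> <a href="/c.html">C</a>',
    "http://example.test/b.html": "<p>No links here.</p>",
    "http://example.test/c.html": '<a href="/a.html">Back to A</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch=PAGES.get):
    """Visit start_url and every page reachable from it, each exactly once."""
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])  # URLs still waiting to be visited
    visited = []                # URLs in the order they were processed
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # unknown page: nothing to parse
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Calling `crawl("http://example.test/")` visits all four pages once, even though the pages link back to each other. The sketch uses a queue instead of literal recursion, which keeps the visiting order predictable and avoids deep call stacks on large sites.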
In the spirit of HTML5, the test created in this article is designed to reveal non-semantic markup on websites. The crawler counts the number of words used per page, as well as the number of characteristic tag groups (Table 1), saving the results along with the URL in a database.
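The measurement described above can be sketched as follows. This is an illustration only, not the article's Scrapy code. It assumes Python 3 with the standard library and SQLite as the database, although the article does not say which database it uses. The tag groups in `TAG_GROUPS` are hypothetical placeholders for the groups defined in Table 1, and the names `PageStats`, `open_db`, and `analyze` are likewise invented for this example.

```python
import re
import sqlite3
from html.parser import HTMLParser

# Hypothetical tag groups; the article's Table 1 defines the real ones.
TAG_GROUPS = {
    "semantic": {"article", "section", "nav", "header", "footer", "aside"},
    "generic": {"div", "span"},
    "layout": {"table", "tr", "td", "font", "center"},
}


class PageStats(HTMLParser):
    """Counts words in text nodes and opening tags per tag group."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.groups = {name: 0 for name in TAG_GROUPS}
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        for name, tags in TAG_GROUPS.items():
            if tag in tags:
                self.groups[name] += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Count only visible text, not script or stylesheet source.
        if not self._skip:
            self.words += len(re.findall(r"\w+", data))


def open_db(path=":memory:"):
    """Open the SQLite database (an assumption) and create the result table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, words INTEGER, "
        "semantic INTEGER, generic INTEGER, layout INTEGER)"
    )
    return conn


def analyze(url, html, conn):
    """Parse one page and store its counts together with the URL."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    conn.execute(
        "INSERT INTO pages (url, words, semantic, generic, layout) "
        "VALUES (?, ?, ?, ?, ?)",
        (url, stats.words, stats.groups["semantic"],
         stats.groups["generic"], stats.groups["layout"]),
    )
    return stats
```

In this sketch, a page with many `generic` or `layout` tags relative to its word count and `semantic` tags would be a candidate for non-semantic markup. Where to draw that line is a judgment call the sketch leaves open.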
To install the required packages, I used the Debian 8 Apt package manager:
[...]