Build your own crawlers
Pneumatic Tube
The item pipeline continues to process the Item objects; the ITEM_PIPELINES variable configures the pipeline in lines 4-8 of the sample application in Listing 3. The pipeline passes each Item object, one by one, to an object of the pipeline classes Words (300), Injections (400), and Attributes (500), which modify or store the items and push them forward.
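The priorities 300, 400, and 500 are quoted in the text; assuming that the classes live in the modules named in the listing captions below, the corresponding settings entry might look roughly like this:

ITEM_PIPELINES = {
    'mirror.pipelines.normalize.Words': 300,     # normalize the word lists
    'mirror.pipelines.filter.Injections': 400,   # filter foreign resources
    'mirror.pipelines.store.Attributes': 500,    # store the results in SQLite
}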
Listing 6 shows the code for the Words class, which checks and standardizes the words for the subsequent evaluation. Scrapy creates the pipeline objects and calls their process_item() method for each item object (line 2). The method expects the item object as its second argument and the spider object, a reference to the calling spider, as its third argument.
Listing 6
mirror/pipelines/normalize.py
The next two lines overwrite the values of the keywords and words attributes. In line 4, the filter() function fishes all the entries out of the item[key] list for which the lambda function (lambda wd: wd.isalnum()) returns a true value; this keeps only strings that consist purely of alphanumeric characters. A second lambda function, passed to map(), then converts the words in the resulting set to lowercase. The return statement in the last line hands over the item object to the next pipeline object (Listing 7), which considers the possible effect of foreign content on the current browser session.
Listing 7
mirror/pipelines/filter.py
The Injections pipeline class from Listing 7 relies on the helper functions is_absurl() and domain(), which come from mirror/utils.py, to reduce the list of resources to those bound to foreign content. To do this, the process_item() method overwrites the attribute with the filtered list that the list expression in line 5 creates. The first for loop only reads the URL of the page; the second for loop iterates over the tags to be injected. If the if statement determines that the attribute is an absolute URL from a domain other than the current one, the loop picks up the resource in the other variable. The return statement hands over the item altered in this way to the last link in the pipeline (Listing 8), which reduces the results for later evaluation and stores them in a database.
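Again only as a hedged sketch: the filtering just described could be written roughly as follows, assuming the page address is stored in item['url'] and the injected resource URLs in item['injections']; is_absurl() and domain() are the helpers from mirror/utils.py mentioned in the text:

from mirror.utils import is_absurl, domain

class Injections(object):
    """Reduces the injected resources to those that load foreign content."""

    def process_item(self, item, spider):
        item['injections'] = [
            other
            for url in [item['url']]          # the address of the page itself
            for other in item['injections']   # the resources to be injected
            # keep only absolute URLs that point to a different domain
            if is_absurl(other) and domain(other) != domain(url)
        ]
        return item  # pass the filtered item on to the Attributes pipeline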
Listing 8
mirror/pipelines/store.py
The Attributes pipeline class (Listing 8, line 6) evaluates the item object and stores the results in an SQLite database file [9]. The free, SQL-compatible database engine does not require a server process and supports all common programming languages. Listing 8 writes all data directly and synchronously into the database file. Line 1 binds the matching driver for Python, and the next three lines import the required functions from the standard packages os.path and time and from the project's own mirror/utils.py.
As its second parameter, the __init__() constructor accepts the path under which Python opens the SQLite database. Scrapy uses the from_crawler() class method (lines 10-12) to instantiate the object. A look at the method shows that from_crawler() receives the crawler object as a parameter, which gives it access to the settings from Listing 3. It first reads the value of the RESULTS variable (Listing 3, line 9) and then passes that value to the constructor call in lines 7 and 8 (Listing 8). Finally, gmtime() and strftime(), in combination with a string formatting variant, generate a timestamp for the file name.
The Scrapy engine calls the open_spider() method (lines 14-17) once only, in the style of a callback function, when it creates the spider. The method creates a database connection and stores it in the conn attribute in line 15. Specifying isolation_level = None tells the driver to commit each SQL statement immediately and persistently to the database file. Line 16 creates and stores the database cursor object that runs the database operations. These operations also include the SQL command that creates the results table:
CREATE TABLE Attributes (url text PRIMARY KEY, keywords int, words int, relevancy int, tags int, semantics int, medias int, links int, injections int)
The process_item() method from line 22 combines the values of the item object with the Attributes table, according to Table 2, using the SQL INSERT command. The question marks in the statement are placeholders that the driver replaces with the values from the tuple in parentheses. The len() function determines the lengths of the parsed lists in several places. The helper function optvalue() swaps None values for empty lists; relevance() determines the incidence of all keywords in the remaining text of the website.
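Condensing the preceding paragraphs into code, a much-simplified version of the Attributes pipeline might look like the following sketch. The RESULTS setting, the table layout, and the helpers optvalue() and relevance() are taken from the article; the file name pattern, the item field names, and the signature of relevance() are assumptions:

import sqlite3
from os.path import join
from time import gmtime, strftime

from mirror.utils import optvalue, relevance

class Attributes(object):
    """Counts the collected attributes and stores them in an SQLite file."""

    def __init__(self, path):
        self.path = path  # location of the SQLite database file

    @classmethod
    def from_crawler(cls, crawler):
        # read the RESULTS setting and build a timestamped file name
        results = crawler.settings.get('RESULTS')
        stamp = strftime('%Y%m%d-%H%M%S', gmtime())
        return cls(join(results, 'attributes-%s.db' % stamp))

    def open_spider(self, spider):
        # isolation_level=None makes the driver commit every statement at once
        self.conn = sqlite3.connect(self.path, isolation_level=None)
        self.cursor = self.conn.cursor()
        self.cursor.execute(
            'CREATE TABLE Attributes (url text PRIMARY KEY, keywords int, '
            'words int, relevancy int, tags int, semantics int, medias int, '
            'links int, injections int)')

    def process_item(self, item, spider):
        keywords = optvalue(item.get('keywords'))
        words = optvalue(item.get('words'))
        self.cursor.execute(
            'INSERT INTO Attributes VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
            (item['url'], len(keywords), len(words),
             relevance(keywords, words),  # assumed signature
             len(optvalue(item.get('tags'))),
             len(optvalue(item.get('semantics'))),
             len(optvalue(item.get('medias'))),
             len(optvalue(item.get('links'))),
             len(optvalue(item.get('injections')))))
        return item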
Table 2
Interpretation of Acquired Data

| Metric | Computation | Interpretation |
|---|---|---|
| Relevancy | – | Measure of the credibility of the title |
| Entropy | (words+semantics)/(words+tags) | Non-semantic tags such as div or span reduce the information content |
| Expressivity | semantics/tags | Semantic tags improve the functional classification of document components |
| Richness | medias | Media enrich the content |
| Reliability | links/words | Links vouch for the credibility of the page |
| Mutability | injections | External resources can alter the page |
Evaluation
As discussed earlier in the article, you launch the crawler at the command line from within the mirror project directory:
scrapy crawl attr
Listing 9 shows the SQL query that generates the report shown in Figure 4 according to Tables 1 and 2. The strength of SQL lies in its compact, almost sentence-like style of expression; however, converting the types and formatting the output requires some tedious typing.
Listing 9
Report SQL Query
The endogenous page factors evaluate a web page from various perspectives. The derived entropy value is a measure of the average information content of the page. The terms are borrowed from information theory [10], but the values are not identical to their information-theoretical counterparts. In the sample application, they only describe how non-semantic tags like div and span dilute the content. If the entropy value were 1, the generic spider would achieve better results. The average of 0.837 in Figure 4 indicates some scope for improvement.
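As a hedged illustration of what such a report query can look like, the following snippet computes the derived values from Table 2 with the standard sqlite3 module; the column names follow the CREATE TABLE statement above, while the database file name and the output format are assumptions, and the real Listing 9 may differ:

import sqlite3

# open one of the timestamped result files written by the Attributes pipeline
conn = sqlite3.connect('attributes-20170101-000000.db')

query = ('SELECT url, relevancy, '
         'CAST(words + semantics AS REAL) / (words + tags) AS entropy, '
         'CAST(semantics AS REAL) / tags AS expressivity, '
         'medias AS richness, '
         'CAST(links AS REAL) / words AS reliability, '
         'injections AS mutability '
         'FROM Attributes ORDER BY entropy DESC')

for row in conn.execute(query):
    print(row)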
Conclusion
Programming with Scrapy is fun and offers surprising insights. Thanks to the cleverly chosen modularization and good documentation, users can focus on extracting and accumulating data. If you delve deeper into Scrapy, you will also see the multitude of aspects it covers and the professional approach the framework pursues.
Infos
[1] Scrapy framework: http://scrapy.org
[2] Python: https://python.org
[3] Python package index: http://pypi.python.org
[4] Scrapy docs: http://doc.scrapy.org/en/latest/topics/architecture.html
[5] Robots exclusion standard: https://en.wikipedia.org/wiki/Robots_Exclusion_Standard
[6] CSS selectors: https://api.jquery.com/category/selectors/
[7] XPath expressions: https://www.w3.org/TR/xpath/
[8] Meaning of __init__.py: http://stackoverflow.com/questions/448271/what-is-init-py-for
[9] SQLite: http://sqlite.org
[10] Information theory: https://en.wikipedia.org/wiki/Information_theory