An XML, HTML, and JSON data extraction tool
Easy Extraction
Xidel lets you easily extract and process data from XML, HTML, and JSON documents.
There are numerous ways to scrape a web page for data. The right mix of Python modules and glue code could probably do the trick, but sometimes you just want a convenient tool that extracts data from websites for you. Xidel [1], a multi-platform command-line tool, offers a one-stop alternative for quickly extracting, processing, and saving data from XML, HTML, or JSON documents.
Under the Hood
Xidel wraps XQuery, XPath, and JSON into one convenient front end. XQuery, a W3C Recommendation since 2007, lets you query XML or HTML files as if they were databases, process the extracted data as desired, and save the results to other files. As shown in the XQuery tutorial [2], XQuery-capable software can complete requests like finding all the CDs in an online catalog that cost less than $10, sorted by release date.
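To make the catalog request concrete, the following is a minimal sketch rather than anything taken from the tutorial. It assumes a small, hypothetical catalog.xml whose element names (catalog, cd, title, price, year) are invented for this illustration, and it passes an XQuery FLWOR expression to Xidel through the --xquery option. The call is guarded so that the script only skips the query if Xidel is not installed.

```shell
#!/bin/bash
# Hypothetical sample data: the element names (catalog, cd, title, price, year)
# are assumptions made for this illustration only.
workdir=$(mktemp -d)
cat > "$workdir/catalog.xml" <<'EOF'
<catalog>
  <cd><title>Second Album</title><price>8.90</price><year>1999</year></cd>
  <cd><title>Box Set</title><price>24.00</price><year>2001</year></cd>
  <cd><title>First Album</title><price>9.50</price><year>1987</year></cd>
</catalog>
EOF

# FLWOR expression: keep the CDs that cost less than 10, order them by year,
# and return their titles.
query='for $cd in //cd
       where xs:decimal($cd/price) lt 10
       order by xs:integer($cd/year)
       return $cd/title'

if command -v xidel >/dev/null 2>&1; then
  xidel --silent --input-format=xml "$workdir/catalog.xml" --xquery="$query"
else
  echo "xidel is not installed; skipping the query" >&2
fi
```

The explicit xs:decimal and xs:integer casts are a design choice: Without them, the untyped element content could be compared or sorted as text instead of as numbers.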
Xidel also fully supports XPath [3], another W3C Recommendation, and the data-interchange format JavaScript Object Notation (JSON) [4], which is standardized outside the W3C. XPath defines both a syntax for identifying the elements of an XML document and a library of standard functions that make it easy to navigate through such elements and extract them. JSON represents data as objects made of unordered sets of name/value pairs and as arrays, which are ordered lists of values (I'll show some examples of this later on in this article).
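As a rough illustration of the two halves of XPath, the path syntax and the function library, the sketch below runs three expressions against a small sample file. The file name books.xml and its element and attribute names are hypothetical; count() and upper-case() are standard XPath functions.

```shell
#!/bin/bash
# Hypothetical sample document; element and attribute names are assumptions.
workdir=$(mktemp -d)
cat > "$workdir/books.xml" <<'EOF'
<books>
  <book lang="en"><name>Alpha</name></book>
  <book lang="it"><name>Beta</name></book>
</books>
EOF

# Path syntax: select the <name> of every book whose lang attribute is "en".
path_expr='//book[@lang = "en"]/name'
# Function library: count the books, then upper-case every name.
count_expr='count(//book)'
upper_expr='//book/name/upper-case(.)'

if command -v xidel >/dev/null 2>&1; then
  for expr in "$path_expr" "$count_expr" "$upper_expr"; do
    xidel --silent --input-format=xml "$workdir/books.xml" --extract="$expr"
  done
else
  echo "xidel is not installed; skipping the expressions" >&2
fi
```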
Installation
You can download Xidel from the website [1] with just a few clicks. Xidel offers a choice between a binary package in DEB format and a ZIP archive that contains just five files: a digital certificate, the changelog, an exhaustive README file that explains in detail how Xidel works, the executable program, and its installer. The installer (Listing 1) should be run with administrator privileges. At 11 lines, the installer could hardly be simpler.
Listing 1
Installation Script
01 #!/bin/bash
02 PREFIX=$1
03 sourceprefix=
04 if [[ -d programs/internet/xidel/ ]]; then sourceprefix=programs/internet/xidel/; else sourceprefix=./; fi
05 mkdir -p $PREFIX/usr/bin
06
07 install -v $sourceprefix/xidel $PREFIX/usr/bin
08 if [[ -f $sourceprefix/meta/cacert.pem ]]; then
09   mkdir -p $PREFIX/usr/share/xidel
10   install -v $sourceprefix/meta/cacert.pem $PREFIX/usr/share/xidel/;
11 fi
Listing 1 uses the directory passed as its first argument as the installation $PREFIX (line 2). On my computer, I chose the root folder (/), but you may prefer /opt or a similar location. Next, the script simply uses the install program to copy the xidel executable and its certificate into $PREFIX's usr/bin and usr/share/xidel subdirectories, respectively.
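If you would like to see what the installer does before running it with administrator privileges, its two copy steps can be rehearsed in a throwaway staging directory. The sketch below is not the installer itself: It repeats the same mkdir and install calls with placeholder files standing in for the real executable and certificate, and the variable name src is my own.

```shell
#!/bin/bash
# Sketch only: placeholder files stand in for the real xidel binary and its
# certificate, so the steps can be tried without touching system directories.
src=$(mktemp -d)      # plays the role of the unpacked archive (assumed layout)
PREFIX=$(mktemp -d)   # staging directory used instead of / or /opt
printf '#!/bin/sh\necho placeholder\n' > "$src/xidel"
mkdir -p "$src/meta"
echo "placeholder certificate" > "$src/meta/cacert.pem"

# Same two steps as the installer: the executable goes to usr/bin and the
# certificate, if present, to usr/share/xidel.
mkdir -p "$PREFIX/usr/bin"
install -v "$src/xidel" "$PREFIX/usr/bin"
if [ -f "$src/meta/cacert.pem" ]; then
  mkdir -p "$PREFIX/usr/share/xidel"
  install -v "$src/meta/cacert.pem" "$PREFIX/usr/share/xidel/"
fi

# Show the resulting layout.
find "$PREFIX" -type f
```

Because install sets executable permissions by default, the copied file in usr/bin is ready to run without a separate chmod.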
When I tried to launch the program after running the installer, I discovered that Xidel needs the developer versions of libopenssl and libcrypto (I couldn't find this problem documented at the time of writing). However, both libraries are available as native packages in the standard repositories of most distributions (e.g., libssl-dev on Debian derivatives and openssl-devel on Fedora-based systems), so installing them takes only a few minutes.
Main Features
Xidel can interact with websites if it has the proper data and instructions. It can log into websites on your behalf to perform tasks like updating personal information, submitting forms, or downloading private messages. Among other things, Xidel can reach websites using proxies, manage cookies, and pause between connections to prevent overloading servers and subsequently being banned. However, I do not cover these specific Xidel features for one simple reason: Websites change all the time, so any specific examples would be completely obsolete by the time you read this article. If you want to know how Xidel can, for example, handle your Reddit notifications, I recommend first checking the latest examples on the Xidel website and then if necessary asking for support on the Xidel mailing list (which I did to write this article).
As far as automatic data processing is concerned, Xidel reads and parses standard input or plain text files in JSON, XML, and HTML formats. After processing their content according to your instructions, Xidel can output the result in the same formats, as well as plain text or, as I will show later, shell variables. In addition, you can define the output separator between multiple items and create custom headers and footers for your data reports.
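The output options can be tried on a local file. The following is a small sketch, assuming a hypothetical page.html; it uses Xidel's --output-separator option to join several results on one line and --output-format=bash to print a named result as a shell variable assignment. The variable name title is chosen for this illustration.

```shell
#!/bin/bash
# Hypothetical sample page used only for this illustration.
workdir=$(mktemp -d)
cat > "$workdir/page.html" <<'EOF'
<html><head><title>Demo page</title></head>
<body><ul><li>one</li><li>two</li><li>three</li></ul></body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # Join the extracted list items with a custom separator instead of line breaks.
  xidel --silent --output-separator=' | ' "$workdir/page.html" --extract='//li'
  # Print the page title as a shell variable assignment named "title".
  xidel --silent "$workdir/page.html" --extract='title:=//title' --output-format=bash
else
  echo "xidel is not installed; skipping the output examples" >&2
fi
```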
Xidel's two main modes, extract and follow, are often used together. In a nutshell, extract mode extracts and processes data from the current document, which is all you need when the data sits inside one or more local files or web pages. The follow mode starts where extract leaves off: It follows all the links found by previous operations in order to download and process the linked content.
Xidel can run multiple extract and follow actions in the same call, as long as you write them in the right order and never ask to follow data that was not directly passed to Xidel or found by previous extract operations.
In extract mode, Xidel can recognize and select document elements with CSS selectors. To process the extracted data, Xidel uses XPath 3.0 expressions. For more complex tasks, you can use the full XQuery standard to make Xidel run Turing-complete scripts, which StackExchange describes as "any algorithm you could think of, no matter how complex" [5].
However, when you need to extract several pieces of data at once, repeatedly, from specific sections of pages with a fixed structure (e.g., the titles and links of the most viewed topics in a forum), I recommend pattern matching, which I will discuss later.
Syntax-wise, as you will see in the examples I provide later, Xidel extract commands are one-liners that first pass Xidel the file it should process and then, with the --extract= or -e option, a string that contains the actual operations to perform on that document. When that string becomes too long to edit comfortably on the command line, or you want to save it, you can write it to a file and pass the file to Xidel with the --extract-file option.
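The three ways of passing an expression can be compared side by side. This is a minimal sketch on a hypothetical local page; the file names page.html and links.xpath are my own, and the expression saved in the file is an ordinary XPath for expression.

```shell
#!/bin/bash
# Hypothetical sample page used only for this illustration.
workdir=$(mktemp -d)
cat > "$workdir/page.html" <<'EOF'
<html><head><title>Demo page</title></head>
<body><a href="a.html">First</a> <a href="b.html">Second</a></body></html>
EOF

# A longer expression saved to a file for reuse with --extract-file.
cat > "$workdir/links.xpath" <<'EOF'
for $a in //a
return concat($a, " -> ", $a/@href)
EOF

if command -v xidel >/dev/null 2>&1; then
  # Long and short forms of the same option.
  xidel --silent "$workdir/page.html" --extract='//a/@href'
  xidel --silent "$workdir/page.html" -e '//a/@href'
  # Same mechanism, with the expression read from a file.
  xidel --silent "$workdir/page.html" --extract-file="$workdir/links.xpath"
else
  echo "xidel is not installed; skipping the extract examples" >&2
fi
```

Single quotes around the expression keep the shell from interpreting characters such as $ and @ before Xidel sees them.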
The option for the follow mode is --follow= or -f. As with extract, this option gives Xidel the expression that describes which element or sequence of elements should be followed. There are many other options for the follow mode, but with one exception, almost all of them mirror the extract options (e.g., you can save your commands in a file and pass it to Xidel with --follow-file). The exception, --follow-level, specifies the maximum recursion level when following pages from other pages. Set this carefully, because its default value is 99,999!
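A follow call with an explicit recursion limit might look like the sketch below. It works on three hypothetical local pages rather than a live website, and it rests on two assumptions that you should check against the README for your version: that relative links are resolved against the location of the local start page, and that --follow-level takes effect when it is placed before --follow.

```shell
#!/bin/bash
# Hypothetical local pages: index.html links to one.html, which links to deep.html.
workdir=$(mktemp -d)
cat > "$workdir/index.html" <<'EOF'
<html><head><title>Index</title></head>
<body><a href="one.html">One</a></body></html>
EOF
cat > "$workdir/one.html" <<'EOF'
<html><head><title>Page one</title></head>
<body><a href="deep.html">Deeper</a></body></html>
EOF
cat > "$workdir/deep.html" <<'EOF'
<html><head><title>Deep page</title></head><body>End</body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # Follow the links found on the start page, with the recursion limit set
  # explicitly instead of relying on the very large default, then print the
  # title of each page that was reached.
  xidel --silent "$workdir/index.html" \
        --follow-level=1 --follow='//a/@href' --extract='//title'
else
  echo "xidel is not installed; skipping the follow example" >&2
fi
```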