An XML, HTML, and JSON data extraction tool

Easy Extraction

Lead Image © Wutthichai Luemuang, 123RF.com

Article from Issue 276/2023
Author(s):

Xidel lets you easily extract and process data from XML, HTML, and JSON documents.

There are numerous ways to scrape a web page for data. In fact, the right mix of Python modules and Python logic glue could probably do the trick, but sometimes you just want a convenient tool that lets you extract data from websites. Xidel [1], a multi-platform command-line tool, offers a one-stop alternative to quickly extract, process, and save data from XML, HTML, or JSON documents.

Under the Hood

Xidel wraps XQuery, XPath, and JSON into one convenient front end. XQuery, a W3C Recommendation since 2007, lets you query XML or HTML files as if they were database servers, process the extracted data as desired, and save data to other files. As shown in the XQuery tutorial [2], XQuery-capable software can complete requests like finding all the CDs in an online catalog that cost less than $10, sorted by release date.
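
To make the catalog request concrete, here is a minimal sketch of how it might be phrased for Xidel. The catalog file, its element names (cd, title, price, year), and the data are invented for illustration. The --silent and --xquery options reflect my reading of the Xidel README, so verify them with xidel --help on your version. The script skips the query if Xidel is not installed.

```shell
#!/bin/bash
# Build a tiny, made-up catalog to query (hypothetical data).
workdir=$(mktemp -d)
cat > "$workdir/catalog.xml" <<'EOF'
<catalog>
  <cd><title>Second Album</title><price>8.50</price><year>1991</year></cd>
  <cd><title>Box Set</title><price>24.00</price><year>1989</year></cd>
  <cd><title>First Album</title><price>9.90</price><year>1988</year></cd>
</catalog>
EOF

# FLWOR expression: titles of CDs cheaper than 10, ordered by release year.
query='for $cd in //cd[price < 10] order by $cd/year return $cd/title'

if command -v xidel >/dev/null 2>&1; then
  xidel --silent "$workdir/catalog.xml" --xquery "$query"
else
  echo "xidel is not installed; skipping the query" >&2
fi
```

The single quotes around the query matter: they keep the shell from expanding $cd before Xidel sees it.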

Xidel also fully supports XPath [3], another W3C Recommendation, and the data-interchange format JavaScript Object Notation (JSON) [4]. XPath defines both a syntax for addressing the elements of an XML document and a library of standard functions that make it easy to navigate through such elements and extract them. JSON represents data as objects, which are unordered sets of name/value pairs, and arrays, which are ordered lists of values (I'll show some examples later in this article).

Installation

You can download Xidel from the website [1] with just a few clicks. Xidel offers the choice between a binary package in DEB format or a ZIP archive that contains just five files: a digital certificate, the changelog, an exhaustive README file that explains in detail how Xidel works, the executable program, and its installer. The installer (Listing 1) should be run with administrator privileges. At 11 lines, the installer could hardly be simpler.

Listing 1

Installation Script

01 #!/bin/bash
02 PREFIX=$1
03 sourceprefix=
04 if [[ -d programs/internet/xidel/ ]]; then sourceprefix=programs/internet/xidel/; else sourceprefix=./;  fi
05 mkdir -p $PREFIX/usr/bin
06
07 install -v $sourceprefix/xidel $PREFIX/usr/bin
08 if [[ -f $sourceprefix/meta/cacert.pem ]]; then
09 mkdir -p $PREFIX/usr/share/xidel
10 install -v $sourceprefix/meta/cacert.pem $PREFIX/usr/share/xidel/;
11 fi

Line 2 of Listing 1 stores the first argument passed to the script in $PREFIX, which becomes the installation root. On my computer, I chose the root folder (/), but you may prefer /opt or a similar location. The script then uses the install program to copy the xidel executable into $PREFIX/usr/bin and, if the archive contains one, the certificate file (meta/cacert.pem) into $PREFIX/usr/share/xidel.
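
Because the prefix is simply prepended to usr/bin and usr/share/xidel, you can rehearse the same steps against a scratch directory before touching the real system. The sketch below mimics the installer's copy logic with a placeholder file instead of the real binary. The temporary directories and the placeholder are my own assumptions and are not part of the Xidel archive.

```shell
#!/bin/bash
# Scratch directories standing in for the unpacked archive and the target prefix.
src=$(mktemp -d)
PREFIX=$(mktemp -d)

# Placeholder for the real xidel binary (hypothetical stand-in).
printf '#!/bin/sh\necho "placeholder"\n' > "$src/xidel"

# Same steps as the installer, with the variables quoted.
mkdir -p "$PREFIX/usr/bin"
install -v "$src/xidel" "$PREFIX/usr/bin"
if [[ -f "$src/meta/cacert.pem" ]]; then
  mkdir -p "$PREFIX/usr/share/xidel"
  install -v "$src/meta/cacert.pem" "$PREFIX/usr/share/xidel/"
fi

# Show what landed where.
find "$PREFIX" -type f
```

Since the placeholder archive has no meta/cacert.pem, the certificate branch is skipped and only usr/bin/xidel appears under the prefix.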

When I tried to launch the program after running the installer, I discovered that Xidel needs the developer versions of libopenssl and libcrypto (I couldn't find this problem documented at the time of writing). However, both libraries are available as native packages in the standard repositories of most distributions (e.g., libssl-dev on Debian derivatives, and openssl-devel on Fedora-based systems), so installing them takes a matter of minutes.

Main Features

Xidel can interact with websites if it has the proper data and instructions. It can log into websites on your behalf to perform tasks like updating personal information, submitting forms, or downloading private messages. Among other things, Xidel can reach websites using proxies, manage cookies, and pause between connections to prevent overloading servers and subsequently being banned. However, I do not cover these specific Xidel features for one simple reason: Websites change all the time, so any specific examples would be completely obsolete by the time you read this article. If you want to know how Xidel can, for example, handle your Reddit notifications, I recommend first checking the latest examples on the Xidel website and then if necessary asking for support on the Xidel mailing list (which I did to write this article).

As far as automatic data processing is concerned, Xidel reads and parses standard input or plain text files in JSON, XML, and HTML formats. After processing their content according to your instructions, Xidel can output the result in the same formats, as well as plain text or, as I will show later, shell variables. In addition, you can define the output separator between multiple items and create custom headers and footers for your data reports.
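
As a rough sketch of these output controls, the following script joins several results with a custom separator and then turns an extracted value into a shell variable. The sample page is invented, and the option names (--output-separator, --output-format=bash) are taken from my reading of the Xidel README, so treat them as assumptions to check against xidel --help. The script skips the Xidel calls if Xidel is not installed.

```shell
#!/bin/bash
# A small, made-up HTML page to work on (hypothetical data).
workdir=$(mktemp -d)
cat > "$workdir/page.html" <<'EOF'
<html><head><title>Sample page</title></head>
<body><h1>First</h1><h1>Second</h1></body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # Join multiple results with a custom separator instead of line breaks.
  xidel --silent "$workdir/page.html" -e '//h1' --output-separator=' | '
  echo

  # Ask for shell-style output of a named variable and evaluate it.
  eval "$(xidel --silent "$workdir/page.html" -e 'title:=//title' --output-format=bash)"
  echo "Title from Xidel: $title"
else
  echo "xidel is not installed; skipping" >&2
fi
```

Only use eval on output from documents you trust, because the extracted text ends up in code that your shell executes.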

Xidel's two main modes, extract and follow, are often used together. In a nutshell, extract mode extracts and processes data from the current document, which is all you need when the data sits inside one or more local files or web pages. Follow mode starts where extract leaves off: It follows the links found by previous operations in order to download and process the links' content.

Xidel can run multiple extract and follow actions in the same call, as long as you write them in the right order and never ask to follow data that was not directly passed to Xidel or found by previous extract operations.

In extract mode, Xidel can select document elements with CSS selectors. To process the extracted data, you use XPath 3.0 expressions. For more complex tasks, you can use the full XQuery standard to make Xidel run Turing-complete scripts, which StackExchange describes as "any algorithm you could think of, no matter how complex" [5].
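
The difference between selecting and processing can be sketched on a made-up forum page. The first call picks elements with a CSS selector; the second selects the same elements with XPath and runs each through the standard upper-case() function. The page content is invented, and the --css and --xpath options are assumptions based on my reading of the Xidel README. The script skips both calls if Xidel is not installed.

```shell
#!/bin/bash
# A made-up forum page with two topics and one ad (hypothetical data).
workdir=$(mktemp -d)
cat > "$workdir/forum.html" <<'EOF'
<html><body>
  <div class="topic"><a href="t1.html">Install help</a></div>
  <div class="topic"><a href="t2.html">Kernel news</a></div>
  <div class="ad"><a href="ad.html">Sponsored</a></div>
</body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # Select the topic links with a CSS selector.
  xidel --silent "$workdir/forum.html" --css 'div.topic a'

  # Select the same links with XPath and post-process each one.
  xidel --silent "$workdir/forum.html" --xpath '//div[@class="topic"]/a/upper-case(.)'
else
  echo "xidel is not installed; skipping" >&2
fi
```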

However, when you need to extract several pieces of data at once, repeatedly, from specific sections of pages with a fixed structure (e.g., titles and links of the most viewed topics in a forum), I recommend pattern matching, which I will discuss later.

Syntax-wise, as you will see in the examples I provide later, Xidel extract commands are one-liners that first pass to Xidel the file it should process and then, with the --extract= or -e option, a string that contains the actual operations to perform on the given document. When that string becomes so long that it's difficult to edit it on the command line, or you want to save it, you can write it to a file and pass the file to Xidel with the --extract-file option.
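
A minimal sketch of both forms follows: the same kind of expression is first given inline with -e and then read from a file with --extract-file. The sample page and the file name query.txt are hypothetical. The script skips the Xidel calls if Xidel is not installed.

```shell
#!/bin/bash
# A made-up page and a saved query (hypothetical names and data).
workdir=$(mktemp -d)
cat > "$workdir/page.html" <<'EOF'
<html><head><title>Distro list</title></head>
<body><ul><li>Debian Linux</li><li>FreeBSD</li><li>Arch Linux</li></ul></body></html>
EOF

# Store a longer expression in a file instead of typing it each time.
printf '%s\n' '//li[contains(., "Linux")]' > "$workdir/query.txt"

if command -v xidel >/dev/null 2>&1; then
  # Inline expression: the document first, then the operation.
  xidel --silent "$workdir/page.html" -e '//title'

  # Same document, expression read from the saved file.
  xidel --silent "$workdir/page.html" --extract-file="$workdir/query.txt"
else
  echo "xidel is not installed; skipping" >&2
fi
```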

The option for the follow mode is --follow= or -f. As with extract, this option gives Xidel the expression that describes which element or sequence of elements should be followed. There are many other options for the follow mode, but with one exception they are almost all mirror versions of the extract options (e.g., you can save your commands in a file and pass it to Xidel with --follow-file). The exception, --follow-level, specifies the maximum recursion level when following pages from other pages. Set this carefully, because its default value is 99,999!
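
To try follow mode without depending on a live website, the sketch below builds three tiny local pages and asks Xidel to follow every link on the index page and print the title of each linked page. The page names and content are invented, and I am assuming that Xidel resolves relative links between local files the same way it does for URLs. With a real site, you would pass the URL instead of index.html. The script skips the Xidel call if Xidel is not installed.

```shell
#!/bin/bash
# Build a three-page local "site" (hypothetical pages).
site=$(mktemp -d)
cat > "$site/index.html" <<'EOF'
<html><head><title>Index</title></head><body>
<a href="a.html">Page A</a> <a href="b.html">Page B</a>
</body></html>
EOF
for name in a b; do
  printf '<html><head><title>Page %s</title></head><body></body></html>\n' "$name" \
    > "$site/$name.html"
done

if command -v xidel >/dev/null 2>&1; then
  # Follow each link found on the index page, then extract the title
  # of every page reached that way.
  (cd "$site" && xidel --silent index.html --follow '//a' --extract '//title')
else
  echo "xidel is not installed; skipping" >&2
fi
```

The order of the options matters here: --follow comes before --extract, so the extraction runs on the linked pages rather than on the index page.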
