Transform web pages into EPUB files

Read at Will

© Photo by Gülfer ERGIN

Article from Issue 260/2022
Author(s):

Instead of relying on a third-party read-it-later service, you can use this DIY tool to save articles from the Internet in a format that meets your specific needs.

Few of us have time to read long-form web articles during the day, which is why services that let you save interesting reads for later can come in handy. Popular services such as Pocket and Instapaper even offer apps you can use to read the saved content offline on your preferred device. Better still, the saved articles are reformatted for better readability and scrubbed of all ads, scripts, trackers, and other junk.

Hosted services are like restaurants, though. No matter how great the food and the service, you eventually start longing for home-cooked meals, not only because cooking at home is cheaper and more convenient, but because you can make any dish you wish just the way you like it and have fun in the process. In a similar vein, why settle for a ready-made, read-it-later service, when you can cook up your very own solution with a bit of creative thinking, the right mix of open source tools, and a dash of shell scripting magic? That's exactly what is on today's menu: a DIY read-it-later tool.

Instead of saving and serving slimmed down versions of web pages, this DIY read-it-later application is going to process pages and transform them into ePub files. This way, you can read the saved content on practically any device, and you can choose whatever ebook reading app you like. Because the DIY read-it-later tool is a simple shell script that relies on Linux tools, you don't need a server to host it. If necessary, you can run the tool on a remote Linux machine and serve ePub files via a dedicated Open Publication Distribution System (OPDS) server or simply publish the files on the web. In short, the DIY read-it-later tool gives you plenty of room for experimenting and setting up the solution that works best for your specific needs. Moreover, the fact that an ePub file is essentially a ZIP archive containing an XHTML file along with stylesheets, fonts, and so on makes the saved content future-proof and editable.

Preparatory Work

You don't have to code the DIY read-it-later tool from scratch, because I've already done the hard work for you and published the fruits of my labor, readiculous.sh, on GitHub [1]. All you need to do is download the source code as a ZIP archive and unpack it, or clone the project's Git repository using the command:

git clone https://github.com/dmpop/readiculous.git

Before getting down to the nitty-gritty, you need to do some preparatory work. The first order of business is to install the required software. The main readiculous.sh shell script relies on Pandoc, ImageMagick, jq, wget, and Go-Readability [2]. With the exception of Go-Readability, all of these dependencies are available in the official software repositories of most mainstream Linux distributions, so you can install them using the default package manager. To do this on Debian or an Ubuntu-based distribution, run the command:

sudo apt install pandoc imagemagick jq wget

The source code on GitHub [1] includes a binary version of the Go-Readability tool compiled for the x86_64 architecture. If you plan to use the script on any other platform, or you want to have the very latest version of the tool, you will have to compile it yourself. Fortunately, it's a rather straightforward thing to do. Install the Go language package (use the sudo apt install golang command on Debian and Ubuntu), and then run the following command to compile the command-line version of Go-Readability:

go install github.com/go-shiori/go-readability/cmd/...@latest

Once the compiling process is finished, you'll find the resulting binary in the ~/go/bin directory. Move the binary file into the readiculous directory, and you're done.
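
Before the first run, it can be worth confirming that every dependency is actually reachable. The following is a minimal sketch rather than part of readiculous.sh: the check_tools function name is made up for illustration, and the sketch assumes that you run it from the readiculous directory, where the go-readability binary lives.

```shell
#!/usr/bin/env bash
# check_tools is a hypothetical helper, not part of readiculous.sh.
# It reports every command from its argument list that cannot be found.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "Missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# convert is the ImageMagick command that the script calls
if check_tools pandoc convert jq wget && [ -x ./go-readability ]; then
  echo "All set"
else
  echo "Install the missing tools before running readiculous.sh"
fi
```

Unlike a single combined test, this approach names each missing tool, which saves some guesswork on a freshly installed machine.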

How It Works

The readiculous.sh script (Listing 1) starts working by fetching the desired page, scrubbing it clean, and reformatting it for better readability. To do all that, the script uses the nifty Go-Readability tool. Go-Readability also extracts the page title and passes it to ImageMagick, which creates a cover image with the obtained title. Finally, the Pandoc tool transforms the saved page into an ePub file complete with the generated cover.

Listing 1

readiculous.sh

01 #!/usr/bin/env bash
02 if [ ! -x "$(command -v convert)" ] || [ ! -x "$(command -v pandoc)" ] || [ ! -x "$(command -v jq)" ]; then
03   echo "Make sure that the required tools are installed"
04   exit 1
05 fi
06
07 # Usage prompt
08 usage() {
09   cat <<EOF
10 $0 [OPTIONS]
11 ------
12 $0 transforms web pages into readable EPUB files.
13
14 USAGE:
15 ------
16   $0 -u <URL> -d <dir> -m auto
17
18 OPTIONS:
19 --------
20   -u Source URL
21   -d Destination directory (optional)
22   -m Enable auto mode (optional)
23
24 EXAMPLES:
25 ---------
26 $0 -u https://psyche.co/guides/how-to-approach-the-lifelong-project-of-language-learning -d "Language"
27 $0 -m auto
28
29 EOF
30   exit 1
31 }
32
33 #Read the specified parameters
34 while getopts "u:d:m:" opt; do
35   case ${opt} in
36   u)
37     url=$OPTARG
38     ;;
39   d)
40     dir=$OPTARG
41     ;;
42   m)
43     mode=$OPTARG
44     ;;
45   \?)
46     usage
47     ;;
48   esac
49 done
50 shift $((OPTIND - 1))
51
52 if [ ! -z "$dir" ]; then
53   dir=Library/"$dir"
54 else
55   dir=Library
56 fi
57 mkdir -p "$dir"
58
59 readicule() {
60   # Extract title and image from the specified URL
61   title=$(./go-readability -m "$url" | jq '.title' | tr -d \")
62   # Generate a readable HTML file
63   ./go-readability "$url" >"$dir/$title".html
64   # Generate a cover
65   wget -q https://picsum.photos/800/1024 -O cover.jpg
66   convert -background '#0008' -font Arvo -pointsize 35 -fill white -gravity center -size 800x150 caption:"$title" cover.jpg +swap -gravity south -composite cover.jpg
67   if [ -z "$title" ]; then
68     title="This is Readiculous!"
69   fi
70   # convert HTML to EPUB
71   pandoc -f html -t epub --metadata title="$title" --metadata creator="Readiculous" --metadata publisher="$url" --css=stylesheet.css --epub-cover-image=cover.jpg -o "$dir/$title".epub "$dir/$title".html
72   rm cover.jpg "$dir/$title".html
73   echo
74   echo ">>> '$title' has been saved in '$dir'"
75   echo
76 }
77
78 # If "-m auto" is specified
79 if [ "$mode" = "auto" ]; then
80   file="links.txt"
81   if [ ! -f "$file" ]; then
82     echo "$file not found."
83     exit 1
84   fi
85   # Read the contents of the links.txt file line-by-line
86   while IFS="" read -r url || [ -n "$url" ]; do
87     readicule
88   done <"$file"
89   rm "$file"
90   exit 0
91 fi
92
93 if [ -z "$url" ]; then
94   usage
95 fi
96
97 readicule

The script accepts three parameters: -u, -d, and -m. The mandatory -u parameter specifies the URL of the target page, while the optional -d parameter determines in which subdirectory the resulting ePub file should be saved. If the -d parameter is omitted, the script saves ePub files in the default Library directory. By specifying the subfolder, you can automatically sort the created ePub files by topic (for example, Language, Travel, Long Reads, and so on), or any other criteria. The -m parameter allows you to convert several saved URLs at once, but I'll take a closer look at it later. The script uses a combination of the getopts tool, the do...done loop, and the case in control structure to read the values passed by the specified parameters and assign these values to variables (lines 34-50 in Listing 1). If the default Library directory doesn't exist, the script creates it (lines 52-57).
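
If you want to experiment with the option handling in isolation, the same pattern can be reduced to a few lines. This is a sketch under a couple of assumptions: the parse_args wrapper function is my own addition (the script runs the loop at the top level), and the example.com URL is a placeholder.

```shell
#!/usr/bin/env bash
# parse_args is a hypothetical wrapper around the getopts loop
# used in Listing 1; it fills the url, dir, and mode variables.
parse_args() {
  url="" dir="" mode=""
  OPTIND=1 # reset so the function can be called more than once
  while getopts "u:d:m:" opt; do
    case "$opt" in
      u) url=$OPTARG ;;
      d) dir=$OPTARG ;;
      m) mode=$OPTARG ;;
      *) return 1 ;;
    esac
  done
  # Same outcome as lines 52-56: everything lands under Library
  dir="Library${dir:+/$dir}"
}

parse_args -u "https://example.com/article" -d "Language"
echo "url=$url dir=$dir mode=${mode:-unset}"
```

With the arguments shown, the last line prints url=https://example.com/article dir=Library/Language mode=unset. The colon after each letter in "u:d:m:" tells getopts that the option expects a value, which is then available in $OPTARG.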

Listing 1's readicule() function does the actual work. First, Go-Readability obtains the metadata of the specified page. The metadata is returned in the JSON format, and the jq tool extracts the title, while the tr tool strips double quotes (line 61). The same Go-Readability tool fetches the page using the specified URL and saves the processed version as an HTML file (line 63).
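
Because the extracted title doubles as the file name (lines 63 and 71), a title that contains a slash would point to a subdirectory that doesn't exist. One possible safeguard is to clean up the title before using it in a path. The safe_filename function below is a hypothetical helper, not something the script currently does, and the replacement rules are merely my suggestion:

```shell
#!/usr/bin/env bash
# safe_filename is a hypothetical helper, not part of readiculous.sh.
# It replaces slashes, newlines, and tabs with underscores, trims
# leading dots and spaces as well as trailing spaces, and falls back
# to the script's default title if nothing is left.
safe_filename() {
  name=$(printf '%s' "$1" | tr '/\n\t' '___' | sed 's/^[ .]*//; s/ *$//')
  printf '%s\n' "${name:-This is Readiculous!}"
}

safe_filename "Linux/Unix: A Short History"
safe_filename ""
```

The first call prints Linux_Unix: A Short History, and the second one falls back to This is Readiculous!. To use the helper, you could store its output in a separate variable for the file name while keeping the original $title for the cover and the metadata.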

The next step is to create a cover for use with the ePub file. Strictly speaking, covers are not necessary, but they do make it easier to find the file you need in the library, and they make the ePub file look less bland. To generate a cover, the script uses the wget tool to fetch a random 800x1024 image from the Lorem Picsum service and saves the file as cover.jpg (line 65). Then, the convert tool superimposes the obtained title onto the cover image (line 66).

There are, of course, plenty of other ways to create covers if you don't want the script to rely on a third-party service. For example, you can create covers with random background colors. To do this, you need to tweak the script so that it generates three random numbers between 0 and 255. The convert tool can then use the numbers as red, green, and blue values for generating a cover:

r=$(shuf -i 0-255 -n 1)
g=$(shuf -i 0-255 -n 1)
b=$(shuf -i 0-255 -n 1)
convert -size 800x1024 xc:rgb\($r,$g,$b\) cover.jpg

If solid colors are not your cup of tea, you can use the convert tool to generate a random colorful fractal image and specify the -paint and -blur options for a more artistic effect:

convert -size 800x1024 plasma:fractal -paint 10 -blur 10x20 cover.jpg

Finally, Pandoc finishes the task. It assembles the saved HTML file, the generated cover, and the obtained data into an ePub file and saves it either in the default directory (line 71) or in the subdirectory specified by the -d parameter.

But that's not all. If you read a lot, running the script every time you want to save a page for later can quickly become a nuisance. That's why the script also features the -m parameter. When specified with the auto value, the script picks URLs from the links.txt file one by one and generates an ePub file for each one. The if...then...fi block that starts on line 79 checks whether the $mode value is set to auto. If so, the while...do loop (lines 86-88) reads URLs from the links.txt file and calls the readicule() function to generate ePub files. If the $mode value is not specified, the script simply calls the function to generate an ePub file using the URL passed by the -u parameter.
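
The line-by-line reading technique is easy to try on its own. In the sketch below, the read_links function is a hypothetical stand-in for the loop in Listing 1. Skipping blank lines and lines that start with # is my own addition, which assumes that you'd like to annotate links.txt with comments; the script itself processes every line as is.

```shell
#!/usr/bin/env bash
# read_links is a hypothetical helper, not part of readiculous.sh.
# It prints every URL in the given file, one per line, ignoring
# blank lines and lines that start with #.
read_links() {
  # "|| [ -n "$line" ]" catches a last line without a trailing newline
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      '' | '#'*) continue ;;
    esac
    printf '%s\n' "$line"
  done <"$1"
}

# Try it out with a throwaway file instead of the real links.txt
tmp=$(mktemp)
printf 'https://example.com/a\n\n# read later\nhttps://example.com/b' >"$tmp"
read_links "$tmp"
rm -f "$tmp"
```

The sample run prints the two example.com URLs, including the second one, even though the file doesn't end with a newline. This is the same reason why line 86 of Listing 1 adds the [ -n "$url" ] test after the read command.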

To speed up the process of transforming articles into ePub files, you can create a simple helper script:

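#!/usr/bin/env bash
# Grab the URL from the current X selection
url=$(xclip -o)
echo "$url"
# Stop if the readiculous directory is not where you expect it
cd /path/to/readiculous || exit 1
./readiculous.sh -u "$url"
notify-send "Added to Readiculous"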

Replace /path/to/readiculous with the actual path to the readiculous directory, and save the script under an appropriate name (for example, add-to-readiculous.sh). Install the xclip tool on your system, and assign a keyboard shortcut to the script.
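
Because the helper passes along whatever happens to be selected, it may be a good idea to check that the text at least looks like a web address before handing it over. Here is one way to sketch this. Both function names are hypothetical, and the test is deliberately crude: It only checks for an http:// or https:// prefix followed by at least one character.

```shell
#!/usr/bin/env bash
# is_url and add_from_clipboard are hypothetical helpers.
# is_url succeeds if its argument starts with http:// or https://
# and contains at least one more character.
is_url() {
  case "$1" in
    http://?* | https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Variant of the helper script that refuses anything but URLs.
# It is only defined here; call it from your keyboard shortcut.
add_from_clipboard() {
  url=$(xclip -o)
  if is_url "$url"; then
    cd /path/to/readiculous || return 1
    ./readiculous.sh -u "$url"
    notify-send "Added to Readiculous"
  else
    notify-send "The selection is not a URL"
  fi
}

# Quick demonstration of the check itself
for candidate in "https://example.com/article" "not a link"; do
  if is_url "$candidate"; then
    echo "ok: $candidate"
  else
    echo "skipped: $candidate"
  fi
done
```

The demonstration loop prints ok: https://example.com/article followed by skipped: not a link.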

The Matter of Reading

Saving articles in the ePub format means that you can read them on practically any device and platform. Better yet, if you use Apple Books or Google Play Books, you can take advantage of the features these apps offer, including synchronization across multiple devices, saving highlights, library management functionality, and more.

However, if you've gone to the trouble of rolling out your own read-it-later tool, it probably doesn't make much sense to use a third-party commercial platform for reading. Enter KOReader [3], an open source ebook reader application available for Linux, Android, and a slew of dedicated readers. Despite its deceptively simple interface, KOReader packs an impressive array of features, including syncing, highlights, gesture support, note-taking capabilities, extensions, and much, much more (Figure 1). So if you want to keep your entire read-it-later toolchain open source, you should use KOReader.

Figure 1: KOReader is arguably the most powerful open source ebook reader on any platform.
