Making an online archive of all your bookmarked pages

Preserve Your Favorite Pages

© Lead Image © Roman Motizov, 123RF.com

Article from Issue 232/2020
Author(s):

If you have a large collection of bookmarked pages, it's worth protecting! With the right scripts, you can create an archive so you never lose access to all your favorite web pages.

The World Wide Web is so embedded in our lives that we often forget how ephemeral it is. Search engines can find content, but they can't ensure that the content is easily accessible over time [1]. Bookmarks are useful for storing information about websites, but you never know when the page will change or go offline. The only way to be sure you have permanent access to web content is to archive a copy of the page on your own system.

I used to archive web pages with the Scrapbook extension for Firefox [2]. Today, you can do the same thing with add-ons like WebScrapBook [3] or SingleFile [4], but none of these ready-made solutions offer the scale, flexibility, and ease-of-use I need for my personal web archive. I need a solution that can handle thousands of bookmarks and has the following features:

  • A searchable index of all bookmarks, usable from any device
  • A link to a personal, automatically archived copy of the page referenced in the bookmark

Because no single tool provides all the functionality I need, I have created my own solution using the Shaarli [5] bookmark manager, the ArchiveBox [6] self-hosted web archive tool, and a couple of custom scripts to glue it all together.

Shaarli and ArchiveBox

Shaarli describes itself as a "minimalist… database-free bookmarking service." Figures 1 and 2 show the Shaarli main window and picture wall with the default theme.

Figure 1: The Shaarli bookmark manager, as it appears to anonymous users.
Figure 2: Shaarli can also display bookmarks as a snapshot gallery.

Shaarli needs a web server configured to serve PHP pages over HTTPS connections. Once that requirement is satisfied, just upload the Shaarli files where the web server can find them, and then load the Shaarli home page in your browser. Set a user name and password, and then enable the REST API. Finally, drag the Shaarli bookmarklet to your browser's toolbar. The API setting makes it possible to export bookmarks, and the bookmarklet opens the Shaarli pop-up window shown in Figure 3.

Figure 3: The Shaarli pop-up window to add and tag new bookmarks.

ArchiveBox is a Python-based, command-line front end to the wget and youtube-dl downloaders, the pywb "web recorder," and a headless version of the Chromium browser. To use ArchiveBox, just give it the URLs you want it to archive. ArchiveBox will save copies of the web pages in several formats, including PDF (with JavaScript and multimedia), a single-file WARC archive, and graphic versions. ArchiveBox also generates one index per page of all the formats it archives (see Figure 4), as well as a general index.

Figure 4: ArchiveBox lists all of one web page's archived formats: HTML, PDF, screenshot, and more.

You can install ArchiveBox on Linux in several ways, all well documented on the website, but its dependencies may be a problem. To work well, ArchiveBox needs relatively recent versions of all dependencies. If all the necessary binary packages are not available in the standard repositories of your Linux distribution, the whole process may take more time than you can afford. The alternative to installing all these dependencies separately is to use the official Docker container for ArchiveBox. The script described in this article uses the Docker version, but it will work with native installations of ArchiveBox with changes to one or two lines of code.

Gluing It All Together

Shaarli stores bookmarks as one huge, encoded string inside one of its own source files. Listing 1 is a wrapper script for the shaarli client [7] that lets you export Shaarli bookmarks in JSON format.

Listing 1

The Shaarli Bookmarks Export Script

01 #! /bin/bash
02
03 D=`date +%Y%m%d%H%M%S`
04 source /root/scripts/virtualenvs/shaarli3/bin/activate
05 shaarli -c /root/scripts/shaarli.conf -f json get-links --limit all > $D-shaarli.json
06 exit

Line 3 of Listing 1 saves the current date in the $D variable. Line 4 loads the Python virtual environment in which the Shaarli client needs to work. Line 5 runs the client, telling it to obtain a configuration file from shaarli.conf and save all the bookmarks in JSON format [8]. The result is a file with a name like 20190915113030-shaarli.json that contains one huge JSON string. A snippet of the file looks similar to the following:

[{"id": 2109,
  "url": "https://example.com/some/web/page/",
  "shorturl": "7IewwA",
  "title": "Web Page Title", ....

This snippet shows all you need to know about the Shaarli JSON format: All the fields of each bookmark ("id", "url", "title", etc.) are stored as key-value pairs, with the keys enclosed in double quotes and separated from their values by colons.

The key element used to integrate Shaarli with ArchiveBox is the unique numeric identifier ("id" in the snippet above) that Shaarli assigns to each bookmark. The internal Shaarli link to edit bookmark number 2109 would have the format ?edit_link=2109.

The Shaarli code that displays those strings appears in the template file shaarli/tpl/default/linklist.html:

<a href="?edit_link={$value.id}" title="{$strEdit}"><i class="fa fa-pencil-square-o edit-link"></i></a>

The task is to add a LOCAL COPY link (Figure 5) that points to a folder with the same name as the ID value (Figure 6):

<a href="?edit_link={$value.id}" title="{$strEdit}"></a><a href="http://example.com/webarchive/{$value.id}/" target="_blank">LOCAL COPY</a>

Figure 5: The Edit icon right below each Shaarli bookmark, and the added link to its LOCAL COPY.
Figure 6: A web page archived with a local URL that matches its Shaarli identifier.

Because the icon to edit bookmarks is displayed only after the user has logged in, placing the extra code right after the code for the edit icon ensures that the links to each LOCAL COPY are visible only to logged-in users.

At this point, the script in Listing 2 is all you need to actually archive bookmarks inside folders that Shaarli can link to automatically.

Listing 2

shaarlibox

01 #! /bin/bash
02
03 ARCHIVEBOXHOME=/root/archiveboxsandbox
04 SHAARLIHOME='/var/www/html/webarchive'
05 BOOKMARKSLIMIT=10
06 NOW=`date +%Y%m%d%H%M`
07 LOG="$NOW-shaarlibox.log"
08 SHAARLI_JSON=$1  # shaarli bookmarks in JSON format
09
10 rm -rf $ARCHIVEBOXHOME
11 mkdir -p  $ARCHIVEBOXHOME
12 chmod 777 $ARCHIVEBOXHOME
13 cd   $ARCHIVEBOXHOME
14
15 perl -e 'while (<>) {s/"id": (\d+?),.*?"url": "([^"]+?)"/\nLINK|$2|$1\n/g; print}' $SHAARLI_JSON | \
16 grep ^LINK | cut '-d|' -f2- | sort -t '|' -k 2 -n -r > newest_bookmarks_first.csv
17
18 ADDED_BOOKMARKS=1
19
20 while IFS= read -r line
21 do
22 BOOKMARKNUM=`echo $line | cut '-d|' -f2`
23 CURRENT_BOOKMARK=`echo $line | cut '-d|' -f1`
24 printf "SB: \nSB: \nSB: %7.7s bookmark %9.9s : %s;\n" $BOOKMARKNUM "$ADDED_BOOKMARKS/$BOOKMARKSLIMIT" $CURRENT_BOOKMARK >> $LOG
25 printf   "SB: %7.7s edit: https://bookmarks.zona-m.net/?edit_link=%s\n\n" $BOOKMARKNUM $BOOKMARKNUM >> $LOG
26 if [ -d $SHAARLIHOME/$BOOKMARKNUM ]
27 then
28    printf "SB: %7.7s already archived: %s;\n" $BOOKMARKNUM $CURRENT_BOOKMARK >> $LOG
29 else
30    printf "SB: %7.7s now archiving to   : %s\n" $BOOKMARKNUM "$SHAARLIHOME/$BOOKMARKNUM"  >> $LOG
31    echo $CURRENT_BOOKMARK > url_list.csv
32
33    cat url_list.csv  | docker run -i -v $ARCHIVEBOXHOME:/data nikisweeting/archivebox env WGET_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" env FETCH_WARC=False  /bin/archive &> archive2shaarli.$BOOKMARKNUM.log
34
35    COREFILES=`find . -type f -name "core.*" | perl -e 'while (<>) {print if m/core\.\d+$/}' | wc -l`
36    if [[ "$COREFILES" -gt "0" ]]
37    then
38   printf "SB: %7.7s WARNING! CORE(s) in: %s\n" $BOOKMARKNUM "$SHAARLIHOME/$BOOKMARKNUM"  >> $LOG
39   rm -f archive/*/core.* ; rm -f archive/core.*
40    fi
41    printf "\n##############################################\n\n" >> $LOG
42    cat archive2shaarli.$BOOKMARKNUM.log >> $LOG
43    printf "\n##############################################\n\n" >> $LOG
44
45    ARCHIVEBOXFOLDER=`find archive -type d | grep / | cut -d/ -f2 | sort | uniq`
46
47    FAILURE=`grep -c '<title>Not yet archived...</title>' archive/$ARCHIVEBOXFOLDER/index.html`
48    if [[ "$FAILURE" -ge "1" ]]
49    then
50   printf "SB: %7.7s failed to archive %s\n" $BOOKMARKNUM $CURRENT_BOOKMARK >> $LOG
51   mkdir $SHAARLIHOME/$BOOKMARKNUM
52     echo "<html><head><title>Failed to archive $CURRENT_BOOKMARK</title></head><body><pre>" > $SHAARLIHOME/$BOOKMARKNUM/index.html
53   cat archive2shaarli.$BOOKMARKNUM.log >> $SHAARLIHOME/$BOOKMARKNUM/index.html
54     echo '</pre></body></html>'  >> $SHAARLIHOME/$BOOKMARKNUM/index.html
55   rm -rf  archive/$ARCHIVEBOXFOLDER
56    else
57   printf "SB: %7.7s moving copy from   : archive/$ARCHIVEBOXFOLDER to $SHAARLIHOME/$BOOKMARKNUM\n" $BOOKMARKNUM >> $LOG
58   mv archive/$ARCHIVEBOXFOLDER $SHAARLIHOME/$BOOKMARKNUM
59   perl -pi.bak -e 's/\.\.\/\.\.\/static\//..\/static\//g' $SHAARLIHOME/$BOOKMARKNUM/index.html
60   rm $SHAARLIHOME/$BOOKMARKNUM/index.html.bak
61   mv archive2shaarli.$BOOKMARKNUM.log $SHAARLIHOME/$BOOKMARKNUM/
62     fi
63    printf "SB: %7.7s see local copy: http://zona-m.net/marco/webarchive/%s\n" $BOOKMARKNUM $BOOKMARKNUM  >> $LOG
64    ((ADDED_BOOKMARKS++))
65    rm -rf sources static index.html index.json robots.txt
66 fi
67 if [[ "$ADDED_BOOKMARKS" -gt "$BOOKMARKSLIMIT" ]]
68 then
69    break
70 fi
71 done < newest_bookmarks_first.csv
72 mv $LOG ../
73 exit

The main phases of the shaarlibox script in Listing 2 are as follows:

  1. Set or load configuration variables (lines 3 to 8).
  2. Create a "sandbox" for the container to run in, and make it writable by everybody (lines 10 to 13).
  3. Load the ID numbers and URLs of all bookmarks (lines 15 and 16).
  4. Archive all those URLs, one at a time, with the official Docker image for ArchiveBox (the loop in lines 20 to 71).
  5. Move the main logfile to the parent folder and exit (lines 72 and 73).

The sandbox is the $ARCHIVEBOXHOME folder, which is recreated from scratch every time the script runs. Making that folder world-writable (line 12) is necessary because the container creates folders and files with its own user IDs, which may differ from the ID of the user who runs the script.

$SHAARLIHOME is a web-accessible folder where the archived copies are saved. You can give it any value – as long as it corresponds to the example.com/webarchive string you added to the Shaarli template file.

$ADDED_BOOKMARKS in line 18 counts the bookmarks processed in the current run, starting from 1. Lines 67 to 70 end the loop as soon as that counter exceeds the value of $BOOKMARKSLIMIT set in line 5, that is, after $BOOKMARKSLIMIT new bookmarks have been processed. I strongly suggest that you set this variable to a very low value, say 10 or 20, the first time you run the script. In this way, you can quickly get an idea of how much time and space it would take to archive everything.

The main logfile ($LOG) has a unique prefix, corresponding to the date and time (to the minute) when the script starts (lines 6 and 7). Finally, $SHAARLI_JSON is the file generated by the Shaarli export script in Listing 1, which is passed as the first argument to shaarlibox.
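
Because the export file name starts with a timestamp, the newest export is simply the last one in lexical order. The following is a minimal sketch of how the two scripts could be chained; the function name latest_export, the /root/exports folder, and the shaarlibox.sh file name are assumptions made for illustration, not part of the original scripts.

```shell
#!/bin/bash

# Hypothetical helper: print the newest "<timestamp>-shaarli.json" file
# found directly inside the given folder (default: current folder).
# The YYYYmmddHHMMSS prefix makes lexical order the same as chronological order.
latest_export() {
    local dir=${1:-.}
    # Prints nothing if the folder contains no export file
    find "$dir" -maxdepth 1 -type f -name '*-shaarli.json' | sort | tail -n 1
}

# Example usage (paths and script name are assumptions):
#   ./shaarlibox.sh "$(latest_export /root/exports)"
```

Passing the file explicitly, rather than hard-coding its name, keeps shaarlibox usable with older exports as well.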

The pipeline of text-processing commands in Phase 3 works as follows: First, perl, grep, and cut extract the id and url fields of all bookmarks from the JSON file. Their final output is then sorted in reverse numerical order (placing the most recent bookmarks first), using the id numbers as keys. To see what the options of each command mean, please consult the corresponding man pages. The final result, which is saved in the file newest_bookmarks_first.csv, has this format (the URLs are trimmed for clarity):

https://www.zdnet.com/article/linux-foundation..../|296
https://www.linux.com/audience/devops/...|295
https://www.linuxuprising.com/2019/08/...|294
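
To experiment with this step in isolation, the same URL|ID list can be produced with grep and sed alone. This is only a sketch, not the pipeline from Listing 2: the function name extract_bookmarks is made up, and it assumes that "url" immediately follows "id" in every record (as in the snippet shown earlier) and that no URL contains a | character.

```shell
#!/bin/bash

# Hypothetical helper: print "URL|ID" lines, highest ID first,
# from a Shaarli JSON export passed as first argument.
extract_bookmarks() {
    # 1. Isolate every '"id": N, "url": "..."' fragment, one per line
    # 2. Rewrite each fragment as URL|ID
    # 3. Sort numerically on the second field, in reverse order
    grep -oE '"id": [0-9]+, "url": "[^"]+"' "$1" |
        sed -E 's/"id": ([0-9]+), "url": "([^"]+)"/\2|\1/' |
        sort -t '|' -k 2 -n -r
}

# Example usage:
#   extract_bookmarks 20190915113030-shaarli.json > newest_bookmarks_first.csv
```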

The main loop (lines 20 to 71) reads that file, one line at a time, saving the current ID and URL values in $BOOKMARKNUM and $CURRENT_BOOKMARK (lines 22 and 23). All the printf statements in the loop write what is happening to one main log file ($LOG) for later analysis and debugging. Lines 26 to 28 check if $SHAARLIHOME already contains a sub-folder called $BOOKMARKNUM, which means that the bookmark was already archived in a previous run, so nothing else happens. Otherwise, the script writes the $CURRENT_BOOKMARK inside the file url_list.csv (line 31) that will serve as the input for ArchiveBox.
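
As an aside, Bash can split each URL|ID line by itself, without calling cut twice per line. The sketch below shows that variant together with the "already archived" check; the function name list_pending is hypothetical, and the code again assumes that URLs never contain a | character.

```shell
#!/bin/bash

# Hypothetical helper: print "ID URL" for every bookmark in the CSV file ($1)
# that does not yet have a folder named after its ID inside $2.
list_pending() {
    local csv=$1 shaarlihome=$2 url id
    # IFS='|' makes read split each line into the two fields
    while IFS='|' read -r url id
    do
        # A folder named after the ID means the bookmark was archived before
        if [ -d "$shaarlihome/$id" ]
        then
            continue
        fi
        printf '%s %s\n' "$id" "$url"
    done < "$csv"
}

# Example usage:
#   list_pending newest_bookmarks_first.csv /var/www/html/webarchive
```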

You might wish to modify the script to write, say, 10 or 20 URLs at a time inside url_list.csv and then fetch them all with just one call of the ArchiveBox container. Calling that container once per bookmark slows the script down. However, if the script downloads just one bookmark per container call, with $BOOKMARKSLIMIT set to a very low value, it is much quicker to check if you are using the best combination of options for ArchiveBox before letting it loose on all your bookmarks.

The actual archiving happens in line 33. This command, which was copied from the documentation, downloads the official container image of ArchiveBox and all its dependencies and runs it with Docker. My only additions are the optional settings WGET_USER_AGENT and FETCH_WARC. The first tells the wget download utility inside the container to identify itself as a desktop browser whenever it visits a website. This step is necessary because some websites will refuse to serve pages to automatic downloaders. The FETCH_WARC setting prevents ArchiveBox from creating WARC archives of each bookmark, to save some disk space.

Just like Shaarli, ArchiveBox gives each URL it processes a unique numeric identifier (which I decided to call $ARCHIVEBOXFOLDER) and saves all the copies in a folder with the same name, inside the local archive directory. Since only one URL is processed every time, whatever subfolder there is in archive is named for the numeric identifier, and finding its value is the task of line 45.
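
The assumption that archive contains exactly one subfolder can also be checked explicitly. The following sketch is a more defensive variant of that lookup, not the code from Listing 2; the function name single_subfolder is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: print the name of the only subfolder of the
# directory given as first argument; fail if there is not exactly one.
single_subfolder() {
    local archive=$1 folders count
    # List only the direct subfolders of the archive directory
    folders=$(find "$archive" -mindepth 1 -maxdepth 1 -type d)
    # Count the non-empty lines; "|| true" keeps an empty result from
    # aborting scripts that run with "set -e"
    count=$(printf '%s' "$folders" | grep -c . || true)
    if [ "$count" -ne 1 ]
    then
        echo "expected exactly one folder in $archive, found $count" >&2
        return 1
    fi
    basename "$folders"
}

# Example usage:
#   ARCHIVEBOXFOLDER=$(single_subfolder archive) || exit 1
```

Failing loudly here makes leftover folders from an interrupted run visible, instead of letting them end up in the wrong bookmark's copy.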

When ArchiveBox archives a web page inside $ARCHIVEBOXFOLDER, the title of the index.html file inside that directory contains that page's URL if the archive process succeeds; otherwise, it is set to Not yet archived.... Line 47 counts the occurrences of that string in the index.html file and stores the result in $FAILURE. If the count is 1 or more, the code in lines 49 to 55 creates the folder $SHAARLIHOME/$BOOKMARKNUM and saves the ArchiveBox logfile inside it, in HTML format. In this way, if I click on the LOCAL COPY link for the bookmark inside Shaarli, I will see what happened (Figure 7), even if I missed it in the general $LOG file.
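
The failure test can be tried on its own with a small function like the one below. It is only a sketch: the function name archive_failed is hypothetical, and the placeholder title is the one that Listing 2 searches for, which may differ in other ArchiveBox versions.

```shell
#!/bin/bash

# Hypothetical helper: exit status 0 if the index.html file passed as
# first argument still carries the placeholder title, 1 otherwise.
archive_failed() {
    # -F searches for a fixed string, so the dots are not regex wildcards
    grep -q -F '<title>Not yet archived...</title>' "$1"
}

# Example usage:
#   if archive_failed "archive/$ARCHIVEBOXFOLDER/index.html"
#   then
#      echo "archiving failed"
#   fi
```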

Figure 7: The error page generated by the script when ArchiveBox cannot save a bookmark.

If $FAILURE is zero, ArchiveBox succeeded, and all the versions of the current bookmark are saved and indexed inside archive/$ARCHIVEBOXFOLDER. Aside from logging the event, all that is left to do (lines 56 to 66) is to move archive/$ARCHIVEBOXFOLDER to $SHAARLIHOME/$BOOKMARKNUM (which is where the LOCAL COPY link in Shaarli points), fix some relative links in the index.html file (lines 59 and 60, see below for details), increment the $ADDED_BOOKMARKS counter, and remove all the auxiliary files left by ArchiveBox (line 65). This last step is necessary because if the next run of ArchiveBox found those files, it would create extra folders, which would break the simple command used in line 45 to find $ARCHIVEBOXFOLDER.

The first time ArchiveBox runs, or believes it is running for the first time, it creates one static folder, at the same level as the archive folder, where it puts all the icons and CSS files used by the indexes of each page. Therefore, all the links to those resources inside each index.html file have the form ../../static/. Since I needed the static folder to be inside $SHAARLIHOME, all those strings must change from ../../static to ../static, which is the purpose of the Perl command in line 59.
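
For readers who prefer sed to Perl, the same substitution could look like the sketch below. The function name fix_static_links is hypothetical; using # as the sed delimiter avoids having to escape every slash, and the temporary file sidesteps the differences between the in-place options of GNU and BSD sed.

```shell
#!/bin/bash

# Hypothetical helper: rewrite every "../../static/" link inside the
# file given as first argument to "../static/".
fix_static_links() {
    local file=$1 tmp
    tmp=$(mktemp) || return 1
    # Overwrite the original only if sed succeeded
    if sed 's#\.\./\.\./static/#../static/#g' "$file" > "$tmp"
    then
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}

# Example usage:
#   fix_static_links "$SHAARLIHOME/$BOOKMARKNUM/index.html"
```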

General Setup and Maintenance

The script in Listing 2 makes one ArchiveBox container call per bookmark. This makes it noticeably slower, but much simpler and much more future-proof. As is, if you ever want to replace ArchiveBox with another archiver, you only have to rewrite two lines of code (33 and 45)! Besides, if ArchiveBox processed 10 or 20 bookmarks in each run it would be more complicated to match them with their Shaarli identifiers, and much easier to fill your hard drive, or miss downloading errors.

Due to lack of space, I can only briefly discuss three possible improvements that will strengthen this bookmark archiving system: privacy, backups, and disk space. Shaarli bookmarks can be all private by default, but the archive itself will be private only if you password-protect the $SHAARLIHOME folder at the web server level. For backups, the only specific information you need is which Shaarli files to back up [9].

Using the options in Listing 2, archiving about 2,100 bookmarks created over 230K files on my server, requiring more than 23GB. You can save lots of space, however, by choosing an efficient archive format and using a tool like rdfind [10] to replace duplicate files with hard links.
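
Figures like these can be roughly predicted from a trial run with a low $BOOKMARKSLIMIT by scaling the space used so far to the total number of bookmarks. The sketch below does this with integer arithmetic; both function names are hypothetical, and the result is only a linear estimate, since pages vary widely in size.

```shell
#!/bin/bash

# Hypothetical helper: extrapolate_mb USED_KB SAMPLE TOTAL
# Prints the estimated space in MB for TOTAL bookmarks, given that
# SAMPLE bookmarks took USED_KB kilobytes.
extrapolate_mb() {
    local used_kb=$1 sample=$2 total=$3
    if [ "$sample" -le 0 ]
    then
        echo "sample size must be positive" >&2
        return 1
    fi
    # Integer division: the result is rounded down
    echo $(( used_kb * total / sample / 1024 ))
}

# Hypothetical helper: estimate_archive_mb FOLDER SAMPLE TOTAL
# Measures FOLDER with du (in kilobytes) and extrapolates from it.
estimate_archive_mb() {
    local used_kb
    used_kb=$(du -sk "$1" | cut -f1)
    extrapolate_mb "$used_kb" "$2" "$3"
}

# Example usage, after a trial run with 10 bookmarks out of 2,100:
#   estimate_archive_mb /var/www/html/webarchive 10 2100
```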

Long-term maintenance of the whole system is relatively simple, but necessary. On the ArchiveBox side, you might need to update the shaarlibox script if new versions use different command-line options or directory structures. With Shaarli, whenever you upgrade to a new version, you will need to manually patch the template.
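
A quick check after each Shaarli upgrade can tell you whether the template still contains the added link. This is a minimal sketch under the assumption that the patch uses the LOCAL COPY label shown earlier; the function name check_template is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: report whether the template file passed as
# first argument still contains the LOCAL COPY link.
check_template() {
    local template=$1
    if grep -q -F 'LOCAL COPY' "$template"
    then
        echo "patched: $template"
    else
        echo "NOT patched: $template" >&2
        return 1
    fi
}

# Example usage:
#   check_template shaarli/tpl/default/linklist.html
```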
