Making an online archive of all your bookmarked pages
Preserve Your Favorite Pages
If you have a large collection of bookmarked pages, it's worth protecting! With the right scripts, you can create an archive so you never lose access to all your favorite web pages.
The World Wide Web is so embedded in our lives that we often forget how ephemeral it is. Search engines can find content, but they can't ensure that the content is easily accessible over time [1]. Bookmarks are useful for storing information about websites, but you never know when the page will change or go offline. The only way to be sure you have permanent access to web content is to archive a copy of the page on your own system.
I used to archive web pages with the Scrapbook extension for Firefox [2]. Today, you can do the same thing with add-ons like WebScrapBook [3] or SingleFile [4], but none of these ready-made solutions offer the scale, flexibility, and ease-of-use I need for my personal web archive. I need a solution that can handle thousands of bookmarks and has the following features:
- A searchable index of all bookmarks, usable from any device
- A link to a personal, automatically archived copy of the page referenced in the bookmark
Because no single tool provides all the functionality I need, I have created my own solution using the Shaarli [5] bookmark manager, the ArchiveBox [6] self-hosted web archive tool, and a couple of custom scripts to glue it all together.
Shaarli and ArchiveBox
Shaarli describes itself as a "minimalist… database-free bookmarking service." Figures 1 and 2 show the Shaarli main window and picture wall with the default theme.
Shaarli needs a web server configured to serve PHP pages over HTTPS connections. Once that requirement is satisfied, just upload the Shaarli files where the web server can find them, and then load the Shaarli home page in your browser. Set a user name and password, enable the REST API, and drag the Shaarli bookmarklet to your browser's toolbar. The API setting makes it possible to export bookmarks, and the bookmarklet opens the Shaarli pop-up window shown in Figure 3.
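The export script shown later in Listing 1 talks to this REST API through the python-shaarli-client package. The following is a minimal sketch of the setup that Listing 1 assumes; the paths, instance URL, and API secret are placeholders, and you should check the client's documentation [7] for the exact configuration format:

# Install the Shaarli API client in its own Python virtual environment
python3 -m venv /root/scripts/virtualenvs/shaarli3
source /root/scripts/virtualenvs/shaarli3/bin/activate
pip install shaarli-client

# Minimal client configuration; url and secret are placeholders.
# Listing 1 passes this file to the client with the -c option.
cat > /root/scripts/shaarli.conf <<'EOF'
[shaarli]
url = https://bookmarks.example.com
secret = YOUR_REST_API_SECRET
EOF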
ArchiveBox is a Python-based, command-line front end to the wget and youtube-dl downloaders, the pywb "web recorder," and a headless version of the Chromium browser. To use ArchiveBox, just give it all the URLs you want it to archive. ArchiveBox will save copies of the web pages in several formats, including PDF (with JavaScript and multimedia), a single-file WARC archive, and graphic versions. ArchiveBox also generates one index per page of all the formats it archives (see Figure 4), as well as a general index.
You can install ArchiveBox on Linux in several ways, all well documented on the website, but its dependencies may be a problem. To work well, ArchiveBox needs relatively recent versions of all dependencies. If the necessary binary packages are not all available in the standard repositories of your Linux distribution, the whole process may take more time than you can afford. The alternative to installing all these dependencies separately is to use the official Docker container for ArchiveBox. The script described in this article uses the Docker version, but it will work with native installations of ArchiveBox with changes to one or two lines of code.
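Before wiring the container into any script, it is worth testing it on a single URL. This sketch uses the same invocation as line 33 of Listing 2 below, just with a throwaway sandbox folder (the path is an example):

# Test the ArchiveBox container on one URL in a disposable sandbox
mkdir -p /tmp/abox-test && cd /tmp/abox-test
echo 'https://example.com/' | docker run -i -v /tmp/abox-test:/data nikisweeting/archivebox /bin/archive
# The archived copies appear in ./archive/<identifier>/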
Gluing It All Together
Shaarli stores bookmarks as one huge, encoded string inside one of its own source files. Listing 1 is a wrapper script for the shaarli client [7] that lets you export Shaarli bookmarks in JSON format.
Listing 1
The Shaarli Bookmarks Export Script
01 #! /bin/bash
02
03 D=`date +%Y%m%d%H%M%S`
04 source /root/scripts/virtualenvs/shaarli3/bin/activate
05 shaarli -c /root/scripts/shaarli.conf -f json get-links --limit all > $D-shaarli.json
06 exit
Line 3 of Listing 1 saves the current date in the $D variable. Line 4 loads the Python virtual environment in which the Shaarli client needs to work. Line 5 runs the client, telling it to obtain a configuration file from shaarli.conf and save all the bookmarks in JSON format [8]. The result is a file with a name like 20190915113030-shaarli.json that contains one huge JSON string. A snippet of the file looks similar to the following:
[{"id": 2109, "url": "https:flfl //example.com/some/web/page/",flfl "shorturl": "7IewwA", flfl "title": "Web Page Title", ....
This snippet shows all you need to know about the Shaarli JSON format: All the variables of each bookmark ("id", "url", "title", etc.) are stored as key-value pairs, enclosed in double quotes, and separated by colons.
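Because the whole export lands on a single line, pretty-printing it makes manual inspection much easier. Assuming Python is installed, for example:

# Pretty-print the first few lines of the exported bookmarks
python3 -m json.tool 20190915113030-shaarli.json | head -n 12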
The key element used to integrate Shaarli with ArchiveBox is the unique numeric identifier ("id" in the snippet above) that Shaarli assigns to each bookmark. The internal Shaarli link to edit bookmark number 2109 would have the format ?edit_link=2109.
The Shaarli code that displays those strings appears in the template file shaarli/tpl/default/linklist.html:
<a href="?edit_link={$value.id}" flfl title="{$strEdit}"><i class="fa flfl fa-pencil-square-o edit-link"></a>
The task is to add a LOCAL COPY link (Figure 5) that points to a folder with the same name as the ID value (Figure 6):
<a href="?edit_link={$value.id}" title="{$strEdit}"></a><ahref="http://example.com/webarchive/{$value.id}/"target="_blank">LOCAL COPY</a>
Because the icon to edit bookmarks is displayed only after the user has logged in, placing the extra code right after the code for the edit icon makes sure that the links to each LOCAL COPY are visible only to logged-in users.
At this point, the script in Listing 2 is all you need to actually archive bookmarks inside folders that Shaarli can link to automatically.
Listing 2
shaarlibox
01 #! /bin/bash
02
03 ARCHIVEBOXHOME=/root/archiveboxsandbox
04 SHAARLIHOME='/var/www/html/webarchive'
05 BOOKMARKSLIMIT=10
06 NOW=`date +%Y%m%d%H%M`
07 LOG="$NOW-shaarlibox.log"
08 SHAARLI_JSON=$1 # shaarli bookmarks in JSON format
09
10 rm -rf $ARCHIVEBOXHOME
11 mkdir -p $ARCHIVEBOXHOME
12 chmod 777 $ARCHIVEBOXHOME
13 cd $ARCHIVEBOXHOME
14
15 perl -e 'while (<>) {s/"id": (\d+?),.*?"url": "([^"]+?)"/\nLINK|$2|$1\n/g; print}' $SHAARLI_JSON | \
16 grep ^LINK | cut '-d|' -f2- | sort -t '|' -k 2 -n -r > newest_bookmarks_first.csv
17
18 ADDED_BOOKMARKS=1
19
20 while IFS= read -r line
21 do
22   BOOKMARKNUM=`echo $line | cut '-d|' -f2`
23   CURRENT_BOOKMARK=`echo $line | cut '-d|' -f1`
24   printf "SB: \nSB: \nSB: %7.7s bookmark %9.9s : %s;\n" $BOOKMARKNUM "$ADDED_BOOKMARKS/$BOOKMARKSLIMIT" $CURRENT_BOOKMARK >> $LOG
25   printf "SB: %7.7s edit: https://bookmarks.zona-m.net/?edit_link=%s\n\n" $BOOKMARKNUM $BOOKMARKNUM >> $LOG
26   if [ -d $SHAARLIHOME/$BOOKMARKNUM ]
27   then
28     printf "SB: %7.7s already archived: %s;\n" $BOOKMARKNUM $CURRENT_BOOKMARK >> $LOG
29   else
30     printf "SB: %7.7s now archiving to : %s\n" $BOOKMARKNUM "$SHAARLIHOME/$BOOKMARKNUM" >> $LOG
31     echo $CURRENT_BOOKMARK > url_list.csv
32
33     cat url_list.csv | docker run -i -v $ARCHIVEBOXHOME:/data nikisweeting/archivebox env WGET_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" env FETCH_WARC=False /bin/archive &> archive2shaarli.$BOOKMARKNUM.log
34
35     COREFILES=`find . -type f -name "core.*" | perl -e 'while (<>) {print if m/core\.\d+$/}' | wc -l`
36     if [[ "$COREFILES" -gt "0" ]]
37     then
38       printf "SB: %7.7s WARNING! CORE(s) in: %s\n" $BOOKMARKNUM "$SHAARLIHOME/$BOOKMARKNUM" >> $LOG
39       rm -f archive/*/core.* ; rm -f archive/core.*
40     fi
41     printf "\n##############################################\n\n" >> $LOG
42     cat archive2shaarli.$BOOKMARKNUM.log >> $LOG
43     printf "\n##############################################\n\n" >> $LOG
44
45     ARCHIVEBOXFOLDER=`find archive -type d | grep / | cut -d/ -f2 | sort | uniq`
46
47     FAILURE=`grep -c '<title>Not yet archived...</title>' archive/$ARCHIVEBOXFOLDER/index.html`
48     if [[ "$FAILURE" -ge "1" ]]
49     then
50       printf "SB: %7.7s failed to archive %s\n" $BOOKMARKNUM $CURRENT_BOOKMARK >> $LOG
51       mkdir $SHAARLIHOME/$BOOKMARKNUM
52       echo "<html><head><title>Failed to archive $CURRENT_BOOKMARK</title></head><body><pre>" > $SHAARLIHOME/$BOOKMARKNUM/index.html
53       cat archive2shaarli.$BOOKMARKNUM.log >> $SHAARLIHOME/$BOOKMARKNUM/index.html
54       echo '</pre></body></html>' >> $SHAARLIHOME/$BOOKMARKNUM/index.html
55       rm -rf archive/$ARCHIVEBOXFOLDER
56     else
57       printf "SB: %7.7s moving copy from : archive/$ARCHIVEBOXFOLDER to $SHAARLIHOME/$BOOKMARKNUM\n" $BOOKMARKNUM >> $LOG
58       mv archive/$ARCHIVEBOXFOLDER $SHAARLIHOME/$BOOKMARKNUM
59       perl -pi.bak -e 's/\.\.\/\.\.\/static\//..\/static\//g' $SHAARLIHOME/$BOOKMARKNUM/index.html
60       rm $SHAARLIHOME/$BOOKMARKNUM/index.html.bak
61       mv archive2shaarli.$BOOKMARKNUM.log $SHAARLIHOME/$BOOKMARKNUM/
62     fi
63     printf "SB: %7.7s see local copy: http://zona-m.net/marco/webarchive/%s\n" $BOOKMARKNUM $BOOKMARKNUM >> $LOG
64     ((ADDED_BOOKMARKS++))
65     rm -rf sources static index.html index.json robots.txt
66   fi
67   if [[ "$ADDED_BOOKMARKS" -gt "$BOOKMARKSLIMIT" ]]
68   then
69     break
70   fi
71 done < newest_bookmarks_first.csv
72 mv $LOG ../
73 exit
The main phases of the shaarlibox script in Listing 2 are as follows:
- Set or load configuration variables (lines 3 to 8).
- Create a "sandbox" for the container to run into, and make it writable to everybody (lines 10 and 12).
- Load the ID numbers and URLs of all bookmarks (lines 15 and 16).
- Archive all those URLs, one at a time, with the official Docker image for ArchiveBox (the loop in lines 20 to 71).
- Move the main logfile to the parent folder and exit (lines 72 and 73).
The sandbox is the $ARCHIVEBOXHOME folder, which is recreated from scratch every time the script runs. Making that folder world-writable (line 12) is necessary because the container creates folders and files with user IDs different from your own.
$SHAARLIHOME is a web-accessible folder where the archived copies are saved. You can give it any value – as long as it corresponds to the example.com/webarchive string you added to the Shaarli template file.
$ADDED_BOOKMARKS in line 18 counts how many bookmarks have been archived. Lines 67 to 70 terminate the script as soon as that counter exceeds the value of $BOOKMARKSLIMIT set in line 5. I strongly suggest that you set this variable to a very low value, say 10 or 20, the first time you run the script. In this way, you can quickly get an idea of how much time and space it would take to archive everything.
The main logfile ($LOG) has a unique prefix, corresponding to the date and hour when the script starts (lines 6 and 7). Finally, $SHAARLI_JSON is the file generated by the shaarli backup script in Listing 1, which is passed as the first argument to shaarlibox.
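Putting the two listings together, a typical session looks like this (the script file names are my own inventions; adapt them to wherever you saved Listings 1 and 2):

./shaarli-export.sh                       # Listing 1: creates e.g. 20190915113030-shaarli.json
./shaarlibox 20190915113030-shaarli.json  # Listing 2: archives the newest bookmarks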
The pipeline of text-processing commands in Phase 3 works as follows: First, perl, grep, and cut extract the id and url fields of all bookmarks from the JSON file. Their final output is then sorted in reverse numerical order (placing the most recent bookmarks first), using the id numbers as keys. To see what the options of each command mean, please consult the corresponding man pages. The final result, which is saved in the file newest_bookmarks_first.csv, has this format (the URLs are trimmed for clarity):
https://www.zdnet.com/article/linux-foundation..../|296
https://www.linux.com/audience/devops/...|295
https://www.linuxuprising.com/2019/08/...|294
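Incidentally, if jq is installed on your system, the same extraction and sorting can be done in one step. This is just an optional alternative, not what Listing 2 uses:

# Extract "url|id" pairs and sort by id, newest first
jq -r '.[] | "\(.url)|\(.id)"' 20190915113030-shaarli.json | sort -t '|' -k 2 -n -r > newest_bookmarks_first.csv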
The main loop (lines 20 to 71) reads that file, one line at a time, saving the current ID and URL values in $BOOKMARKNUM and $CURRENT_BOOKMARK (lines 22 and 23). All the printf statements in the loop write what is happening to one main log file ($LOG) for later analysis and debugging. Lines 26 to 28 check if $SHAARLIHOME already contains a subfolder called $BOOKMARKNUM, which means that the bookmark was already archived in a previous run, so nothing else happens. Otherwise, the script writes the $CURRENT_BOOKMARK inside the file url_list.csv (line 31) that will serve as the input for ArchiveBox.
You might wish to modify the script to write, say, 10 or 20 URLs at a time inside url_list.csv and then fetch them all with just one call of the ArchiveBox container (see the sketch below). Calling that container once per bookmark slows the script down. However, if the script downloads just one bookmark per container call, with $BOOKMARKSLIMIT set to a very low value, it is much quicker to check if you are using the best combination of options for ArchiveBox before letting it loose on all your bookmarks.
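For the record, that batch variant could start from a sketch like this one. It is untested and deliberately leaves out the hard part, which is matching the several new subfolders of archive back to their Shaarli IDs (see the caveats in the last section):

# Hypothetical batch variant: feed the 10 newest URLs to one container call.
# With more than one URL per call, archive/ will contain several new
# subfolders, so the single-folder logic in line 45 no longer applies.
head -n 10 newest_bookmarks_first.csv | cut -d '|' -f 1 > url_list.csv
cat url_list.csv | docker run -i -v $ARCHIVEBOXHOME:/data nikisweeting/archivebox /bin/archive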
The actual archiving happens in line 33. This command, which was copied from the documentation, downloads the official container image of ArchiveBox and all its dependencies and runs it with Docker. My only additions are the optional settings WGET_USER_AGENT and FETCH_WARC. The first tells the wget download utility inside the container to identify itself as some desktop browser whenever it visits a website. This step is necessary because some websites will refuse to serve pages to automatic downloaders. The FETCH_WARC setting prevents ArchiveBox from creating WARC archives of each bookmark, to save some disk space.
Just like Shaarli, ArchiveBox gives each URL it processes a unique numeric identifier (which I decided to call $ARCHIVEBOXFOLDER) and saves all the copies in a folder with the same name, inside the local archive directory. Since only one URL is processed each time, whatever subfolder there is in archive is named for the numeric identifier, and finding its value is the task of line 45.
When ArchiveBox archives a web page inside $ARCHIVEBOXFOLDER, the title of the index.html file inside that directory is that page's URL if the archive process succeeds; otherwise it is set to Not yet archived…. If such a string is found in the index.html file, the $FAILURE flag is set to 1 (line 47). When that happens, the code in lines 49 to 55 creates a new folder in $SHAARLIHOME/$BOOKMARKNUM and saves the ArchiveBox logfile inside it, in HTML format. In this way, if I click on the LOCAL COPY link for the bookmark inside Shaarli, I will see what happened (Figure 7), even if I missed it in the general $LOG file.
If $FAILURE is zero, it means that ArchiveBox succeeded, and all the versions of the current bookmark are saved and indexed inside archive/$ARCHIVEBOXFOLDER. So, aside from logging the event, all that is left to do (lines 56 to 66) is to move archive/$ARCHIVEBOXFOLDER to $SHAARLIHOME/$BOOKMARKNUM (which is where the LOCAL COPY link in Shaarli points), fix some relative links in the index.html file (lines 59 and 60, see below for details), increment the $ADDED_BOOKMARKS counter, and remove all the auxiliary files left by ArchiveBox (line 65). This last step is necessary because, if the next run of ArchiveBox found those files, it would create extra folders, which would break the simple command used in line 45 to find $ARCHIVEBOXFOLDER.
The first time ArchiveBox runs (or believes it is running for the first time), it creates one static folder, at the same level as the archive folder, where it puts all the icons and CSS files used by the indexes of each page. Therefore, all the links to those resources inside each index.html file have the form ../../static/. Since I needed the static folder to be inside $SHAARLIHOME, all those strings must change from ../../static to ../static, which is the purpose of the Perl command in line 59.
General Set Up and Maintenance
The script in Listing 2 makes one ArchiveBox container call per bookmark. This makes it noticeably slower, but much simpler and much more future-proof. As is, if you ever want to replace ArchiveBox with another archiver, you only have to rewrite two lines of code (33 and 45)! Besides, if ArchiveBox processed 10 or 20 bookmarks in each run, it would be more complicated to match them with their Shaarli identifiers, and much easier to fill your hard drive or miss downloading errors.
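As a rough illustration of such a replacement (not part of the published script), line 33 could be swapped for a plain wget snapshot that keeps the one-subfolder-per-run layout that line 45 expects; the wget options shown are just a starting point:

# Hypothetical replacement for line 33: snapshot one page with wget alone
mkdir -p archive/$BOOKMARKNUM
wget --page-requisites --convert-links --adjust-extension --user-agent="Mozilla/5.0" --directory-prefix=archive/$BOOKMARKNUM "$CURRENT_BOOKMARK" &> archive2shaarli.$BOOKMARKNUM.log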
Due to lack of space, I can only briefly discuss three possible improvements that will strengthen this bookmark archiving system: privacy, backups, and disk space. Shaarli bookmarks can all be private by default, but the archive itself will be private only if you password-protect the $SHAARLIHOME folder at the web server level (see the sketch below). As for backups, the only specific information you need is which Shaarli files to back up [9].
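On the privacy side, this is a minimal sketch of such protection with Apache and HTTP Basic Authentication, assuming .htaccess overrides are enabled for that folder (all paths and the user name are examples):

# Create a password file for one user (you will be prompted for a password)
htpasswd -c /etc/apache2/.htpasswd-webarchive marco
# Require that login for everything under the web archive folder
cat > /var/www/html/webarchive/.htaccess <<'EOF'
AuthType Basic
AuthName "Private web archive"
AuthUserFile /etc/apache2/.htpasswd-webarchive
Require valid-user
EOF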
Using the options in Listing 2, archiving about 2,100 bookmarks created over 230K files on my server, requiring more than 23GB. You can save lots of space, however, by choosing an efficient archive format and using a tool like rdfind [10] to replace duplicate files with hard links.
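For example, this is how I would deduplicate the archive with rdfind; the folder path is an example, and it is prudent to do a dry run first:

# Preview what would be linked, then actually replace duplicates with hard links
rdfind -dryrun true -makehardlinks true /var/www/html/webarchive
rdfind -makehardlinks true /var/www/html/webarchive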
Long-term maintenance of the whole system is relatively simple, but necessary. On the ArchiveBox side, you might need to update the shaarlibox script if new versions use different command-line options or directory structures. With Shaarli, whenever you upgrade to a new version, you will need to manually patch the template.