PDF creators, extractors, and editors tested
Mutool
The small command-line tool Mutool [5] is part of the simple PDF viewer MuPDF. Manufacturer Artifex describes it as the "Swiss army knife of PDF manipulation tools." If this is true, something is definitely wrong with Swiss engineering: Mutool can only regenerate the PDF, extract the fonts and images, display some information, and arrange the pages on a giant poster.
Like MuPDF, Mutool is released under the Affero GPL. We looked at version 1.2.2, which can be found in the repositories of Ubuntu 13.10 as the mupdf-tools package.
Mutool uncomplainingly extracted all the images and the associated fonts from the documents written with InDesign, LibreOffice, and Scribus. Because the tool had no idea what to do with vector graphics, it only returned the font used in the Inkscape PDF, DejaVu Sans.
Mutool stores the fonts in the file format that it finds in the PDF. In InDesign documents, this meant that PostScript fonts in CFF and CID [6] formats were generated. In contrast, the free applications had embedded TrueType fonts. The exception is Scribus, which embeds fonts as PFA files.
Mutool always provides images in PNG format; the tool converts all other image types without asking. Mutool will open encrypted PDF files if the user supplies the password via an extra parameter.
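For readers who want to script the extraction, the following Python sketch assembles a Mutool call. It assumes the extract subcommand and the -p password option of the mutool command line; the helper names mutool_extract_cmd and run_tool are hypothetical, and the options should be checked against the usage message of the installed version.

```python
import shutil
import subprocess


def mutool_extract_cmd(pdf_path, password=None):
    """Return the argument list for a `mutool extract` call."""
    cmd = ["mutool", "extract"]
    if password is not None:
        # Assumed option: -p hands over the password for encrypted PDFs.
        cmd += ["-p", password]
    cmd.append(pdf_path)
    return cmd


def run_tool(cmd, cwd=None):
    """Run cmd if the program is installed; return its exit code, else None."""
    if shutil.which(cmd[0]) is None:
        return None  # tool not installed, nothing was executed
    return subprocess.run(cmd, cwd=cwd).returncode
```

Because the extracted fonts and images land in the working directory, passing a scratch directory as cwd keeps them apart from the source documents.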
Poppler Utilities
The pdftotext command-line tool now belongs to the Poppler Toolbox, which in turn was created as a fork of Xpdf [7]. Most distributions provide Poppler in their repositories; on Ubuntu 13.10, pdftotext resides in the poppler-utils package. We tested version 0.24.1.
As its name suggests, pdftotext extracts the text from a PDF document. The results will always require postprocessing: In multicolumn documents, the extraction begins at the top left and stops at the bottom right on a page. The author box in the sample article ended up in the middle of the text, but at least no text was lost.
Using command-line options, you can restrict the analysis to individual pages and rectangular areas. On request, pdftotext will try to keep the layout (Figures 8 and 9). Columns and indentation are simulated with spaces. This makes it possible to read the LibreOffice test document, but spaces hinder further processing.
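The padding spaces can be removed again in a postprocessing step. The following Python sketch shows two possible approaches, not anything pdftotext itself offers: collapsing every run of whitespace, or splitting a line into cells wherever a gap of at least two spaces occurs. The function names and the two-space threshold are assumptions chosen for illustration.

```python
import re


def collapse_spaces(line):
    """Replace each run of spaces or tabs with a single space."""
    return re.sub(r"[ \t]+", " ", line).strip()


def split_columns(line, min_gap=2):
    """Split a line into cells wherever at least min_gap spaces occur."""
    cells = re.split(r" {%d,}" % min_gap, line.strip())
    return [cell for cell in cells if cell]
```

Splitting on gaps keeps single spaces inside a cell intact, which matters for running text in columns; the threshold may need tuning for documents with narrow gutters.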
Although users can also specify the character encoding, many non-standard characters were not recognized in the text generated from the sample documents. Pdftotext had no problems with password protection in the LibreOffice document; users only need to pass in the password with a parameter.
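The options described above can be combined in a single call. This Python sketch builds the argument list, assuming the pdftotext options -f and -l (first and last page), -x, -y, -W, and -H (rectangular area), -layout, -enc, and -upw (user password). The function name pdftotext_cmd and its parameters are hypothetical.

```python
def pdftotext_cmd(pdf_path, out_path="-", first=None, last=None,
                  area=None, layout=False, encoding=None, password=None):
    """Return the argument list for a pdftotext call.

    area is an (x, y, width, height) tuple; out_path "-" means stdout.
    """
    cmd = ["pdftotext"]
    if first is not None:
        cmd += ["-f", str(first)]
    if last is not None:
        cmd += ["-l", str(last)]
    if area is not None:
        x, y, width, height = area
        cmd += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    if layout:
        cmd.append("-layout")  # try to keep the original layout
    if encoding is not None:
        cmd += ["-enc", encoding]
    if password is not None:
        cmd += ["-upw", password]
    cmd += [pdf_path, out_path]
    return cmd
```

The resulting list can be handed to subprocess.run() on a system where the Poppler utilities are installed.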
Fishing for Photos
In addition to pdftotext, the Poppler Tools also include pdfimages, which extracts images, and pdftohtml, which converts a PDF into HTML pages. Pdfimages only extracts bitmap images from the PDF and then stores them in PPM format. You need to specify the -j switch to create JPEG images. As in pdftotext, you can pass in the password; the tool cannot handle vector graphics.
Pdftohtml behaves like a mixture of pdftotext and pdfimages: It extracts the images and dumps the text into one or more HTML files. To avoid a jumble of characters, we explicitly specified character encoding. Pdftohtml will add the links in the PDF document to the HTML results, if so desired.
The extracted text was just as jumbled as the output from pdftotext, and users cannot force the layout in this case. To compensate, the tool can create a "complex document." Here, the layout and the images are transferred into a large PNG image, onto which the browser then superimposes the text (Figure 10). The result is indeed reminiscent of the original layout, but you can only copy and edit the text. Moreover, pdftohtml totally ignores all vector graphics.
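Calls to the two tools in this section can be sketched in Python as follows. The sketch assumes the pdfimages options -j and -upw and the pdftohtml options -c (complex output), -enc, and -upw; the function names are hypothetical. Note that, according to the Poppler documentation, -j only writes images that are already stored in JPEG (DCT) form as JPEG files.

```python
def pdfimages_cmd(pdf_path, image_root, jpeg=False, password=None):
    """Return the argument list for a pdfimages call.

    image_root is the file name prefix for the extracted images.
    """
    cmd = ["pdfimages"]
    if jpeg:
        cmd.append("-j")  # save DCT-encoded images as JPEG files
    if password is not None:
        cmd += ["-upw", password]
    cmd += [pdf_path, image_root]
    return cmd


def pdftohtml_cmd(pdf_path, html_path=None, complex_output=False,
                  encoding=None, password=None):
    """Return the argument list for a pdftohtml call."""
    cmd = ["pdftohtml"]
    if complex_output:
        cmd.append("-c")  # page image as background, text on top
    if encoding is not None:
        cmd += ["-enc", encoding]
    if password is not None:
        cmd += ["-upw", password]
    cmd.append(pdf_path)
    if html_path is not None:
        cmd.append(html_path)
    return cmd
```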