Paperwork battles the increasing stacks of paper


Article from Issue 166/2014

Paperwork was developed to manage the paperless office – a dream as old as desktop PCs.

The idea behind Paperwork [1] harks back to the dream of the paperless office: You scan incoming correspondence, invoices, and loose sheets, then run them through an optical character recognition (OCR) tool that converts the content into digital form. An application then superimposes the recognized text on the image data and saves the result as a PDF.

Certain pitfalls await, however: Sufficiently good OCR requires the highest quality scans or photographs of the text pages that you can manage. A good scanner with at least 600dpi resolution is preferred (although 300dpi will work in some cases), and the OCR software needs to be the best fit for the job at hand. When Paperwork launches, it first searches for Tesseract [2]. If it cannot find this very powerful OCR engine, the program falls back to Cuneiform. In most cases, Tesseract will give better results.
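
The lookup order described here can be illustrated in a few lines of Python. This is a minimal sketch, not Paperwork's actual code: The function name find_ocr_engine is made up for this example, and it assumes that the engines are installed as the tesseract and cuneiform executables somewhere on the PATH.

```python
import shutil
from typing import Callable, Optional

# Engines in order of preference, as described in the article.
OCR_CANDIDATES = ("tesseract", "cuneiform")


def find_ocr_engine(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the name of the first OCR executable found, or None.

    The lookup function is a parameter so the logic can be tried out
    without either engine being installed.
    """
    for name in OCR_CANDIDATES:
        if which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    print(find_ocr_engine() or "no OCR engine found")
```

If neither program is installed, the function returns None, which a caller would have to treat as "OCR not available."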

Getting Started

On Arch Linux, you can install Paperwork easily from the AUR. On Ubuntu, you will not currently find Paperwork in the repositories, and there is no PPA. Your best bet is to follow the installation manual [3].

Paperwork is essentially based on four components. To scan the documents, Paperwork draws on Sane. Character recognition is handled by Tesseract or Cuneiform. Whoosh [4] indexes the OCR-converted texts so they can be searched easily, and the tool automatically generates suggestions for keywords. Paperwork then merges the whole enchilada into a graphical interface developed with Gtk/Glade.

The preferred Tesseract OCR engine originally came from Hewlett-Packard. Google uses the open source library, for example, to digitize books [5]. The software stands out for its high recognition rate and high level of automation. The drawback: Older Tesseract versions process only uncompressed TIFF input files, so you might need to convert documents first.

The Paperless Office

On launch, Paperwork comes up with a clearly designed interface comprising three sections. On the left, you see the current document; next to that are the existing, scanned, and edited pages; on the right is the current page in detail. Like the gscan2pdf PDF scanner [6], Paperwork retrieves documents directly from a connected scanner or loads existing images from the hard disk.

The software merges scanned images to form projects and then exports the projects as PDF files. By default, Paperwork stores the projects in the papers folder in subdirectories named after the current date and time (e.g., 20140605_1350_31/). It creates several files in each of these directories: paper.<number>.jpg contains the JPEG image of a scanned page, and paper.<number>.words contains the text that the OCR engine extracted from that page.

The .words files are not plain text, however, but special XML files in hOCR format [7] that record the position of each word in the original document in addition to the text itself. These files are not easy to read in a text editor, but the format lets you superimpose the extracted text precisely on the image file. The DjVu document format [8], which was specially developed for scanned documents, relies on the same principle.
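
To get a feel for what such a file contains, you can pull the words and their bounding boxes out with Python's standard library. The following is a sketch under assumptions: It expects each word in a span element with the class ocrx_word and a title attribute of the form bbox x0 y0 x1 y1, as the hOCR format defines it. The sample markup and the names HocrWordParser and parse_hocr are invented for this example, and the .words files that Paperwork writes might differ in detail.

```python
import re
from html.parser import HTMLParser

# hOCR stores the bounding box in the title attribute: "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word span currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes word spans contain no nested spans.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr(markup):
    parser = HocrWordParser()
    parser.feed(markup)
    parser.close()
    return parser.words


# Hypothetical sample, not taken from a real Paperwork file.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 45'>
<span class='ocrx_word' title='bbox 10 20 90 45'>Paperless</span>
<span class='ocrx_word' title='bbox 100 20 170 45; x_wconf 93'>office</span>
</span></div>"""

if __name__ == "__main__":
    for word, box in parse_hocr(SAMPLE):
        print(word, box)
```

The coordinates are pixel positions in the scanned image, which is what makes it possible to place the text exactly over the picture.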

Paperwork also stores preview images of the scanned pages in the directory. You can identify them by the thumb component in their file names. Files with labels in their names store manually assigned labels for the document; a file stored as extra.txt additionally contains the keywords you assign.
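
The directory layout lends itself to small helper scripts. The following Python sketch rebuilds the directory name from a timestamp and groups the page files by page number. The strftime pattern is inferred from the single example name 20140605_1350_31, and the function names are made up for illustration; neither is taken from Paperwork's source code.

```python
import re
from datetime import datetime
from typing import Dict, Iterable

# Pattern inferred from the example directory 20140605_1350_31 (assumption).
DIR_FORMAT = "%Y%m%d_%H%M_%S"

# Matches paper.<number>.jpg and paper.<number>.words, nothing else.
PAGE_RE = re.compile(r"^paper\.(\d+)\.(jpg|words)$")


def document_dir_name(when: datetime) -> str:
    """Build a directory name in the style of the article's example."""
    return when.strftime(DIR_FORMAT)


def group_pages(file_names: Iterable[str]) -> Dict[int, Dict[str, str]]:
    """Map page number -> {"jpg": name, "words": name} for page files."""
    pages: Dict[int, Dict[str, str]] = {}
    for name in file_names:
        match = PAGE_RE.match(name)
        if match is None:
            continue  # thumbnails, labels, extra.txt, and so on
        pages.setdefault(int(match.group(1)), {})[match.group(2)] = name
    return dict(sorted(pages.items()))
```

To inspect a real document directory, you could pass in the file names, for example with group_pages(p.name for p in pathlib.Path(doc_dir).iterdir()). A page that has a jpg entry but no words entry has not been through OCR yet.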

Paperwork supports multiple sources for loading documents. The application can drive a scanner directly, and it automatically tries to find the scanner via the Sane back end. Alternatively, Paperwork supports USB-connected webcams, which is usually not a good solution given their typically low resolution and poor image quality. Paperwork also accepts images created in other ways, such as screenshots of PDFs, as a source. Because of the lack of image quality, however, the OCR engine rarely delivers useful results in these cases.

Additionally, Paperwork lets you edit PDF files directly. You can load these by selecting Document | Import file(s). If necessary, Paperwork will import several PDFs in one fell swoop – but not recursively from subdirectories. Thus, you need to store the data to be imported in a single directory.
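
If your PDFs are scattered across nested folders, a short script can gather them in one place before the import. This Python sketch copies every PDF below a source folder into a single staging directory and adds a numeric suffix if two files share a name. The function name collect_pdfs and the suffix scheme are assumptions made for this example; Paperwork itself has no part in this step.

```python
import shutil
from pathlib import Path
from typing import List


def collect_pdfs(source_root: Path, staging_dir: Path) -> List[Path]:
    """Copy all PDFs below source_root into the flat staging_dir."""
    source_root = source_root.resolve()
    staging_dir = staging_dir.resolve()
    staging_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # sorted() reads the whole tree first, so new copies are not revisited.
    for pdf in sorted(source_root.rglob("*")):
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue
        if staging_dir in pdf.parents:
            continue  # staging folder lies inside the source tree

        # Avoid overwriting files that share a name: a.pdf, a_1.pdf, ...
        target = staging_dir / pdf.name
        counter = 1
        while target.exists():
            target = staging_dir / f"{pdf.stem}_{counter}{pdf.suffix}"
            counter += 1

        shutil.copy2(pdf, target)
        copied.append(target)
    return copied
```

The originals stay where they are. After the copy, you point Document | Import file(s) at the staging directory.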

Setting Up OCR

Before you start scanning documents, you need to set up the program (Figure 1). The icon for Settings is fourth from the left in the toolbar. In addition to configuring the working directory, you also configure the scanner and define the language for text recognition. Paperwork stores the settings in the ~/.config/paperwork.conf file, and it writes the index for all scanned documents to ~/.local/share/paperwork/index/.
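
If you want to check what the dialog has written, you can read the file with a script. The sketch below assumes that paperwork.conf uses INI-style syntax that Python's configparser module can parse, which the article does not confirm. The section and option names in the comment are purely hypothetical, and the function names are made up for this example.

```python
import configparser
from pathlib import Path
from typing import Dict

# Locations named in the article.
CONFIG_PATH = Path.home() / ".config" / "paperwork.conf"
INDEX_DIR = Path.home() / ".local" / "share" / "paperwork" / "index"


def parse_settings(text: str) -> Dict[str, Dict[str, str]]:
    """Parse INI-style text into {section: {option: value}}.

    Assumes INI syntax; configparser lowercases option names.
    A hypothetical input might look like:
        [Global]
        WorkDirectory = /home/user/papers
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def load_settings(path: Path = CONFIG_PATH) -> Dict[str, Dict[str, str]]:
    """Return the parsed settings, or an empty dict if the file is missing."""
    if not path.is_file():
        return {}
    return parse_settings(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    for section, options in load_settings().items():
        print(section, options)
```

The script only reads the file. Changes are best made in the settings dialog, because the program might rewrite the file.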

Figure 1: The Paperwork configuration is limited to a few settings.

To calibrate the scanner, click on the icon on the right in the settings dialog. Paperwork then starts a scan, which it uses as the basis for subsequent input from the device. How well this works depends to some extent on the fonts used.

Figure 2 shows an example in which the Paperwork OCR engine almost completely converted the text despite scanning at an angle. To see the words that were deciphered (in the blue frames), select Document | Advanced | Highlight all words. It is up to you to decide whether the plain text is accurate. In Figure 3, Paperwork tries its hand at a PDF generated by OpenOffice. This actually provides better conditions than a scanned document, but the result shows that many words were not recognized, as you can see from the number of words that lack blue boxes. Often, you can optimize the results by delimiting the area processed by the OCR engine in Document | Edit (Figure 4); however, this means a new, time-consuming OCR run each time you make a change.

Figure 2: Paperwork's OCR achieved good hit rates, even with poorly aligned documents.
Figure 3: Text passages without blue boxes were not identified as text by the Paperwork OCR feature.
Figure 4: You can narrow down the area to be processed in the image to optimize the OCR results.
