Document conversion from the command line

Lead Image © Dmitry Rukhlenko, 123RF.com

Article from Issue 248/2021
Pandoc lets you convert files from one markup format to another at the command line.

A strength of free software is that applications usually have everything users need for a specific purpose, a tendency that is especially strong in apps for KDE and the command line. Pandoc [1], a universal document converter, exemplifies this strength.

First released in 2006 by John MacFarlane, a philosophy professor at the University of California, Berkeley, Pandoc is a Haskell library for converting between text formats, especially those using a markup format (Table 1). In effect, it is an all-in-one replacement for the dozens of scripts that exist in many distributions for the same purpose.

Pandoc is not equipped to precisely convert complicated layout, such as margins and tables, in formats like PDF or OpenDocument Format (ODF), although templates can be created for different formats. Sometimes, though, converting the content alone is far better than converting nothing at all. Moreover, in many cases, Pandoc is adequate for simple documents, like articles or essays, especially those written in a markup language. It also has advanced features for slide shows, citations, and bibliographies.

By default, Pandoc writes a document fragment to standard output (Figure 1). In the basic command, you specify the input format, the output format, and the input file:

pandoc -f markdown -t latex pandoc.txt

Figure 1: By default, Pandoc writes a fragment to standard output.

The result is a fragment in the specified format that can be pasted into another document. To save the output, you must specify a file using the --output (-o) option. If you want a complete file, rather than a fragment, add the --standalone (-s) option. As with many command-line tools, saving to a file produces no output on screen unless something goes wrong.
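The three behaviors just described can be compared side by side. The following is a minimal sketch; the file names sample.md, fragment.tex, and full.tex are made up for illustration:

```shell
# Create a small Markdown source (hypothetical sample file).
printf '%s\n' '# Hello' '' 'Some *emphasized* text.' > sample.md

# 1. Fragment written to standard output.
pandoc -f markdown -t latex sample.md

# 2. The same fragment saved to a file; nothing is printed on success.
pandoc -f markdown -t latex sample.md -o fragment.tex

# 3. A complete document, including the LaTeX preamble.
pandoc -f markdown -t latex --standalone sample.md -o full.tex
```

Only full.tex starts with a \documentclass line; fragment.tex contains just the converted body.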

If you do not specify the input and output formats, Pandoc will attempt to guess them from the file extensions. To ensure formatting, a template file can be specified (see the Templates section below). Use the --list-input-formats and --list-output-formats options to list the supported formats. If multiple input files are specified, they are concatenated into a single output file, with a blank line between the contents of each input file.
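As a short sketch of format guessing and concatenation (part1.md, part2.md, and combined.html are hypothetical file names):

```shell
# Show which formats this Pandoc build can read and write.
pandoc --list-input-formats
pandoc --list-output-formats

# Two small input files.
printf '%s\n' '# Part one' > part1.md
printf '%s\n' '# Part two' > part2.md

# No -f or -t: the formats are guessed from the .md and .html extensions,
# and both inputs end up in one output file.
pandoc part1.md part2.md -o combined.html
```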

Templates

Each supported format has a default template stored in /usr/share/pandoc/data/templates/. Most follow the naming structure default.FORMAT. Exceptions include ODT's template, which is named default.opendocument, and PDF, which shares the default.latex template. In addition, EPUB uses epub-page.html, epub-coverimage.html, and epub-titlepage.html. You can view the default template using the command pandoc -D FORMAT (Figure 2).
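For example, the default HTML template can be inspected, or saved as a starting point for your own. In this sketch, my-template.html is a hypothetical file name:

```shell
# Print the first lines of the default HTML template.
pandoc -D html | head -n 15

# Save a copy that can be edited without touching the system-wide file.
pandoc -D html > my-template.html
```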

Figure 2: The beginning of the default template for ODF.

You can write or download custom templates [2] or modify copies of existing templates [3] if the default template does not meet your needs. Templates consist of fixed text plus variables that are replaced by elements of the source file, often automatically. For example, in the line <title>$title$</title>, the variable $title$ is replaced automatically by the source file's title. More advanced users can include conditional statements ($if$, $else$, $endif$) and loops ($for$, $endfor$). For a full description of custom templates, see Pandoc's man page and user guide [4].
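A very small custom template might look like the following sketch. The file names mini.html, note.md, and note.html are invented for illustration, and the template is far simpler than Pandoc's defaults:

```shell
# A tiny HTML template. $title$ and $body$ are filled in by Pandoc;
# the author paragraph appears only if the source defines an author.
cat > mini.html <<'EOF'
<!DOCTYPE html>
<html>
<head>
<title>$title$</title>
</head>
<body>
<h1>$title$</h1>
$if(author)$
<p class="author">$author$</p>
$endif$
$body$
</body>
</html>
EOF

# A source file whose YAML metadata block supplies the title.
cat > note.md <<'EOF'
---
title: Release Notes
---

First paragraph.
EOF

# --template implies a complete (standalone) document.
pandoc --template=mini.html note.md -o note.html
```

Because note.md sets no author, the conditional block is skipped and note.html contains only the title and the body.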

In the end, if content is more important than structure, you can generally use the default templates without tweaking them.

Note that converting to PDF still relies on an external program. By default, Pandoc generates LaTeX and passes it to pdflatex, so a LaTeX installation is required unless you select another engine with --pdf-engine=PROGRAM.

Input/Output Options

Instead of templates, you can do some formatting using options. To eliminate any ambiguity in the command structure, you can specify the input format with --from FORMAT (-f FORMAT) or --read FORMAT (-r FORMAT), and the output format with --to FORMAT (-t FORMAT) or --write FORMAT (-w FORMAT). Similarly, although Pandoc looks for custom templates and other data files in a per-user data directory by default ($XDG_DATA_HOME/pandoc, typically ~/.local/share/pandoc, or the older ~/.pandoc), you can specify another directory with --data-dir=DIRECTORY.
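Explicit formats matter most when there is no file extension to guess from, as when reading from standard input. In this sketch, ./pandoc-data is a hypothetical directory; --data-dir names the place where Pandoc searches for templates and similar data files:

```shell
# Input arrives on a pipe, so both formats are named explicitly.
printf 'Some *emphasized* text.\n' | pandoc --from markdown --to html

# -r and -w are synonyms for -f and -t. The data directory here is
# empty, so Pandoc falls back to its built-in defaults.
mkdir -p pandoc-data/templates
printf 'Some *emphasized* text.\n' | pandoc --data-dir=./pandoc-data -r markdown -w html
```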

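Other options affect the internal formatting. For instance, while the default is to replace tabs with spaces, --preserve-tabs (-p) will override the default. When setting up tabs, you may also use --tab-stop=NUMBER to change the default four spaces used for tabs. You can also use --shift-heading-level-by=NUMBER to move all headings up or down by a fixed number of levels (the older --base-header-level=NUMBER is deprecated). Typographic characters such as curly quotes and em dashes (instead of two or three hyphens) are handled by the smart extension, which replaces the --smart (-S) option of Pandoc 1.x. The extension is enabled by default for Markdown input and can be switched off with -f markdown-smart.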
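The effect of the tab and heading options can be seen in a short sketch. The file name tabs.md is hypothetical, and --shift-heading-level-by assumes Pandoc 2.8 or later:

```shell
# A code block containing a literal tab, followed by a top-level heading.
printf '~~~\na\tb\n~~~\n\n# Heading\n' > tabs.md

# Default: the tab is expanded to spaces (tab stops every four columns).
pandoc -f markdown -t html tabs.md

# Tab stops every eight columns instead.
pandoc -f markdown -t html --tab-stop=8 tabs.md

# Keep the literal tab character.
pandoc -f markdown -t html --preserve-tabs tabs.md

# Demote every heading by one level, so "# Heading" becomes an <h2>.
pandoc -f markdown -t html --shift-heading-level-by=1 tabs.md
```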

Individual formats also have their own formatting options. For instance, in HTML output, --section-divs wraps each section in <section> tags (or <div> tags in older HTML), which can be formatted with CSS style sheets created outside Pandoc. LaTeX, ConTeXt, and DocBook output can use --top-level-division=chapter (formerly --chapters) to convert the top-level headings into chapters. The Pandoc 1.x option --no-tex-ligatures, which suppressed ligatures for quotes and dashes in LaTeX or ConTeXt output, was removed in Pandoc 2.0; disabling the smart extension on the output format (for example, -t latex-smart) serves the same purpose. More generally, several options are intended primarily for code, such as --wrap=none (formerly --no-wrap), --columns=NUMBER, --no-highlight, and --highlight-style=STYLE (with styles including pygments, kate, monochrome, breezeDark, espresso, zenburn, haddock, and tango). Many of these options can reside in a single file that is specified with --defaults=FILE (-d FILE), eliminating the need to continually structure a detailed command.
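A defaults file is plain YAML whose keys mirror the long option names. The following sketch assumes Pandoc 2.8 or later (where defaults files were introduced); article.yaml, demo.md, and demo.html are hypothetical file names:

```shell
# Collect frequently used options in one file.
cat > article.yaml <<'EOF'
from: markdown
to: html5
standalone: true
section-divs: true
highlight-style: tango
wrap: none
EOF

# A small source file to convert.
cat > demo.md <<'EOF'
---
title: Demo
---

# Section

Some text.
EOF

# One short command now carries all of the settings above.
pandoc --defaults=article.yaml demo.md -o demo.html
```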

For many output formats, options cover most formatting needs, with the exception of spacing. However, layout can be added via CSS style sheets linked with --css=URL, which also applies to EPUB output (replacing the older --epub-stylesheet=FILE). For ODT and DOCX output, --reference-doc=FILE takes the styles from an existing document (replacing the older --reference-odt=FILE and --reference-docx=FILE). If you regularly convert to such formats, developing a style sheet or reference document may be worth the effort. You may even find one online that you can use with little or no modification.
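As a sketch of both approaches, assuming Pandoc 2.0 or later (report.md, style.css, and custom-reference.docx are hypothetical names):

```shell
# A small source file.
cat > report.md <<'EOF'
---
title: Report
---

# Findings

Some text.
EOF

# Link an external style sheet from the generated HTML page.
# style.css only needs to exist when the page is viewed.
pandoc --standalone --css=style.css report.md -o report.html

# Extract Pandoc's default DOCX reference document, edit its styles
# in a word processor, then reuse it for later conversions.
pandoc -o custom-reference.docx --print-default-data-file reference.docx
pandoc --reference-doc=custom-reference.docx report.md -o report.docx
```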

Special Uses

Besides routine format conversion, Pandoc has several special uses. For instance, Pandoc supports several slide show applications, including PowerPoint. However, to judge by the available options, its main emphasis is on Beamer, a LaTeX-based presentation application [5]. The markup for a Beamer slide is as simple as starting each one with ##. To Beamer's own thorough array of features, Pandoc adds options of its own. While converting a file for use in Beamer, Pandoc can define a logo, title graphics, navigation symbols, Beamer theme, and the aspect ratio for slides. Common layouts include slide backgrounds, transitions, and lists in which items are displayed one at a time. There is even an option to add Beamer options to the converted presentation. In addition, Pandoc can convert a Beamer presentation to an article. Pandoc's emphasis on Markdown provides a professional slide show application regardless of the office suite used.
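A minimal Beamer conversion might look like the following sketch. The file names, the speaker name, and the choice of the Madrid theme are only illustrative. The output is LaTeX source, so no LaTeX installation is needed to try it:

```shell
# Two slides, each introduced by a level-two heading.
cat > slides.md <<'EOF'
---
title: Pandoc in Five Minutes
author: A. Speaker
---

## Why Pandoc

- One tool, many formats
- Plain-text sources

## Getting started

Run `pandoc` with an input file.
EOF

# -V sets template variables such as the theme and aspect ratio;
# --incremental reveals list items one at a time.
pandoc -t beamer --standalone --incremental \
  -V theme:Madrid -V aspectratio:169 \
  slides.md -o slides.tex

# With LaTeX installed, a .pdf target builds the slides directly:
# pandoc -t beamer slides.md -o slides.pdf
```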

Pandoc also has extensive support for citations and bibliographies. Using the option --citeproc (available since Pandoc 2.11, where it replaced the external pandoc-citeproc filter), Pandoc can generate citations from a source file and a bibliographic database specified with one --bibliography=FILE for each bibliography used. BibLaTeX (.bib), BibTeX (.bibtex), CSL JSON (.json), and CSL YAML (.yaml) are all supported formats. By default, Pandoc uses the Chicago Manual of Style author-date citation style, although other citation styles can be selected with --csl=FILE. There is even a --citation-abbreviations=FILE option that can define abbreviations for frequently used titles. The citations and bibliography are kept separate from the Pandoc source files, making it easy to update them and then generate a new file.
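The following sketch shows the pieces working together, assuming Pandoc 2.11 or later. The file names refs.bib, paper.md, and paper.html are hypothetical:

```shell
# A one-entry bibliographic database in BibLaTeX format.
cat > refs.bib <<'EOF'
@book{knuth1984,
  author    = {Knuth, Donald E.},
  title     = {The {TeX}book},
  publisher = {Addison-Wesley},
  year      = {1984}
}
EOF

# The source cites the entry by its key; the bibliography is
# appended at the end of the document, under the last heading.
cat > paper.md <<'EOF'
TeX is documented in detail [@knuth1984].

# References
EOF

# Resolve the citation using the default Chicago author-date style.
pandoc --citeproc --bibliography=refs.bib paper.md -o paper.html
```

To switch to another citation style, add --csl=FILE with a CSL style file of your choice.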
