Document conversion from the command line
Command Line – Pandoc
Pandoc lets you convert files from one markup format to another at the command line.
A strength of free software is that applications usually have everything users need for a specific purpose, a tendency that is especially strong in apps for KDE and the command line. Pandoc [1], a universal document converter, exemplifies this strength.
First released in 2006 by John MacFarlane, a philosophy professor at the University of California, Berkeley, Pandoc is a Haskell library for converting between text formats, especially those using a markup format (Table 1). In effect, it is an all-in-one replacement for the dozens of scripts that exist in many distributions for the same purpose.
Pandoc is not equipped to precisely convert complicated layout elements, such as margins and tables, in formats like PDF or Open Document Format (ODF). However, templates can be created for different formats, and converting the content alone is often far better than converting nothing at all. Moreover, in many cases, Pandoc is adequate for simply structured documents, like articles or essays, especially in a markup language. It also has advanced features for slide shows, citations, and bibliographies.
By default, Pandoc writes a document fragment to standard output (Figure 1). You specify the input format with -f, the output format with -t, and the input file:
pandoc -f markdown -t latex pandoc.txt
The result is a fragment in the output format specified that can be pasted into another document. To save the output, you must specify a file using the --output (-o) option. If you want a complete file, rather than a fragment, add the --standalone (-s) option. As with many command-line tools, saving to a file produces no output on screen unless something goes wrong.
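As a minimal sketch of the difference between a fragment and a complete file, the following commands convert a small Markdown file to HTML both ways. The file names (notes.md, notes-fragment.html, notes.html) and the title are placeholders chosen for illustration, not taken from the article.

```shell
# Create a small Markdown input file (placeholder name and content).
printf '%s\n' '# Notes' '' 'Some *emphasized* text.' > notes.md

if command -v pandoc >/dev/null 2>&1; then
    # Fragment only: the converted body, saved with -o instead of printed.
    pandoc -f markdown -t html -o notes-fragment.html notes.md

    # Complete document: --standalone (-s) adds the header and footer.
    # The title is passed as metadata because standalone HTML expects one.
    pandoc -f markdown -t html --standalone --metadata title="Notes" \
        -o notes.html notes.md
else
    echo "pandoc is not installed; skipping the conversion" >&2
fi
```

Opening the two output files side by side shows what --standalone adds: the fragment starts directly with the converted heading, while the standalone file wraps the same content in a full HTML document.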
If you do not specify the input and output formats, Pandoc will attempt to guess them from the file extensions. To ensure formatting, a template file can be specified (see the Templates section below). Use the --list-input-formats and --list-output-formats options to list the supported formats. If multiple input files are specified, they are concatenated into a single output file with blank lines between the contents of each input file.
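The sketch below lists the formats a given Pandoc build supports and then lets Pandoc guess the formats from the file extensions while combining two input files. The --list-input-formats and --list-output-formats options are the ones documented in the Pandoc manual for this purpose; the file names are placeholders.

```shell
# Two small Markdown inputs (placeholder names and content).
printf '%s\n' '# Part one' '' 'First file.' > part1.md
printf '%s\n' '# Part two' '' 'Second file.' > part2.md

if command -v pandoc >/dev/null 2>&1; then
    # List the formats this Pandoc build can read and write.
    pandoc --list-input-formats
    pandoc --list-output-formats

    # No -f or -t: the formats are guessed from the .md and .html extensions.
    # Both inputs are concatenated into a single output file.
    pandoc -o combined.html part1.md part2.md
else
    echo "pandoc is not installed; skipping the conversion" >&2
fi
```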
Templates
Each supported format has a default template stored in /usr/share/pandoc/data/templates/. Most follow the naming structure default.FORMAT. Exceptions include ODT's template, which is named default.opendocument, and PDF, which shares the default.latex template. In addition, EPUB uses epub-page.html, epub-coverimage.html, and epub-titlepage.html. You can view the default template using the command pandoc -D FORMAT (Figure 2).
You can write or download custom templates [2] or modify copies of existing templates [3] if the default template does not meet your needs. Templates consist of fields with fixed values and may include variables that are replaced by elements of the source file, often automatically. For example, the $title$ variable in <title>$title$</title> is replaced automatically by the source file's title. More advanced users can include conditionals, such as if/else statements. For a full description of custom templates, see Pandoc's man page and user guide [4].
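To make the template mechanism concrete, here is a small sketch that saves a copy of the default HTML template and then converts a file with a hand-written template. The template and the file names (mini.html, essay.md, and so on) are invented for illustration; the $title$, $body$, $if(...)$, and $for(...)$ constructs follow the template syntax described in the Pandoc user guide.

```shell
# A minimal HTML template (placeholder name). The quoted 'EOF' keeps the
# shell from expanding Pandoc's $...$ variables.
cat > mini.html <<'EOF'
<!DOCTYPE html>
<html>
<head>
<title>$title$</title>
</head>
<body>
$if(author)$
<p class="author">$for(author)$$author$$sep$, $endfor$</p>
$endif$
$body$
</body>
</html>
EOF

# A source file with a metadata block that supplies title and author.
printf '%s\n' '---' 'title: Template test' 'author: A. Writer' '---' \
    '' 'Body text.' > essay.md

if command -v pandoc >/dev/null 2>&1; then
    # Save a copy of the default template as a starting point for edits.
    pandoc -D html > default-copy.html

    # Convert with the custom template instead of the default one.
    pandoc -f markdown -t html --template=mini.html -o essay.html essay.md
else
    echo "pandoc is not installed; skipping the conversion" >&2
fi
```

The conditional block only produces the author paragraph when the source defines an author, which is the kind of if/else logic a template can carry.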
In the end, if content is more important than structure, you can generally use the default templates without tweaking them.
Note that converting to PDF still requires additional software: Pandoc hands the work to an external PDF engine, by default a LaTeX installation, which must be installed separately.
Input/Output Options
Instead of templates, you can do some formatting using options. To eliminate any ambiguity in the command structure, you can specify the input format with --from FORMAT (-f FORMAT) or --read FORMAT (-r FORMAT), and the output format with --to FORMAT (-t FORMAT) or --write FORMAT (-w FORMAT). Similarly, although Pandoc traditionally looks for user data files, such as custom templates, in the .pandoc directory in your home directory, you can specify another directory with --data-dir=DIRECTORY.
Other options affect the internal formatting. For instance, while the default is to replace tabs with spaces, --preserve-tabs (-p) will override the default. When setting up tabs, you may also use --tab-stop=NUMBER to change the default four spaces used for tabs. You can also use --base-header-level=NUMBER to set the first heading level to use and --smart (-S) to use typographic characters such as smart quotes and en and em dashes (instead of two or three hyphens). In Pandoc 2.0 and later, --smart is replaced by the smart extension (for example, -f markdown+smart), and recent releases prefer --shift-heading-level-by=NUMBER to --base-header-level.
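The tab options can be tried out with a sketch like the following, which uses the option names from the current Pandoc manual (--preserve-tabs and --tab-stop). According to the manual, tab handling is only visible in code spans and code blocks, so the placeholder input file contains a fenced code block with a literal tab.

```shell
# A fenced code block containing a literal tab (placeholder file name).
printf '```\nkey\tvalue\n```\n' > tabs.md

if command -v pandoc >/dev/null 2>&1; then
    # Default: the tab is converted to spaces, with a tab stop of four.
    pandoc -f markdown -t html -o tabs-default.html tabs.md

    # A wider tab stop of eight for the conversion to spaces.
    pandoc -f markdown -t html --tab-stop=8 -o tabs-wide.html tabs.md

    # Keep the literal tab character in the code block.
    pandoc -f markdown -t html --preserve-tabs -o tabs-kept.html tabs.md
else
    echo "pandoc is not installed; skipping the conversion" >&2
fi
```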
Individual formats also have their own formatting options. For instance, in HTML5, --section-divs wraps sections in <div> or <section> tags, which can be formatted with CSS style sheets created outside Pandoc. LaTeX, ConTeXt, and DocBook output can use --chapters to convert the top-level headings into chapters, while --no-tex-ligatures suppresses ligatures in LaTeX or ConTeXt output, which can be convenient with some recent OpenType features. More generally, several options are intended primarily for code, such as the self-explanatory --no-wrap, --columns=NUMBER, --no-highlight, and --highlight-style=STYLE (with styles of pygments, kate, monochrome, espresso, zenburn, haddock, and tango). Some of these options have been renamed in later releases: --chapters is now --top-level-division=chapter, and --no-wrap is now --wrap=none. Many of these options can reside in a single file that is specified with --defaults=FILE, eliminating the need to continually structure a detailed command.
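A defaults file is a YAML file whose keys mirror the command-line options. The sketch below is one possible setup, assuming a Pandoc release recent enough to support --defaults; the file names and the chosen settings are illustrative only.

```shell
# A defaults file collecting options that would otherwise be typed each
# time (placeholder name; the keys mirror Pandoc's long option names).
cat > article.yaml <<'EOF'
from: markdown
to: html
standalone: true
wrap: none
metadata:
  title: Defaults demo
EOF

printf '%s\n' 'A paragraph converted with settings from a defaults file.' \
    > input.md

if command -v pandoc >/dev/null 2>&1; then
    # Only the defaults file, the output, and the input are named here.
    pandoc --defaults=article.yaml -o output.html input.md
else
    echo "pandoc is not installed; skipping the conversion" >&2
fi
```

Options given on the command line are combined with those in the file, so a defaults file can hold the stable settings while the command supplies what changes from run to run.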
For many output formats, options provide most of the formatting, with the exception of spacing and layout. However, layout can be added via CSS style sheets linked with --css=URL. Some output formats have specific options for style sheets, such as --reference-odt=FILE (ODT), --reference-docx=FILE (DOCX), and --epub-stylesheet=FILE (EPUB). Current releases merge the first two into a single --reference-doc=FILE option and style EPUB output with --css. If you regularly convert to such formats, developing a style sheet may be worth the effort. You may even find a style sheet online that you can use with little or no modification.
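Linking a style sheet to HTML output might look like the following sketch. The style rules and file names are placeholders; --css only adds a link to the style sheet, so the CSS file must be kept alongside the generated page.

```shell
# A tiny style sheet and a source file (placeholder names and content).
printf '%s\n' 'body { max-width: 40em; margin: 2em auto; }' > style.css
printf '%s\n' '# Styled page' '' 'Layout comes from the style sheet.' > page.md

if command -v pandoc >/dev/null 2>&1; then
    # --css links the style sheet from the header of the standalone page.
    pandoc -f markdown -t html --standalone --metadata title="Styled page" \
        --css=style.css -o page.html page.md
else
    echo "pandoc is not installed; skipping the conversion" >&2
fi
```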
Special Uses
Besides routine format conversion, Pandoc has several special uses. For instance, Pandoc supports several slide show applications, including PowerPoint. However, to judge by the available options, its main emphasis is on Beamer, a LaTeX-based presentation application [5]. The markup for a Beamer slide is as simple as starting each one with a ## heading. To Beamer's own thorough array of features, Pandoc adds options of its own. While converting a file for use in Beamer, Pandoc can define a logo, title graphics, navigation symbols, Beamer theme, and the aspect ratio for slides. Supported features include slide backgrounds, transitions, and lists in which items are displayed one at a time. There is even an option to pass Beamer's own options to the converted presentation. In addition, Pandoc can convert a Beamer presentation to an article. Because the source is plain Markdown, Pandoc provides a professional slide show tool regardless of the office suite used.
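A short sketch of the Beamer workflow follows. It writes the LaTeX source for the slides rather than a PDF, so no LaTeX installation is needed to try it. The slide content, the Warsaw theme, and the 16:9 aspect ratio are arbitrary choices for illustration.

```shell
# Markdown source for a two-slide talk (placeholder name and content).
cat > slides.md <<'EOF'
---
title: Demo talk
author: A. Speaker
---

# Introduction

## First slide

- One point
- Another point

## Second slide

Some closing text.
EOF

if command -v pandoc >/dev/null 2>&1; then
    # -t beamer selects Beamer output; --incremental reveals list items one
    # at a time; -V sets template variables such as theme and aspect ratio.
    pandoc -f markdown -t beamer --standalone --incremental \
        -V theme=Warsaw -V aspectratio=169 -o slides.tex slides.md
else
    echo "pandoc is not installed; skipping the conversion" >&2
fi
```

Changing the output file to slides.pdf would ask Pandoc to build the presentation itself, which requires a LaTeX installation.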
Pandoc also has extensive support for citations and bibliographies. Using the option --citeproc, Pandoc can generate citations from a source file and a bibliographic database specified with one --bibliography=FILE for each bibliography used. BibLaTeX (.bib), BibTeX (.bibtex), CSL JSON (.json), and CSL YAML (.yaml) are all supported formats. By default, Pandoc uses the Chicago Manual of Style citation style, although other citation styles can be selected with --csl=FILE. There is even a --citation-abbreviations=FILE option that can define abbreviations for frequently cited titles. The citations and bibliography are kept separate from the source files, making it easy to update them and then generate a new file.
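The following sketch shows the citation workflow end to end, assuming a Pandoc release that provides the --citeproc option named in the Pandoc manual. The bibliography entry is invented purely for illustration and does not refer to a real publication; the file names are placeholders as well.

```shell
# A one-entry bibliography (made-up entry, placeholder file name).
cat > refs.bib <<'EOF'
@book{doe2020,
  author    = {Doe, Jane},
  title     = {An Example Book},
  publisher = {Example Press},
  address   = {Springfield},
  year      = {2020}
}
EOF

# The source cites the entry by its key with the [@key] syntax.
printf '%s\n' 'A claim that needs support [@doe2020].' > paper.md

if command -v pandoc >/dev/null 2>&1; then
    # --citeproc resolves the citation and appends the bibliography.
    pandoc -f markdown -t plain --citeproc --bibliography=refs.bib \
        -o paper.txt paper.md
else
    echo "pandoc is not installed; skipping the conversion" >&2
fi
```

Because the bibliography lives in its own file, correcting an entry in refs.bib and rerunning the command is enough to refresh every citation in the output.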