Process structured text files with Miller

One by One

© Lead Image © bahri altay, 123RF.com

Article from Issue 187/2016
Author(s):

Miller offers a clever alternative for working with structured text files: use a single tool to replace the strings of commands built from conventional utilities like grep, cut, and sed.

Miller [1] is a helpful command-line tool for working with structured files. Instead of contending with long instructions lined with pipes, you can achieve your goals with more compact constructs.

TIP

If no output appears, Miller is probably not recognizing the newline character as the record separator; this problem often occurs with CSV files. If you start the command with mlr --csv --rs lf, processing should work.

Miller supports a variety of formats (Table 1), which it lists when called with mlr --usage-data-format-examples. We used version 3.1.2 for this article, freshly compiled from the sources.

Table 1

Data Structures

| Type/Format specification | Features |
| --- | --- |
| dkvp | Identifier with value assignments, comma as field separator (a=1,b=2,c=3) |
| nidx | Numeric field identifier assigned implicitly by position, space as field separator (a b c) |
| csv | No field labels within the records (the labels come from the header line), text optionally in quotes, comma as field separator (a,b,c) |
| pprint | Formatted output from Miller, produces tables |
| xtab | Outputs tables vertically, one field label with a value in each line |

Miller is a single utility that lets you combine the effects of several classic Unix tools, like grep, cut, join, sort, tail, head, and sed. The syntax of mlr uses commands with their own options. Table 2 shows a selection of commands for mlr. The box called "Some Examples" shows the sample files used in the examples that follow.

Some Examples

### csv1.txt
first,second,third
a,b,c
d,e,f
### csv2.txt
first,second,third
1,2,3
4,5,6
### csv3.txt
Name,first_name,amount
Miller,Hans,12.34
Meier,Klaus,56.78
Bauer,Stefan,90.12
### csv4.txt
Name,first_name,amount
Schmidt,Johann,12.34
Meier,Klaus,56.78
Albert,Stefan,90.12
### dkvp1.txt
a=1,b=2,c=3
d=4,e=5,f=6
### dkvp2.txt
a=1,b=2,c=3
d=4,e=5
f=7,g=8,h=9

Table 2

Miller: Command Overview

| Command | Options | Function/Notes |
| --- | --- | --- |
| cat | | Like the cat shell command |
| | -n | Adds another column with ascending enumeration on the left |
| | -N Name | Like -n, but with a name for the column with the enumeration |
| decimate | | Uses every tenth line of data |
| | -n N | Uses every Nth line of data |
| cut | | Like the cut shell command |
| | -f Name,… | Only output the fields with this column name |
| | -o | Before -f: Additionally output the fields in the specified order |
| | -x | Before -f: Do not output the specified fields |
| filter | | Output data lines with the stated features |
| | 'FNR == N' | Outputs the Nth line of each file |
| grep | | Like the grep shell command, but with a restricted feature set |
| | -v | Outputs non-matching lines |
| group-by | | Outputs identical lines in a group |
| group-like | | Outputs lines with identical identifiers |
| head | | Outputs the start of a file |
| | -n Lines | Number of lines without the header (mandatory) |
| join | | Join two files via a shared column |
| | -u | Processes unsorted input |
| | -j Column,… | States the shared fields |
| | -f File | States the file on the left |
| rename Old,New | | Rename field designator |
| | -r | State the old field name as a regular expression |
| reorder | | Change the column order |
| | -f Columns | States the order (mandatory) |
| | -e | Output the stated columns at the end of the line |
| sample | | Output a number of randomly selected lines |
| | -k Lines | States the line count, not including headers |
| sort | | Sorting |
| | -f Name,… | Ascending by stated columns, characters of all types |
| | -r Name,… | Descending by stated columns, characters of all types |
| | -nf Name,… | Ascending by stated columns, numeric |
| | -nr Name,… | Descending by stated columns, numeric |
| stats1 | | Computations |
| | -a sum -f Column,… | Sum |
| | -a count -f Column,… | Record/line count |
| | -a mean -f Column,… | Average |
| | -a min -f Column,… | Minimum |
| | -a max -f Column,… | Maximum |
| step | | Stepwise output of computational results |
| | -a rsum -f Column,… | Running total, output per line |
| | -a delta -f Column,… | Difference between two subsequent lines |
| | -a ratio -f Column,… | Ratio between two subsequent lines |
| | -a counter -f Column,… | Ongoing output of the number of records |
| | -a from-first -f Column,… | Difference to first record output |
| tac | | Like the tac shell command (output in reverse order) |
| tail | | Output the end of the file (counterpart to head) |
| | -n Lines | Number of lines without a header |
| top | | Output lines/records with the highest or lowest numeric value |
| | -f Column,… | State the columns with matching numeric values |
| | -a | Output all columns of a line |
| | --min | Output the smallest numeric value |
| | -n Lines | Number of lines to output |
| uniq | | Output identical records grouped |
| | -g Column,… | State the columns to be evaluated |
| | -n | Only determine the number of records to be output, grouped |
| | -c | State the number of its occurrences for each grouped record |
| bar | | Output numeric values as ASCII bar charts |
| | -f Column | State the column with the numeric values |
| | -c Character | State the bar character (default: *) |
| | -x Character | State the character for the values outside of the display range (default: #) |
| | -b Character | State the padding character (default: .) |
| | -w Bar width | State the bar width (default: 40) |
| | --lo Value | Initial value of the bar chart |
| | --hi Value | Final value of the bar chart |
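A few of the commands from Table 2 can be tried on the csv3.txt sample file. The following is a minimal sketch, assuming mlr is in the search path; it recreates the sample file first so that it runs on its own. The --rs lf option is the workaround from the tip and may be unnecessary with Miller versions newer than the one used in this article.

```shell
# Recreate the csv3.txt sample file from the "Some Examples" box
cat > csv3.txt <<'EOF'
Name,first_name,amount
Miller,Hans,12.34
Meier,Klaus,56.78
Bauer,Stefan,90.12
EOF

# Sort numerically by amount, largest value first
mlr --icsv --rs lf --opprint sort -nr amount csv3.txt

# Keep only the Name and amount columns
mlr --icsv --rs lf --opprint cut -f Name,amount csv3.txt

# Sum and average of the amount column
mlr --icsv --rs lf --opprint stats1 -a sum,mean -f amount csv3.txt
```

Each call reads CSV and prints a formatted table, so the effect of the individual command is easy to compare against the unmodified file.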

To separate the fields of the input, you will usually use commas. Miller lets you define the format for the input, the output, or both together: An option such as --csv applies to both, whereas a leading i (as in --icsv) applies only to the input and a leading o (as in --ocsv) only to the output. Table 3 lists some important separator options.

Table 3

Separators

| Task | Option | Notes |
| --- | --- | --- |
| Record separator | --rs | e.g., lf or '\r\n' |
| Field separator | --fs | e.g., ',' or ';' |
| Pair separator | --ps | Only relevant for DKVP files |
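The separator options also accept the leading i and o, so the field separator can differ between input and output. The following sketch assumes a hypothetical file named semi.txt that uses semicolons; --ifs and --ofs are the input and output variants of --fs.

```shell
# Hypothetical sample file with semicolons as field separators
cat > semi.txt <<'EOF'
Name;first_name;amount
Miller;Hans;12.34
Meier;Klaus;56.78
EOF

# Read semicolon-separated CSV, write comma-separated CSV
mlr --icsv --ifs ';' --ocsv --ofs ',' --rs lf cat semi.txt
```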

Output

The cat command reads text files and outputs them, appropriately formatted if necessary, to a pipe, a file, or the screen. The call in the first line of Listing 1 outputs the two specified files in succession, including the column headings (Figure 1). Because no format is specified, Miller assumes its default DKVP format, finds no identifiers in the lines, and therefore adds its own numerical identifiers for the fields.

Listing 1

Miller's cat

01 $ mlr cat csv1.txt csv2.txt
02 $ mlr --csv --rs lf cat csv1.txt csv2.txt
03 $ mlr --opprint cat csv1.txt csv2.txt
04 $ mlr --opprint --csv --rs lf cat csv1.txt csv2.txt
05 $ mlr --csv --rs lf --opprint cat csv1.txt csv2.txt
06 $ mlr --icsv --rs lf --odkvp cat csv1.txt > newdkvp.txt
07 $ mlr --idkvp --ocsv --rs lf cat dkvp1.txt > newcsv.txt
08 $ mlr --icsv --rs lf --oxtab cat csv3.txt > newxtab.txt

Figure 1: The Miller cat command prints the contents of a file if no options are specified.

If you specify the file type (csv in Listing 1) and a newline (--rs lf) as the record separator, Miller does not enumerate the fields (Listing 1, line 2). It also groups identical column headings into a single heading (Figure 2).

Figure 2: Specify the file type and the separator for the data as parameters to improve the output.

The --opprint option gives you even clearer output (Listing 1, line 3), but with a minor flaw: Because no input format is given, the program inserts its own numeric column headings (Figure 3, first line) and lists the headings from the files like ordinary records.

Figure 3: Visually enhanced output with column headings.

The order of options affects the results (Figure 4). The call in line 4 of Listing 1 apparently ignores the --opprint option, presumably because the --csv option that follows sets both the input and the output format and thus overrides it. In the opposite order (Listing 1, line 5), it works correctly: The software combines the identical headings and displays the values aligned beneath the header.
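One way to sidestep the ordering issue is to state the input and output formats separately. The sketch below assumes that an input-only option (--icsv) and an output-only option (--opprint) do not override each other, so that both orders give the same table; it recreates the two sample files so that it runs on its own.

```shell
# Recreate the csv1.txt and csv2.txt sample files
cat > csv1.txt <<'EOF'
first,second,third
a,b,c
d,e,f
EOF
cat > csv2.txt <<'EOF'
first,second,third
1,2,3
4,5,6
EOF

# Input format first, output format second
mlr --icsv --rs lf --opprint cat csv1.txt csv2.txt

# Output format first, input format second
mlr --opprint --icsv --rs lf cat csv1.txt csv2.txt
```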

Figure 4: The order in which you specify options has an impact on the result.

Miller Converts

Using cat, Miller converts the formats listed in Table 1. Put an i in front of the name of the input and an o in front of the output, and Miller creates a DKVP format from a CSV file (Listing 1, line 6). The reverse approach works in the same way (Listing 1, line 7).

Converting to a line-by-line display (XTAB format) is useful, for example, when creating non-GUI applications, say, querying addresses (Listing 1, line 8). You will find the processed examples in Figure 5.
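To see what such a conversion produces without writing to a file, the redirection can simply be left out. This minimal sketch recreates csv1.txt and prints the DKVP version to the screen; the comment lines show the output to expect.

```shell
# Recreate the csv1.txt sample file
cat > csv1.txt <<'EOF'
first,second,third
a,b,c
d,e,f
EOF

# CSV in, DKVP out: each value is paired with its column heading
mlr --icsv --rs lf --odkvp cat csv1.txt
# first=a,second=b,third=c
# first=d,second=e,third=f
```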

Figure 5: Miller easily converts data structures from one format to another.

Searching and Finding

For browsing structured text files, Miller has the grep and filter commands. filter has a variety of options, particularly with regard to numerical evaluations. The software always outputs the header. The example from the first two lines of Listing 2 shows how to browse csv3.txt for the name "Meier". With the filter command, you specify the column; grep does not need the column. The first method is thus more precise because the term could exist in multiple columns.

Listing 2

Looking for a Name

$ mlr --csv --rs lf filter '($Name == "Meier")' csv3.txt
$ mlr --csv --rs lf grep 'Meier' csv3.txt
$ mlr --csv --rs lf filter '($amount > 20)' csv3.txt

The example in the last line of Listing 2 shows the results of a numerical analysis. Miller extracts all amounts greater than 20 euros from csv3.txt.

Figure 6 shows the three commands, as well as the resulting output.

Figure 6: The Miller commands filter and grep make it easy to extract specific data.
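Conditions in filter can also be combined. The following sketch assumes that the filter expression accepts the logical operator && and the comparison !=; it also shows grep with the -v option from Table 2. The sample file is recreated so that the example runs on its own.

```shell
# Recreate the csv3.txt sample file
cat > csv3.txt <<'EOF'
Name,first_name,amount
Miller,Hans,12.34
Meier,Klaus,56.78
Bauer,Stefan,90.12
EOF

# Amounts above 20, excluding the name "Meier"
mlr --csv --rs lf filter '$amount > 20 && $Name != "Meier"' csv3.txt
# Name,first_name,amount
# Bauer,Stefan,90.12

# grep -v drops every record that contains the search term
mlr --csv --rs lf grep -v 'Meier' csv3.txt
```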
