Process structured text files with Miller
One by One
Miller offers a clever alternative for working with structured text files: use a single tool to replace the strings of commands built from conventional utilities like grep, cut, and sed.
Miller [1] is a helpful command-line tool for working with structured files. Instead of contending with long instructions lined with pipes, you can achieve your goals with more compact constructs.
TIP
If no output appears, Miller has probably not been told which newline character ends each record; this problem often occurs with CSV files. Starting the command with mlr --csv --rs lf should make processing work.
Miller supports a variety of formats (Table 1), which it lists when called with mlr --usage-data-format-examples. We used version 3.1.2 for this article, freshly compiled from the sources.
Table 1
Data Structures
Type/Format specification | Features |
---|---|
dkvp | Identifier with value assignments, comma as field separator (Variable=Value,) |
nidx | Implicit numeric field identifiers (1, 2, 3), space as field separator (a b c) |
csv | Field labels in the header line only, text optionally in quotes, comma as field separator (a,b,c) |
pprint | Formatted output from Miller, produces tables |
xtab | Outputs tables vertically, one field label with a value in each line |
Miller is a single utility that lets you combine the effects of several classic Unix tools, like grep, cut, join, sort, tail, head, and sed. The syntax of mlr uses commands with their own options. Table 2 shows a selection of commands for mlr. See the box called "Some Examples" for the sample files used with the mlr commands in this article.
Some Examples
### csv1.txt
first,second,third
a,b,c
d,e,f

### csv2.txt
first,second,third
1,2,3
4,5,6

### csv3.txt
Name,first_name,amount
Miller,Hans,12.34
Meier,Klaus,56.78
Bauer,Stefan,90.12

### csv4.txt
Name,first_name,amount
Schmidt,Johann,12.34
Meier,Klaus,56.78
Albert,Stefan,90.12

### dkvp1.txt
a=1,b=2,c=3
d=4,e=5,f=6

### dkvp2.txt
a=1,b=2,c=3
d=4,e=5
f=7,g=8,h=9
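To make the contrast with conventional utilities concrete, the following sketch builds the kind of pipeline Miller is meant to replace, using csv3.txt from the "Some Examples" box: keep the records with an amount above 20, print only Name and amount, and sort by amount in descending order. The function name classic_pipeline is made up for illustration, and the Miller call in the closing comment is an untested sketch that assumes the filter, cut, and sort commands can be chained with Miller's then keyword.

```shell
#!/bin/sh
# Recreate csv3.txt from the "Some Examples" box.
cat > csv3.txt <<'EOF'
Name,first_name,amount
Miller,Hans,12.34
Meier,Klaus,56.78
Bauer,Stefan,90.12
EOF

# Conventional approach: four tools, each addressing columns by position
# (amount is field 3) rather than by name. Plain, unquoted CSV only.
classic_pipeline() {
    tail -n +2 csv3.txt |
        awk -F, '$3 > 20' |
        cut -d, -f1,3 |
        LC_ALL=C sort -t, -k2,2nr
}

classic_pipeline

# Assumed Miller counterpart (untested sketch, columns addressed by name):
#   mlr --icsv --rs lf --ocsv filter '$amount > 20' \
#       then cut -f Name,amount then sort -nr amount csv3.txt
```

The conventional version has to drop the header by hand and breaks as soon as the column order changes, which is the maintenance burden a name-based tool avoids.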
Table 2
Miller: Command Overview
Command | Options | Function/Notes |
---|---|---|
cat | | Like the cat shell command |
| -n | Adds another column with ascending enumeration on the left |
| -N Name | Like -n, but with a name for the column with the enumeration |
decimate | | Uses every tenth line of data |
| -n N | Uses every Nth line of data |
cut | | Like the cut shell command |
| -f Name,… | Only output the fields with this column name |
| -o | Before -f: Output the fields in the specified order |
| -x | Before -f: Do not output the specified fields |
filter | | Output data lines with the stated features |
| 'FNR == N' | Outputs only the Nth line of data |
grep | | Like the grep shell command, but with a restricted feature set |
| -v | Outputs non-matching lines |
group-by | | Outputs lines with identical values in the stated fields as a group |
group-like | | Outputs lines with identical identifiers as a group |
head | | Outputs the start of a file |
| -n Lines | Number of lines without the header (mandatory) |
join | | Join two files via a shared column |
| -u | Processes unsorted input |
| -j Column,… | States the shared fields |
| -f File | States the file on the left |
rename Old,New | | Rename field designator |
| -r | State the old field name as a regular expression |
reorder | | Change the column order |
| -f Columns | States the order (mandatory) |
| -e | Output the stated columns at the end of the line |
sample | | Output a number of randomly selected lines |
| -k Lines | States the line count, not including headers |
sort | | Sorting |
| -f Name,… | Ascending by stated columns, characters of all types |
| -r Name,… | Descending by stated columns, characters of all types |
| -nf Name,… | Ascending by stated columns, numeric |
| -nr Name,… | Descending by stated columns, numeric |
stats1 | | Computations |
| -a sum -f Column,… | Sum |
| -a count -f Column,… | Record/line count |
| -a mean -f Column,… | Average |
| -a min -f Column,… | Minimum |
| -a max -f Column,… | Maximum |
step | | Stepwise output of computational results |
| -a rsum -f Column,… | Running total, output per line |
| -a delta -f Column,… | Difference between two subsequent lines |
| -a ratio -f Column,… | Ratio between two subsequent lines |
| -a counter -f Column,… | Running count of records |
| -a from-first -f Column,… | Difference from the first record |
tac | | Like the tac shell command (output in reverse order) |
tail | | Output the end of the file (counterpart to head) |
| -n Lines | Number of lines without a header |
top | | Output lines/records with the highest or lowest numeric value |
| -f Column,… | State the columns with the numeric values to compare |
| -a | Output all columns of a line |
| --min | Output the smallest numeric value |
| -n Lines | Number of lines to output |
uniq | | Output identical records grouped |
| -g Column,… | State the columns to be evaluated |
| -n | Only output the number of distinct groups |
| -c | State the number of occurrences for each grouped record |
bar | | Output numeric values as ASCII bar charts |
| -f Column | State the column with the numeric values |
| -c Character | State the bar character (default: *) |
| -x Character | State the character for values outside of the display range (default: #) |
| -b Character | State the padding character (default: .) |
| -w Bar width | State the bar width (default: 40) |
| --lo Value | Lower limit of the bar chart's value range |
| --hi Value | Upper limit of the bar chart's value range |
To separate the fields of the input, you will usually want to use commas. Miller provides options for defining the format for the input, the output, or both together. If you want to set the file format for the input and output separately, use a leading i for the input and an o for the output (for example, --icsv and --opprint). Table 3 lists some important separator options.
Table 3
Separators
Task | Option | Notes |
---|---|---|
Record separator | --rs | e.g., lf or '\r\n' |
Field separator | --fs | e.g., ',' or ';' |
Pair separator | --ps | Separates identifier and value; only relevant for DKVP files |
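As a sketch of what separate input and output field separators accomplish, the following awk call reads semicolon-separated records and writes them comma-separated. The file semi.csv and its contents are made up for illustration, and the Miller call in the closing comment is an untested assumption that applies the i/o prefix rule to the --fs option (--ifs for the input, --ofs for the output).

```shell
#!/bin/sh
# Hypothetical input file with ';' as field separator.
cat > semi.csv <<'EOF'
first;second;third
a;b;c
EOF

# Read with ';' and write with ','. Assigning $1 to itself makes awk
# rebuild the line with the output separator. Plain, unquoted fields only.
convert_separator() {
    awk 'BEGIN { FS = ";"; OFS = "," } { $1 = $1; print }' semi.csv
}

convert_separator

# Assumed Miller counterpart (untested):
#   mlr --csv --rs lf --ifs ';' --ofs ',' cat semi.csv
```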
Output
The cat command reads from text files and outputs them, appropriately formatted if necessary, to a pipe, a file, or the screen. The call in the first line of Listing 1 outputs the two specified files in succession with the column headings (Figure 1). In addition, Miller automatically adds its own numerical identifiers for the fields.
Listing 1
Miller's cat
01 $ mlr cat csv1.txt csv2.txt
02 $ mlr --csv --rs lf cat csv1.txt csv2.txt
03 $ mlr --opprint cat csv1.txt csv2.txt
04 $ mlr --opprint --csv --rs lf cat csv1.txt csv2.txt
05 $ mlr --csv --rs lf --opprint cat csv1.txt csv2.txt
06 $ mlr --icsv --rs lf --odkvp cat csv1.txt > newdkvp.txt
07 $ mlr --idkvp --ocsv --rs lf cat dkvp1.txt > newcsv.txt
08 $ mlr --icsv --rs lf --oxtab cat csv3.txt > newxtab.txt
If you specify the file type (csv in Listing 1) and a newline (--rs lf) as the separator for the data, Miller does not enumerate (Listing 1, line 2). It also groups identical column headings into a single heading (Figure 2).
The --opprint option gives you even clearer output (Listing 1, line 3), but with a minor flaw: The program inserts its own column headings (Figure 3, first line) and lists the headings from the input files like records.
The order of options affects the results (Figure 4). While the option --opprint is apparently ignored by the call in line 4 of Listing 1, it works correctly in the opposite order (Listing 1, line 5): The software combines the identical headings and displays the values aligned under the header.
Miller Converts
Using cat, Miller converts the formats listed in Table 1. Put an i in front of the name of the input format and an o in front of the name of the output format, and Miller creates a DKVP file from a CSV file (Listing 1, line 6). The reverse approach works in the same way (Listing 1, line 7).
Converting to a line-by-line display (XTAB format) is useful, for example, when creating non-GUI applications, say, querying addresses (Listing 1, line 8). You will find the processed examples in Figure 5.
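The CSV-to-DKVP mapping from line 6 of Listing 1 can be sketched with awk to show what the conversion does to each record: every value is paired with the label from the header line. This is an illustration for plain, unquoted CSV, not Miller's implementation; the function name csv_to_dkvp is made up.

```shell
#!/bin/sh
# Recreate csv1.txt from the "Some Examples" box.
cat > csv1.txt <<'EOF'
first,second,third
a,b,c
d,e,f
EOF

# Remember the labels from the header line, then print each record
# as label=value pairs separated by commas.
csv_to_dkvp() {
    awk -F, '
        NR == 1 { for (i = 1; i <= NF; i++) label[i] = $i; next }
        {
            line = ""
            for (i = 1; i <= NF; i++) {
                sep = (i > 1) ? "," : ""
                line = line sep label[i] "=" $i
            }
            print line
        }
    ' "$1"
}

csv_to_dkvp csv1.txt
```

Because each DKVP record carries its own labels, records with different sets of fields can share one file, as dkvp2.txt in the "Some Examples" box shows.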
Searching and Finding
For browsing structured text files, Miller has the grep and filter commands. filter has a variety of options, particularly with regard to numerical evaluations. The software always outputs the header. The example from the first two lines of Listing 2 shows how to browse csv3.txt for the name "Meier". With the filter command, you specify the column; grep does not need the column. The first method is thus more precise because the term could exist in multiple columns.
Listing 2
Looking for a Name
$ mlr --csv --rs lf filter '($Name == "Meier")' csv3.txt
$ mlr --csv --rs lf grep 'Meier' csv3.txt
$ mlr --csv --rs lf filter '($amount > 20)' csv3.txt
The example in the last line of Listing 2 shows the results of a numerical analysis. Miller extracts all amounts greater than 20 Euros from csv3.txt.
Figure 6 shows the three commands, as well as the resulting output.
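The difference in precision between a column-specific match and a whole-line match can be sketched with conventional tools. The file people.csv is hypothetical: It extends csv3.txt with one made-up record in which "Meier" appears as a first name. The awk call stands in for the column-based filter and the grep call for the column-agnostic search; neither reproduces Miller's own implementation.

```shell
#!/bin/sh
# Hypothetical file: csv3.txt plus one made-up record (Schulz,Meier).
cat > people.csv <<'EOF'
Name,first_name,amount
Miller,Hans,12.34
Meier,Klaus,56.78
Bauer,Stefan,90.12
Schulz,Meier,1.00
EOF

# Column-specific: header plus records whose first field (Name) is "Meier".
match_name_column() {
    awk -F, 'NR == 1 || $1 == "Meier"' people.csv
}

# Whole-line: header plus every record containing "Meier" in any column.
match_anywhere() {
    head -n 1 people.csv
    tail -n +2 people.csv | grep 'Meier'
}

match_name_column
match_anywhere
```

The column-specific match returns only the Meier,Klaus record, whereas the whole-line match also returns Schulz,Meier, which is the kind of false hit the column-based approach rules out.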