Useful tricks for processing CSV files

csvquote

All the tricks shown so far rely on the seldom mentioned but vital assumption that all the CSV files you need to process are clean, with no quoted fields that contain commas, newlines, or other separators. Luckily, as often happens in the open source world, there is a little-known tool made to order to solve this problem, called csvquote [4]. It is not available as a Linux package, but its installation is deliciously simple: Download the source code, unpack it, and then type these two commands at the prompt, just like the Linux users of yore did in the 1990s:

$ make
$ sudo make install

This will install the csvquote program as well as csvheader, which is a convenient script that prints out the field numbers next to the first row of data.

Using csvquote is as simple as it can be: Just put it before and after all the other tools that will do the actual processing, like this:

$ csvquote input.csv | script1 | script2 | ... | csvquote -u > output.csv

The first call neutralizes any troublesome characters lurking in the input file by "temporarily replacing the special characters inside quoted fields with harmless nonprinting characters that can be processed as data by regular text tools" [4].
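
To see the substitution in action, consider a tiny made-up file, sample.csv, whose second row contains quoted commas. A naive awk field count splits that row into four pieces, while the stream sanitized by csvquote splits it correctly into two:

$ printf '%s\n' 'name,comment' '"Doe, Jane","fine, thanks"' > sample.csv
$ awk -F, '{print NF}' sample.csv
2
4
$ csvquote sample.csv | awk -F, '{print NF}'
2
2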

The second run, at the end of the processing pipeline with the -u option, restores all those nonprinting characters to their original values. By default, csvquote assumes that fields are enclosed in double quotes and separated by commas, and that records are separated by the standard newline character (\n). However, you can process files that use different conventions with the -d, -q, and -r switches.
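
For instance, a pipeline like the following should sort a semicolon-delimited file on its second field (data.csv is just a stand-in name, and the header row is ignored for simplicity), assuming the -d switch applies to the restoring call as well:

$ csvquote -d ';' data.csv | sort -t ';' -k2,2 | csvquote -u -d ';' > sorted.csv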

While csvquote cannot handle multicharacter delimiters, I have yet to encounter CSV files that use them (fingers crossed). Also, by its very nature, csvquote makes it impossible for the downstream tools it feeds to replace characters such as commas and newlines inside quoted fields. The workaround is to search and replace the special characters csvquote puts in their place (see [4] for these characters), as shown below.
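
For instance, assuming the default substitute for the field delimiter is the ASCII unit separator (0x1F, per the documentation in [4]), this sketch would turn the commas inside quoted fields into semicolons while leaving the structural commas alone:

$ csvquote input.csv | tr '\037' ';' | csvquote -u > output.csv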

Conclusion

CSV files will never be usable as real, heavy-load databases, but they are so common and easy to process that they are impossible to ignore. The tricks I've collected here are just a small selection of the many more you can find, as I did, on online forums. (See also the "Missing: CSVfix" box). Try these tricks, and you'll be happy!

Missing: CSVfix

The main problem with CSV files is that it takes only one missing or extra separator or quote to make all the tricks mentioned in these pages fail. At that point, you are forced to visually inspect the file and fix the problem manually.

This wasn't always the case. More than 10 years ago, you could repair many of these problems (as well as lots of other stuff) with CSVfix [5]. Unfortunately, CSVfix was orphaned about a decade ago, leaving only the documentation and a bundle of source code, which, as far as I understand, would need significant work to run again on modern distributions.

This is a shame, because CSVfix could, among many other things:

  • Reorder, remove, split, and merge fields
  • Convert case
  • Trim leading and trailing spaces
  • Search for specific content using regular expressions
  • Filter out duplicate data or data on exclusion lists
  • Split large CSV files into smaller files based on field contents
  • Perform arithmetic calculations on individual fields

It would be wonderful if this tutorial encouraged a developer to take up the maintenance of this project.

Infos

  1. "Tracking Your Finances with Plain Text Accounting" by Marco Fioretti, Linux Magazine, issue 285, August 2024, p. 38, https://www.linux-magazine.com/Issues/2024/285/hledger
  2. Formulas in CSV files: https://mzonline.com/blog/2020-08/formulas-csv-files
  3. csvtool: https://github.com/maroofi/csvtool
  4. csvquote: https://github.com/dbro/csvquote
  5. CSVfix: https://wlbr.de/csvfix/

The Author

Marco Fioretti (https://mfioretti.com) is a freelance author, trainer, and researcher based in Rome, Italy, who writes about digital rights issues at https://mfioretti.substack.com.
