Useful tricks for processing CSV files

csvquote

All the tricks shown so far rely on the seldom mentioned but vital assumption that all the CSV files you need to process are clean, with no quoted fields that contain commas, newlines, or other separators. Luckily, as often happens in the open source world, there is a little-known tool made to order to solve this problem, called csvquote [4]. It is not available as a Linux package, but its installation is deliciously simple: Download the source code, unpack it, and then type these two commands at the prompt, just like the Linux users of yore did in the 1990s:

$ make
$ sudo make install

This will install the csvquote program as well as csvheader, which is a convenient script that prints out the field numbers next to the first row of data.

Using csvquote is as simple as it can be: Just put it before and after all the other tools that will do the actual processing, like this:

$ csvquote input.csv | script1 | script2 | ... | csvquote -u > output.csv

The first call neutralizes any troublesome characters lurking in the input file by "temporarily replacing the special characters inside quoted fields with harmless nonprinting characters that can be processed as data by regular text tools" [4].
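
To see the substitution in action, consider a tiny made-up file, sample.csv, whose second row contains quoted commas. A naive awk field count splits that row into four pieces, while the stream sanitized by csvquote splits it correctly into two:

$ printf '%s\n' 'name,comment' '"Doe, Jane","fine, thanks"' > sample.csv
$ awk -F, '{print NF}' sample.csv
2
4
$ csvquote sample.csv | awk -F, '{print NF}'
2
2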

The second run, at the end of the processing pipeline with the -u option, restores all those nonprinting characters to their original values. By default, csvquote assumes that fields are enclosed in double quotes and separated by commas, and that records are separated by the standard newline character (\n). However, you can process files that use different conventions with the -d, -q, and -r switches.
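
For instance, a pipeline like the following should sort a semicolon-delimited file on its second field (data.csv is just a stand-in name, and the header row is ignored for simplicity), assuming the -d switch applies to the restoring call as well:

$ csvquote -d ';' data.csv | sort -t ';' -k2,2 | csvquote -u -d ';' > sorted.csv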

While csvquote cannot handle multicharacter delimiters, I have yet to encounter CSV files that use them (fingers crossed). Also, by its very nature, csvquote makes it impossible for the downstream tools it feeds to replace characters such as commas and newlines inside quoted fields. The workaround is to search and replace the special characters csvquote puts in their place (see [4] for these characters), as shown below.
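
For instance, assuming the default substitute for the field delimiter is the ASCII unit separator (0x1F, per the documentation in [4]), this sketch would turn the commas inside quoted fields into semicolons while leaving the structural commas alone:

$ csvquote input.csv | tr '\037' ';' | csvquote -u > output.csv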

Conclusion

CSV files will never be usable as real, heavy-load databases, but they are so common and easy to process that they are impossible to ignore. The tricks I've collected here are just a small selection of the many more you can find, as I did, on online forums. (See also the "Missing: CSVfix" box). Try these tricks, and you'll be happy!

Missing: CSVfix

The main problem with CSV files is that it takes only one missing or extra separator or quote to make all the tricks mentioned in these pages fail. At that point, you are forced to visually inspect the file and fix the problem manually.

This wasn't always the case. More than 10 years ago, you could repair many of these problems (as well as lots of other stuff) with CSVfix [5]. Unfortunately, CSVfix was orphaned about a decade ago, leaving only the documentation and a bundle of source code, which, as far as I understand, would need significant work to run again on modern distributions.

This is a shame, because CSVfix could, among many other things:

  • Reorder, remove, split, and merge fields
  • Convert case
  • Trim leading and trailing spaces
  • Search for specific content using regular expressions
  • Filter out duplicate data or data on exclusion lists
  • Split large CSV files into smaller files based on field contents
  • Perform arithmetic calculations on individual fields

It would be wonderful if this tutorial encouraged a developer to take up the maintenance of this project.

Infos

  1. "Tracking Your Finances with Plain Text Accounting" by Marco Fioretti, Linux Magazine, issue 285, August 2024, p. 38, https://www.linux-magazine.com/Issues/2024/285/hledger
  2. Formulas in CSV files: https://mzonline.com/blog/2020-08/formulas-csv-files
  3. csvtool: https://github.com/maroofi/csvtool
  4. csvquote: https://github.com/dbro/csvquote
  5. CSVfix: https://wlbr.de/csvfix/

The Author

Marco Fioretti (https://mfioretti.com) is a freelance author, trainer, and researcher based in Rome, Italy, who writes about digital rights issues at https://mfioretti.substack.com.
