Useful tricks for processing CSV files
csvquote
All the tricks shown so far rely on the seldom-mentioned but vital assumption that the CSV files you need to process are clean, with no quoted fields containing commas, newlines, or any other separators. Luckily, as often happens in the open source world, there is a little-known tool made to order to solve this problem, called csvquote [4]. It is not available as a Linux package, but its installation is deliciously simple: Download the source code, unpack it, and then type these two commands at the prompt, just like the Linux users of yore did in the 1990s:
$ make
$ sudo make install
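If you prefer, you can clone the project's Git repository [4] instead of downloading and unpacking an archive; a minimal sketch (the clone directory simply takes the repository's name):

$ git clone https://github.com/dbro/csvquote.git
$ cd csvquote
$ make
$ sudo make install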
This installs the csvquote program as well as csvheader, a convenient script that prints the field numbers next to the first row of data.
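For example, on a hypothetical contacts.csv file whose header row is name,email,city, csvheader would list each column alongside its number, roughly like this (the exact output format may differ slightly; check [4]):

$ csvheader contacts.csv
1 name
2 email
3 city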
Using csvquote is as simple as it can be: Just put it before and after all the other tools that will do the actual processing, like this:
$ csvquote input.csv | script1 | script2 | ... | csvquote -u > output.csv
The first call masks any troublesome characters lurking in the input file by "temporarily replacing the special characters inside quoted fields with harmless nonprinting characters that can be processed as data by regular text tools" [4].
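As a concrete sketch, assuming a hypothetical addresses.csv whose fields may contain embedded commas, the csvquote sandwich lets a naive comma-based cut extract a column correctly:

# extract the second field, even when quoted fields contain commas
$ csvquote addresses.csv | cut -d, -f2 | csvquote -u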
The second run, at the end of the processing pipeline with the -u option, restores all those nonprinting characters to their previous values. By default, csvquote assumes that fields are enclosed in double quotes and separated by commas, and that records are separated by the standard newline character (\n). However, files that use other delimiters, quote characters, or record separators can be handled with the -d, -q, and -r switches.
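For instance, a semicolon-separated file (here a hypothetical data.csv) could be sorted like this. Presumably, the final call needs the same -d switch so that the restored characters are semicolons again; verify this against [4]:

$ csvquote -d ';' data.csv | sort | csvquote -u -d ';' > sorted.csv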
While csvquote cannot handle multi-character delimiters, I have yet to encounter CSV files that use them (fingers crossed). Also, by its very nature, csvquote makes it impossible for the tools sandwiched between its two calls to replace characters such as commas and newlines inside quoted fields, because those characters are temporarily hidden. The workaround is to search and replace the special characters that csvquote puts in their place (see [4] for these special characters).
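As a sketch of that workaround: assuming that the default substitution character for embedded commas is the ASCII unit separator (0x1f), as the project documentation suggests (do verify this in [4]), you could turn every comma inside quoted fields into a semicolon like this:

# replace embedded commas (temporarily stored as 0x1f) with semicolons
$ csvquote input.csv | sed 's/\x1f/;/g' | csvquote -u > output.csv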
Conclusion
CSV files will never be usable as real, heavy-load databases, but they are so common and so easy to process that they are impossible to ignore. The tricks I've collected here are just a small selection of the many more you can find, as I did, in online forums. (See also the "Missing: CSVfix" box.) Try these tricks, and you'll be happy!
Missing: CSVfix
The main problem with CSV files is that it takes only one missing or extra separator or quote to make all the tricks mentioned in these pages fail. At that point, you are forced to visually inspect the file and fix the problem manually.
This wasn't always the case. More than 10 years ago, you could repair many of these problems (and do lots of other stuff) with CSVfix [5]. Unfortunately, CSVfix was orphaned about a decade ago, leaving behind only the documentation and a bundle of source code that, as far as I understand, would need significant work to run on modern distributions.
This is a shame, because CSVfix could, among many other things:
- Reorder, remove, split, and merge fields
- Convert case
- Trim leading and trailing spaces
- Search for specific content using regular expressions
- Filter out duplicate data or data on exclusion lists
- Split large CSV files into smaller files based on field contents
- Perform arithmetic calculations on individual fields
It would be wonderful if this tutorial encouraged a developer to take up the maintenance of this project.
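Until someone does, a few of the operations in the list above can be approximated with standard tools wrapped in a csvquote sandwich. A rough sketch (the file name and field layout are hypothetical):

# trim leading/trailing spaces in every field and uppercase the second one
$ csvquote data.csv | awk -F, -v OFS=, '{for (i=1; i<=NF; i++) gsub(/^ +| +$/, "", $i); $2 = toupper($2); print}' | csvquote -u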
Infos
- "Tracking Your Finances with Plain Text Accounting" by Marco Fioretti, Linux Magazine, issue 285, August 2024, p. 38, https://www.linux-magazine.com/Issues/2024/285/hledger
- Formulas in CSV files: https://mzonline.com/blog/2020-08/formulas-csv-files
- csvtool: https://github.com/maroofi/csvtool
- csvquote: https://github.com/dbro/csvquote
- CSVfix: https://wlbr.de/csvfix/
