Useful tricks for processing CSV files
CSV and Me
With a few tools, you can harness the data available in even the most troublesome CSV files.
You can say what you want to about visual programming, vibe coding, and AI. The truth is, the world is chock full of data that can be converted into simple formats and processed with simple tools.
This article describes some tricks I've collected over the years for dealing with comma-separated value (CSV) files. I'll also present a little tool that will make those tricks work even on troublesome data files.
CSV files are simply plain text files that, if necessary, you could read on any terminal with commands like cat or more or write and modify with a text editor.
Because CSV files are in plain text, they can contain any kind of data that can be expressed with words or numbers, as long as the data is organized in records made of multiple fields. For example, a file about a company's employees might contain names, birth dates, job titles, salaries, departments, and so on. A record of a CSV file (in this example, all the information about one specific employee) is written on one and only one line in which the several fields are separated by commas. The names of the fields (or the columns) representing different values within a record are in the first line of the file, also separated by commas.
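As a quick sketch (the file name and all the data here are made up for illustration), such an employee file can be created and inspected entirely from the shell:

```shell
# Create a tiny sample employee file: the first line holds the
# column names, every following line is one record
cat > employees.csv <<'EOF'
Name,BirthDate,JobTitle,Salary,Department
Ada Smith,1985-04-02,Engineer,58000,R&D
Bob Jones,1990-11-17,Accountant,52000,Finance
EOF

# Plain text: any terminal tool can read it
head -1 employees.csv
```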
The main (I'd say huge) advantage of CSV files is their simplicity. You have just read all you really need to know to start creating or processing CSV files. Don't get me wrong: Actual database-based services (i.e., payroll, gaming, e-commerce, etc.) require relational database servers that store all the data in highly optimized binary formats. However, CSV is unbeatable as a long-term, super-portable storage and interchange format that even novices can use. This simplicity, plus the billions of CSV files that already exist around the world, means that CSV processing is a skill that is easy to acquire and impossible to ignore.
In the real world, CSV files can sometimes be difficult to handle. One minor problem is that, in spite of the name, there is no guarantee that the columns of a CSV file will be separated by commas. I, for example, prefer to use the pipe character (|) in my CSV files because it makes them more readable. Tab characters are another very common choice for a column separator, so much so that files that use tabs often get a .tsv extension instead of .csv. Before using a .csv file, you should always look inside the file to determine which character is used to separate the columns.
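A quick way to guess the separator (a rough heuristic, not a real parser; sample.csv here is a made-up pipe-separated file) is to count candidate delimiters in the header line:

```shell
# Made-up pipe-separated file for the demonstration
printf 'Name|Role|City\nAda|Engineer|Turin\n' > sample.csv

# Count how often each candidate delimiter appears in the first line;
# the most frequent one is usually the real separator
for sep in ',' '|' ';' "$(printf '\t')"; do
  count=$(head -1 sample.csv | tr -cd "$sep" | wc -c | tr -d ' ')
  echo "separator [$sep] appears $count times"
done
```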
A much bigger problem is that a data field might contain the very character used as the field separator. Consider the CSV file in Listing 1. To the human eye, it contains three data columns: the maker (Ferrari), the models (Testarossa and F40), and the minimum price (EUR240,000 – I made this price up!). Ordinary CSV parsing code, however, would find five columns in that file, not three.
Listing 1
Sample CSV File
Maker, Models, Minimum Price
Ferrari, Testarossa, F40, 240,000
It gets even worse when a field contains a newline (yes, that happens too!) as shown in Listing 2. While a human could figure out that there is only one record, CSV parsers would see two incomplete records, concluding that Testarossa is a car maker, not a model.
Listing 2
CSV File with Extra Newline
Maker, Models, Minimum Price
Ferrari,
Testarossa, F40, 240,000
To accommodate genuine fields that contain newlines or commas, you must put the data inside double quotes, as shown in Listing 3. If a given field contains a double quote as part of the data, you must escape it: The standard convention (codified in RFC 4180) is to double the quote character, although some parsers also accept a backslash escape.
Listing 3
Using Double Quotes
Maker, Models, Minimum Price
"Ferrari", "Testarossa, F40", "240,000"
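Note, however, that standard AWK is not quote-aware: A naive comma split counts five fields in the quoted line just as in the unquoted one, which is exactly why a quote-aware parser is needed for such files:

```shell
# Unquoted line: a naive comma split finds five fields
printf 'Ferrari, Testarossa, F40, 240,000\n' | awk -F',' '{print NF}'

# The quoted line also yields five fields to the same naive split,
# because plain AWK does not honor the double quotes
printf '"Ferrari","Testarossa, F40","240,000"\n' | awk -F',' '{print NF}'
```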
Despite these conventions, there is no guarantee that a CSV file created by someone else will be properly formatted. In addition, because you can open and process CSV files with any editor, it is quite easy to introduce errors by accidentally entering a comma or pressing the Enter key prior to saving.
Luckily, this article offers a solution to improperly formatted files (as long as the fields are in quotes). Before getting into that solution, I'll cover the most common operations on CSV files with simple text-processing tools available in any Linux distribution. While you could use LibreOffice Calc (if the files contain no errors!), why do something manually that you can tell your computer to do? This is particularly true for multiple large files that would be time consuming to edit by hand.
Adding Columns
To get started, I'll show you how to add columns to a CSV file. If all you need is to place a numeric index before all the other fields, you can achieve this with any of these very simple AWK commands, which assume that the original CSV file has one header line and no empty lines:
> awk '{print (NR==1?"":n++)","$0}' original.csv > new.csv
> awk '{print (NR == 1 ? "" : NR-2) "," $0}' original.csv > new.csv
In AWK, which processes text files one line at a time, $0 is the whole line currently being parsed; $1, $2, and so on are its fields; and NR is its number. The first command tells AWK that, before printing the current line, it must print and then increment a line counter called n, unless it's on the very first line. The second command does the same thing without using an auxiliary counter, because it prints NR itself, minus 2.
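As a quick sanity check, with a made-up two-record file, the second one-liner behaves like this:

```shell
# Tiny sample file (made-up names and values)
printf 'Name,Score\nAda,10\nBob,7\n' > original.csv

# Prepend a zero-based index to every record; note that the header
# line gets an empty index field (i.e., a leading comma)
awk '{print (NR == 1 ? "" : NR-2) "," $0}' original.csv > new.csv
cat new.csv
```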
Things get a bit more complicated if you need to insert a new column between two existing ones. To insert a column that always has the same value (HELLO!) right after the first column in the original file (for simplicity, I'll assume only three columns), you would use
> awk 'BEGIN { FS = "," ; OFS = "," } ; {print $1, "HELLO!", $2, $3}' original.csv > new.csv
This will transform CSV records like p,4.6,Bill into p,HELLO!,4.6,Bill. Of course, the constant value of the new column could be anything, from the date of a new file revision to an empty placeholder string that will be filled in manually later. The BEGIN block, which may contain any number of statements, tells AWK what to do, or know, before it starts actually reading the input file. In this case, it says that both the input field separator (FS) and the output field separator (OFS) are commas. Setting OFS to the pipe character, instead, would have transformed the sample line above into p|HELLO!|4.6|Bill. You can also use the BEGIN block to print whole new headers for the column names. Last but not least, note that you can use the same syntax to rearrange columns in any order you want.
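A runnable check of that one-liner, using the sample records from the text, looks like this:

```shell
# Sample records matching the ones mentioned in the text
printf 'p,4.6,Bill\nq,7.1,Anna\n' > original.csv

# Insert a constant "HELLO!" column right after the first field
awk 'BEGIN { FS = ","; OFS = "," } { print $1, "HELLO!", $2, $3 }' original.csv > new.csv
cat new.csv
```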
The obvious limit to this approach is that you cannot put different values in every field of the new column. To do that, write all the new values in a separate file first, in the right order, one per line as shown in Listing 4. Then use the paste command with Bash sub-processes to put it in a new column:
Listing 4
The extracolumn.csv File
Hello
Linux Magazine
Readers
> paste -d',' <(cut -d',' -f1 original.csv) extracolumn.csv <(cut -d',' -f2- original.csv)
Here, the paste command is told to merge three separate "files," treating each one as a separate column and using commas as column delimiters (that's what the -d option does). The second file is the one shown in Listing 4 (extracolumn.csv). The other two are not real files at all, but rather the outputs of two separate Bash sub-processes, created by the <(...) syntax.
Inside each of those sub-processes, the cut command extracts (from the file it's given) the column (or columns) specified with the -f switch, using the separator attached to the -d switch. Then, thanks to the <(...) process substitution operator, the whole result is passed to the main command as if it were a temporary file. What's cool here is that each sub-process could also contain Bash code, or any script, that generates a column on the fly. Inserting this bit of code:
<(for i in $(seq 1 30); do echo $i; done)
in place of extracolumn.csv in the previous command, would give the new file a second column with a progressive index number. Alternatively, you could insert a script in that sub-process to query a database in any way you want.
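Here is a compact, hedged demonstration of that idea, with a made-up three-record file and a generated index column (using <(seq 1 3) directly, which is equivalent to the loop above for three rows):

```shell
# Three-column sample file (made-up data, no header for simplicity)
printf 'a,x,1\nb,y,2\nc,z,3\n' > original.csv

# Insert a generated index as a new second column: the <(seq 1 3)
# sub-process produces one number per line, in place of a real file
paste -d',' <(cut -d',' -f1 original.csv) <(seq 1 3) <(cut -d',' -f2- original.csv) > indexed.csv
cat indexed.csv
```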
What if the values in the column to be added depend on the values of the other columns? For example, how could you add a column with the sum of all the previous numeric columns to the file shown in Listing 5?
Listing 5
numbers.csv
A,15,-3
B,21,7
One way to do this is, again, with AWK:
> awk 'BEGIN {FS=","; OFS=","} {sum=0; for (i=2; i<=NF; i++) {sum+= $i} print $1, $2, $3, sum}' numbers.csv > sum-in-column.csv
Here, the sum variable is reset to zero at every line and then incremented with the values of all the fields in that line, from the second to the last, before everything is printed. NF is the AWK variable that holds the number of fields in a line.
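Run against the numbers.csv file from Listing 5, the technique produces the following (printing $0 followed by sum, which is equivalent to $1, $2, $3, sum for a three-column file but also works for any width):

```shell
# The numbers.csv file from Listing 5
printf 'A,15,-3\nB,21,7\n' > numbers.csv

# Append the per-row sum of all numeric fields (second field onward)
awk 'BEGIN {FS=","; OFS=","} {sum=0; for (i=2; i<=NF; i++) sum += $i; print $0, sum}' numbers.csv > sum-in-column.csv
cat sum-in-column.csv
```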
You can use a similar technique to add an extra row with the totals of each column:
> awk 'BEGIN {FS=","; OFS=","} {for(i=2;i<=NF;i++)a[i]+=$i;print $0} END{l="SUM";i=2;while(i in a){l=l","a[i];i++};print l}' numbers.csv
While parsing every line, the code adds the value of each field of the current row to the corresponding element of the a array and then prints the whole line ($0 is the AWK way to say "whole line"). When the file is finished, the END part of the code kicks in, printing the word SUM and then the elements of the a array, separated by commas, as shown in Listing 6.
Listing 6
numbers.csv with an Extra Row
A,15,-3
B,21,7
SUM,36,4
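The same one-liner, run end to end as a quick check:

```shell
# The numbers.csv file from Listing 5
printf 'A,15,-3\nB,21,7\n' > numbers.csv

# Accumulate per-column totals while echoing each line, then print
# a final SUM row once the input is exhausted
awk 'BEGIN {FS=","; OFS=","} {for (i=2; i<=NF; i++) a[i]+=$i; print $0} END {l="SUM"; i=2; while (i in a) {l=l","a[i]; i++}; print l}' numbers.csv > with-sums.csv
cat with-sums.csv
```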
Before moving to a very different technique, it's worth noting that AWK can also remove columns very easily, for example, the first two columns:
> awk 'BEGIN {FS=","; OFS=","} {$1=$2=""; print $0}' somefile.csv
by simply making their values empty before printing the current line (note that the separators remain, so the "removed" columns survive as empty fields).
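A small demonstration of that behavior on a made-up file:

```shell
# Sample file (made-up values)
printf 'A,15,-3\nB,21,7\n' > somefile.csv

# Blank out the first two fields; the commas remain, so each
# output line starts with two empty fields (",,")
awk 'BEGIN {FS=","; OFS=","} {$1=$2=""; print $0}' somefile.csv > trimmed.csv
cat trimmed.csv
```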
Add Formulas in CSV Files
Almost every user of graphical spreadsheet software like LibreOffice Calc knows that it can load CSV files as well as save spreadsheets in that format. Many of those users also believe that CSV files cannot contain commands or dynamic data like formulas. That's not true, however. As mentioned earlier, CSV files can contain any kind of data that can be expressed with words or numbers, which includes ordinary spreadsheet formulas. As most users know, LibreOffice Calc (or any other spreadsheet software) lets you enter formulas that automatically recalculate any time a cell they refer to changes. As shown in Figure 1, those formulas are combinations of letters, numbers, parentheses, and punctuation.
However, many users overlook (or may have never noticed) LibreOffice's ability to automatically process CSV files, as shown in the LibreOffice dialogs in Figures 2 and 3. Figure 2 shows the dialog where users can specify how to save the active sheet of their current spreadsheet in CSV format. Besides character encoding, field delimiter, and quoting, it's also possible to select Save cell formulas instead of calculated values.
Ticking that option before saving the spreadsheet in Figure 1 as a CSV file will produce a text file with exactly that formula in the first field of the first row. If, later, you ask LibreOffice to open that file, you'll be prompted to fill out the dialog in Figure 3. If you select Evaluate formulas in that dialog, LibreOffice will treat every string starting with an equal (=) character as the formula you entered manually in its interface.
Learning the formula operators and syntax that LibreOffice uses is really simple, and tons of tutorials and examples are available online. You can use that knowledge, together with any text-processing tool available on Linux, to automatically generate CSV files containing any formulas you want, and spreadsheet software such as LibreOffice Calc or Microsoft Excel will recognize and evaluate those formulas as if you had entered them manually.
For the record, this is what I do with my tax returns. If you read my plain text accounting tutorial [1], you may remember that I use two CSV files, income.csv and expenses.csv (using pipes instead of commas), with the structure shown in Listing 7.
Listing 7
Snippet of expenses.csv File
2024-12-18| -100 | accountant fees
2024-12-24| -263.72 | taxes
When it's time to send my accountant all the information he needs to prepare my tax return, I have a script that concatenates those two files, calculates the number and position of all the income or expense lines, and then appends to the end lines (where the actual row numbers are calculated on the fly) the following:
Total income|"=sum(B2:B25)"|
Total expenses|"=sum(B26:B38)"|
My actual script is more complicated than that, because both income and expenses are partitioned in several sub-categories, but you get the drift: Once the script has done its job, I can open the result with LibreOffice Calc, quickly check its content to be sure everything is OK, then save the file in ODS format (LibreOffice's default), and send it to my accountant. You can find more examples of script-based generation of formulas in CSV files online [2].
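A minimal, hypothetical sketch of that idea (made-up file name and amounts, a single category, and the amounts assumed to sit in spreadsheet column B) could look like this:

```shell
# Hypothetical two-line, pipe-separated ledger (values are made up)
printf '2024-12-18|-100|accountant fees\n2024-12-24|-263.72|taxes\n' > expenses.csv

# Data occupies spreadsheet rows 1..N, so the formula references B1:BN
rows=$(( $(wc -l < expenses.csv) ))
printf 'Total expenses|"=SUM(B1:B%d)"|\n' "$rows" >> expenses.csv
tail -1 expenses.csv
```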
Strictly speaking, there is no need to automate this or any other spreadsheet-based workflow with CSV files. You could do the same things directly in LibreOffice's native format, OpenDocument, because its files are little more than ZIP archives of XML files (that is, plain text again). Such files, however, are really, really verbose. The beauty of CSV files is that you can get basically the same results by generating or processing much less text, with much simpler means, using some specialized tools.
csvtool
The command-line utility csvtool [3] (available in the repositories of the major Linux distributions) can merge CSV files in several ways and is written specifically to handle many CSV processing and formatting tasks. While I prefer more generalist tools like AWK and Perl scripts to accommodate the extra processing I require, csvtool is worth knowing because it has certain options that are much harder to replicate with AWK one-liners and are therefore much faster to include in shell scripts. For example, if you only need to extract certain columns from a CSV file, just write
> csvtool col 1-3,6 input.csv
If you just need the columns titled Address and PhoneNumber, you can use the namedcol command:
> csvtool namedcol Address,PhoneNumber addressbook.csv
Also useful is the width command, which returns the number of columns in the widest row. Then there are commands that you may seldom need but that can save the day, like setcolumns cols, which makes all rows have the same number of columns, adding blank fields to short rows and truncating those which are too long. Finally, with csvtool you can transpose all the lines and columns of a CSV file by just typing
csvtool transpose input.csv
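If csvtool isn't installed, a rough transpose for simple files without quoted fields can be sketched in AWK (a naive version that keeps the whole file in memory):

```shell
# Made-up two-row, three-column input
printf 'a,b,c\n1,2,3\n' > input.csv

# Store every cell, then emit columns as rows
awk 'BEGIN {FS=","; OFS=","}
     {for (i=1; i<=NF; i++) cell[i,NR] = $i; rows = NR; cols = NF}
     END {for (i=1; i<=cols; i++) {
            line = cell[i,1]
            for (j=2; j<=rows; j++) line = line OFS cell[i,j]
            print line
          }}' input.csv > transposed.csv
cat transposed.csv
```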
