Complex Containers

Tutorials – Bash Arrays

Article from Issue 220/2019
Author(s): Marco Fioretti

Discover the data structures that make your shell scripts really powerful.

In the first installment of this tutorial [1], I described the main properties of shell variables, which are the basic data structures of the Bash shell. This time, I am going to show you how to handle more complex containers for your data, which make much harder (and cooler!) things possible: Bash arrays.

Note: Unlike almost anything else in the other installments of this tutorial, most of what you will learn here only works with Bash v4 or later. If you are using any reasonably modern Linux distribution, this should not be a problem. However, just to be sure, check your version by typing

echo $BASH_VERSION

at the prompt.

Strictly speaking, a Bash array is still a variable, meaning a data container with a unique name, at least in the script or programming scope in which it is called. The Advanced Bash-Scripting Guide [2] contains more shell array usage examples than you may ever use or want to know, but you do not need to learn them all to make productive use of these structures.

A Bash array's defining property is that each array can contain multiple values, each with its own distinct identifier. Arrays are the tools that Bash puts at your disposal to aggregate multiple objects and treat them as one entity, while preserving the ability to distinguish among them. Basically, you can use arrays to keep all the values of any imaginable "set" or "group" together.

In practice, the first thing to know about Bash arrays is that there are two types: plain arrays (which I will simply call arrays) and associative arrays (hashes). Each array or hash can contain values of different types, without built-in limits to their size.

The difference between arrays and hashes is the way their single elements are referenced. Arrays address each element with a numeric identifier, starting from zero. Inside hashes, on the other hand, each value is associated with a key, which may be any text string.

That difference corresponds to two, potentially very different use cases. Use arrays when all that matters is ordering (i.e., whenever all you need to know about each component of a set is its position in the sequence). When what matters is the relationship between pairs of elements, use hashes instead. To understand the difference, compare Table 1 and Table 2. Although Table 1 is ordered alphabetically for our reading convenience, its purpose is to associate each city with one of its own intrinsic properties, regardless of other cities' properties. Table 2, on the other hand, is all about ranking. If Madrid had only three residents and Rome had two, only the first table would change. Therefore, you use a hash to store Table 1 and an array to store Table 2.

Table 1

Population of EU Capitals (2012)

Berlin

3.5M

London

7.4M

Madrid

3.2M

Rome

2.6M

Table 2

EU Capitals, Ranked by Population (2012)

1) London

2) Berlin

3) Madrid

4) Rome

Now, I'll show you how to process arrays and hashes, before dealing with a more complex, real-world script that uses them. First of all, you have to declare each structure with the proper flag:

declare -a eu_capitals_by_population # -a => array
declare -A population_of_eu_capitals # -A => hash

Then, you fill arrays or hashes with this syntax:

eu_capitals_by_population=('London' 'Berlin' 'Madrid' 'Rome')
population_of_eu_capitals=( [Berlin]="3.5M" [London]="7.4M" [Madrid]="3.2M" [Rome]="2.6M" )

You may also populate arrays or hashes dynamically, with the output of other scripts, Linux commands, or text files, as long as the data is in the necessary format. For example, use

eu_capitals_by_population=( $( ./some_other_script.sh ) )

to initialize the array of EU capitals with the output of some_other_script.sh. (Note the outer parentheses: without them, the variable would be assigned the whole output as one plain string.) The simplest way to load the current line of a plain text file you are reading in your script into $myarray is as follows:

read -a myarray <<< "$line"

(See the first installment of this series [1] and then references [3] and [4].)
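If you need more than the current line, the same read builtin can fill an array with a whole file, one line per element. Here is a minimal sketch; /tmp/capitals.txt is a made-up sample file, standing in for whatever text file you have:

```shell
#!/bin/bash
# Create a sample input file (stands in for any text file of yours)
printf 'London\nBerlin\nMadrid\nRome\n' > /tmp/capitals.txt

declare -a lines                 # one array element per line of the file
while IFS= read -r line; do
    lines+=("$line")             # append the current line to the array
done < /tmp/capitals.txt

echo "Read ${#lines[@]} lines; the first is ${lines[0]}"
```

IFS= and -r keep read from trimming whitespace or mangling backslashes, so each array element matches its line exactly.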

Another, perhaps faster, way to load values from files or scripts into a plain array is the built-in Bash command mapfile [5]. I will not cover mapfile here, partly because it is not very portable and partly because, quite frankly, I have never found myself compelled to use it in actual work. However, those reasons shouldn't stop you from checking out mapfile.
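For the record, its basic usage is a one-liner. This is a minimal sketch; it requires Bash 4+, and /tmp/capitals.txt is the same made-up sample file as above:

```shell
#!/bin/bash
printf 'London\nBerlin\nMadrid\nRome\n' > /tmp/capitals.txt

# -t strips the trailing newline from each line before storing it
mapfile -t myarray < /tmp/capitals.txt

echo "${myarray[1]}"
```

One call replaces the whole while-read loop, which is exactly why mapfile can be faster on big files.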

To use the current element of an array or hash, reference it as follows:

#> echo ${eu_capitals_by_population[3]}
Rome
#> CAPITAL='Berlin'
#> echo ${population_of_eu_capitals[$CAPITAL]}
3.5M

Instead of $CAPITAL, I could have used the string Berlin, with or without quotes, to achieve the same effect. However, for both arrays and hashes, do not forget the curly braces; otherwise, Bash will not interpret what is inside the square brackets properly!
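To see what actually goes wrong without the braces, compare the two forms yourself (a minimal sketch):

```shell
#!/bin/bash
declare -a eu_capitals_by_population=('London' 'Berlin' 'Madrid' 'Rome')

# Without braces, Bash expands $eu_capitals_by_population (i.e., its
# first element) and then tacks on the literal string "[3]":
without=$(echo $eu_capitals_by_population[3])

# With braces, the index inside the square brackets is honored:
with=$(echo ${eu_capitals_by_population[3]})

echo "$without vs $with"    # London[3] vs Rome
```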

To delete an element, just pass its identifier, formatted as above, to the unset command. Don't forget the index (or the key, if it is a hash): calling unset with just an array or hash name will erase that whole data structure. To quickly remove all and only the elements that match a certain pattern, use Bash's pattern substitution instead:

#> echo ${eu_capitals_by_population[@]}
London Berlin Madrid Rome
#> declare -a ifbrexit=( ${eu_capitals_by_population[@]/London/} )
#> echo ${ifbrexit[@]}
Berlin Madrid Rome

The $ifbrexit array is created by copying all the elements of $eu_capitals_by_population with the string London stripped out; since London was an entire element, the now-empty string simply disappears from the unquoted expansion.
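For completeness, here is how unset behaves on single elements and on whole structures (a minimal sketch):

```shell
#!/bin/bash
declare -a eu_capitals_by_population=('London' 'Berlin' 'Madrid' 'Rome')
declare -A population_of_eu_capitals=( [Berlin]="3.5M" [London]="7.4M" )

# Remove one array element by index (the quotes keep the shell from
# treating the square brackets as a glob pattern):
unset 'eu_capitals_by_population[1]'        # removes Berlin

# Remove one hash element by key:
unset 'population_of_eu_capitals[London]'

# Calling unset with just the name erases the whole structure:
unset population_of_eu_capitals

echo "${eu_capitals_by_population[@]}"      # London Madrid Rome
```

Note that removing index 1 does not renumber anything: the remaining indexes are 0, 2, and 3, a first taste of the sparse arrays discussed below.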

Once arrays or hashes have been created, you can add elements either singly or in bulk. Here is one example for hashes and two for plain arrays:

#> population_of_eu_capitals[Lisbon]='1M'
#> eu_capitals_by_population+=("Paris" "Bucharest")
#> eu_capitals_by_population[9]='Sofia'

As you can see, you can append multiple elements (or just one) to an array with a single command, or place a new element at whatever index you desire. An alternative way to append elements to an array is as follows:

eu_capitals_by_population=("${eu_capitals_by_population[@]}" "Paris" "Bucharest")

This is more flexible than the previous example, because it also may be used to prepend elements to an array or to merge multiple arrays into one.
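For example, prepending and merging work the same way (a minimal sketch with made-up array names):

```shell
#!/bin/bash
declare -a big=('London' 'Berlin')
declare -a small=('Madrid' 'Rome')

# Prepend one element by putting it before the old contents:
big=("Paris" "${big[@]}")

# Merge two arrays into a third one:
declare -a all=("${big[@]}" "${small[@]}")

echo "${all[@]}"    # Paris London Berlin Madrid Rome
```

The quoted "${big[@]}" form expands each element as a separate word, which is what keeps elements containing spaces intact during the copy.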

To list all of an array's elements, in order, or its existing indexes (adding ! in the right place), use the * or @ wildcards. If you are only interested in one slice of the array, specify the starting element and how many elements to fetch:

#> echo ${eu_capitals_by_population[*]}
London Berlin Madrid Rome Paris Bucharest Sofia
#> echo ${!eu_capitals_by_population[*]}
0 1 2 3 4 5 9
#> echo ${eu_capitals_by_population[@]:2:3}
Madrid Rome Paris

Using the same trick with a hash would return all of its values (or keys), in whatever order Bash stored them internally.
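With the hash from before, that looks like this (a minimal sketch; the order of the output may differ on your system):

```shell
#!/bin/bash
declare -A population_of_eu_capitals=( [Berlin]="3.5M" [London]="7.4M" [Madrid]="3.2M" [Rome]="2.6M" )

echo "${population_of_eu_capitals[@]}"    # all values, in internal order
echo "${!population_of_eu_capitals[@]}"   # all keys, in internal order
echo "${#population_of_eu_capitals[@]}"   # number of key-value pairs
```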

As far as arrays are concerned, please have another look at the second-to-last result, which is a direct consequence of what I did to the $eu_capitals_by_population array: First, I dumped four cities into the array, then I appended two more with one command, and finally I deliberately created a seventh one (Sofia) with a non-consecutive index. This brought the total number of elements of that array to seven, but without defining any indexes between 6 and 8. In programming courses, arrays whose indexes do not need to be consecutive are called sparse; it is up to you to be aware of this Bash array feature.

In general, omitting the index of an array makes Bash use only its first element. The first two commands of this sequence would both load into $UK_capital the string 'London'

#> UK_capital=$eu_capitals_by_population
#> UK_capital=${eu_capitals_by_population[0]}
#> echo ${#eu_capitals_by_population}
6
#> echo ${#eu_capitals_by_population[@]}
7

while the last two commands return the number of characters in 'London' and the number of elements in the whole array.

Processing Whole Arrays or Hashes

Complex data structures would be almost useless if one could not quickly loop through them and process one element at a time, with as little code as possible. In fact, the syntax to do just that is relatively simple and basically the same for both arrays and hashes. The only meaningful difference is in the output: Arrays are by default sorted, and consequently processed, from (numerically) lower to higher indexes. Hashes, instead, are processed in whatever order the shell stored each key-value pair internally. For example, this loop over the $population_of_eu_capitals hash

for city in "${!population_of_eu_capitals[@]}"
do
  echo "Population of $city is ${population_of_eu_capitals[$city]}"
done

returns just what one would expect, because the exclamation mark in the first line means "use all the keys of this hash." However, the lines are not sorted in any way:

Population of Berlin is 3.5M
Population of Madrid is 3.2M
Population of Rome is 2.6M
Population of London is 7.4M

I am going to show you how to get around this hash limitation in a moment, but first I want to show you one last trick I found online [6], which can be useful in many different situations.

Can you easily generate a hash that is the "reverse" of an existing hash (i.e., another hash with keys used as values and vice versa)? Of course you can! Just adapt the following code

declare -A eu_capital_with_population
for city in "${!population_of_eu_capitals[@]}"; do
    eu_capital_with_population[${population_of_eu_capitals[$city]}]=$city
done
for ppl in "${!eu_capital_with_population[@]}"; do printf "%s people live in %s\n" "$ppl" "${eu_capital_with_population[$ppl]}"; done

which, in our case, returns

7.4M people live in London
2.6M people live in Rome
3.2M people live in Madrid
3.5M people live in Berlin

Of course, this specific example is not especially helpful, because I could have generated the same output from the original hash. However, swapping keys with values can be useful in some situations. I'll now move on to more complex stuff and a complete, useful script.

Sorting and Multidimensional Arrays

Bash is very powerful, but when it comes to sorting arrays and hashes, especially in non-basic ways, it is no match for Perl (probably other languages, too). That said, I hope to prove that Bash is more than adequate for basic and not-so-basic data structure processing.

When you begin using arrays or hashes, two questions surely arise: How do you create (and process) multidimensional ones (e.g., hashes of hashes)? And how do you sort the keys of hashes when you loop through them?

The easiest (but not the only [7]) answer to both questions is to cheat.

You can emulate hashes of hashes by using ordinary hashes with "composite" keys and sort the result of looping over a hash after it is finished. Both tricks are shown in a script that follows.
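A minimal sketch of the composite-key trick, with made-up names and counts, looks like this:

```shell
#!/bin/bash
declare -A photos_in    # one flat hash emulating a hash of hashes

# "year place" composite keys stand in for the missing nesting:
photos_in["1989 Greece"]=49
photos_in["1989 home"]=25
photos_in["1990 Sardegna"]=81

# Retrieve one "nested" value by rebuilding the composite key:
year=1989; place=Greece
echo "${photos_in[$year $place]}"    # 49
```

As long as the separator (here, a space) never appears inside the individual key parts, the composite key is unambiguous.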

A Photo Archive Statistics Generator

Years ago, I became the de facto maintainer of the complete, unified archive of digital photographs (Figure 1) ever taken by my entire extended family (this is what you get when relatives discover you are a Linux geek!). In order to make sense of some tens of thousands of files, I needed summaries of how many photographs were already geotagged, which places were already included in the archive, and which places had the most photographs. Listing 1 shows partial output of the script I put together to scratch this itch. The script itself is shown in Listing 2.

Listing 1

Partial Output of the Script in Listing 2

Distribution of photos By year:
Total photos taken in 1989:  125
  Grecia, Atene, Acropoli   :   39
  Grecia, Capo Sunio        :   20
...
Total photos taken in 1990:  256
  Sardegna, Isola Rossa     :   88
  Roccacalascio             :   65
...
Total photos taken in 1998:   24
  San Francisco             :   24
Total photos taken in 1999:   40
  home, sweet home          :   23
  in-laws home              :   17
Total photos taken in 2000:  396
  in-laws home              :  111
  home, sweet home          :  106
...
Total photos taken in 2001:  184
  home, sweet home          :   65
  Sardegna, Cannigione      :   57
  parents home              :   31
  in-laws home              :   19
  kids daycare              :   12
10 most photographed places from 1989 to 2001 :
home, sweet home            :  197
in-laws home                :  147
Sardegna, Isola Rossa       :   88
...
By place, alphabetic order:
Abruzzo National Park       :    5
Calcata Vecchia             :    2
Campo Ceraso, Volubro       :   37
....
Photos not geotagged yet: 1

Listing 2

Photo Archive Statistics Generator

01 #! /bin/bash
02
03 declare -A coordinates_of
04 declare -A name_of
05 declare -a photos_per_year
06 declare -A photos_per_place
07 declare -A photos_per_year_in_place
08
09 PHOTODIR=/home/z/photo/marco
10 TMPDIR=/tmp/photostats
11
12 cd $PHOTODIR
13
14 for summary in  `find . -type f -name ".tags.summary*" | cut -c3- | sort  `
15 do
16   d=`dirname $summary`
17   y=`echo $d | cut -d/ -f1`
18   n=`basename $summary | cut -c2-`
19   cut '-d|' -f1,4  $summary | sed -e 's/ *$//g'| grep -v '|$' | grep -v ^# | sort | uniq > $TMPDIR/$n
20
21   while IFS= read -r  line
22   do
23   photoname=`echo $line | cut '-d|' -f1 | sed -e 's/ //g'`
24   photoplace=`echo $line | cut '-d|' -f2 | sed -e 's/^ *//g'`
25   gps=`exiftool $d/$photoname | grep '^GPS Position' | sed -e 's/^.*: //'`
26   coordinates_of[$photoplace]=$gps
27   name_of[$gps]=$photoplace
28   done < $TMPDIR/$n
29
30   for photo in `find $PHOTODIR/$d -type f -iname "*jpg"`
31   do
32   let "total_photos++"
33   echo "Processing photo n. $total_photos"
34   current_position=`exiftool $photo | grep '^GPS Position' | cut -d: -f2 | cut -c2-`
35   if [[ ! -z "${current_position// }" && ! -z "${name_of[$current_position]}"  ]]
36   then
37     let "photos_per_year[$y]++"
38     let "photos_per_place[${name_of[$current_position]}]++"
39     let "photos_per_year_in_place[$y ${name_of[$current_position]}]++"
40   else
41       let "not_geotagged_photos++"
42   fi
43   done
44 done
45
46 printf "\n\nDistribution of photos\n\nBy year:\n\n"
47
48 for year in "${!photos_per_year[@]}"
49 do
50   printf "\nTotal photos taken in $year: %4.4s\n\n" ${photos_per_year[$year]}
51   for place in "${!photos_per_place[@]}"
52   do
53   printf "\t\t%-45.45s : %4.4s\n" "$place" ${photos_per_year_in_place[$year $place]}
54   done | sed -e 's/: *$//' | grep ':' | sort -t ':' -k2,2rn
55 done
56
57 printf "\n\n10 most photographed places from %s to %s  :\n\n" \
58     `echo ${!photos_per_year[*]} | sed 's/\s.*$//'`    \
59     `echo ${!photos_per_year[*]} | sed 's/^.*\s//'`
60
61 for place in "${!photos_per_place[@]}"
62 do
63   printf "%-45.45s : %4.4s\n" "$place" ${photos_per_place[$place]}
64 done | sort -t ':' -k2,2rn | head -10
65
66 printf "\n\nBy place, alphabetic order:\n\n"
67
68 for place in "${!photos_per_place[@]}"
69 do
70   printf "%-45.45s: %4.4s\n" "$place" ${photos_per_place[$place]}
71 done | sort
72
73 printf "\nPhotos not geotagged yet: $not_geotagged_photos\n\n"
74 exit
Figure 1: Digital photography is great, but it brings so many files, with so much metadata, into your computer that GUI applications alone have a hard time aggregating and presenting the information in useful ways.

Before explaining the code, please take note of the hidden, but crucial message in Listing 1: Using arrays and other Bash script features allows you to quickly extract quantitative and qualitative data from very large, messy files of all kinds, in formats that greatly facilitate further processing. It would be easy, for example, to rearrange the numbers above as one .csv file that could be used to generate charts. And you could extract with the same techniques any number or text, from stock market quotes to book titles.

The code for the photo archive statistics generator script in Listing 2 is simple to understand – once you know how digital pictures are geotagged and how I catalog them on my computer (Figure 2). Here is the bare minimum you need to know to keep going.

Figure 2: digiKam knows that each geotagged digital file includes its own GPS coordinates, in plain text format. However, it takes a shell script like Listing 2 to use that data in any meaningful way.

GPS coordinates are stored inside JPG files in the following format, which can be read with the exiftool program:

GPS Position : 37 deg 58' 18.02" N, 23 deg 43' 33.76" E

In any folder I have already geotagged, I keep a plain text index called .tags.summary that lists each picture's filename, author, and place, together with other data, in pipe-separated fields:

19980110110000.jpg | = | Marco|San Francisco| other fields here...

In other words, the files only contain coordinates, which I need to map to human-readable place names that are only found in the indexes. Moving on to the script in Listing 2, the first 10 lines declare all the hashes and arrays that are needed to store data. The loop in lines 15 to 44 finds all the folders that contain indexes (i.e., those folders already geotagged) and does three things.

First (lines 16 to 19), it saves into a temporary file named after the folder subject ($TMPDIR/$n) only the index fields that correspond to the name and place of each picture, in this format:

19891010110000.jpg | Grecia, Atene, Acropoli

This is what happens in line 19, while line 17 extracts the year value from the folder name. The purpose of lines 21 to 28, which read the newly created temporary file, is to load into the complementary hashes, $coordinates_of and $name_of, each place's GPS coordinates and name. To get $photoplace, I split the current $line of the file using a pipe as a separator (cut '-d|'); to get the GPS coordinates (line 25), I use grep to filter the line starting with GPS Position from the output of exiftool and remove the initial string with sed. In this way, $name_of and $coordinates_of will be filled with values like these (using bogus coordinates for brevity!):

coordinates_of['San Francisco']='37 deg N, 75 deg W'
name_of['37 deg N, 75 deg W']='San Francisco'

The loop in lines 30 to 43 completes the gathering of raw data by looking again at all pictures in the current folder (line 30). Line 32 increases the $total_photos counter. Line 34 saves into $current_position the GPS coordinates of the current picture, in the same format used to fill the $name_of hash.

Line 35 checks that $current_position is not an empty string (i.e., that the current photo did contain a GPS Position field) and that such a position already exists in $name_of. If that is not the case, the counter of photos not yet geotagged is incremented (line 41). Otherwise (lines 37-39), I increase the counters of photos taken in the current $year and in the current place (line 38) and then resort to the first dirty trick I mentioned above. I need a hash of hashes as follows:

YEAR: 1989:
    Pictures in
      Greece: 49
      home:   25
YEAR: 1990:
    Pictures in
      Sardegna: 81
      home: 54
etc...

To emulate it in Bash, I create a standard hash, but with composite keys like "1989 home" (line 39). As dirty as it is, it does the job. Once all folders have been scanned, everything from line 46 to the end is pretty printing of the totals previously stored in the several hashes. Only a few lines need a thorough explanation. Line 54 is where, looping over both the indexes of $photos_per_year and the keys of $photos_per_place, I use the "composite key" trick to retrieve, from $photos_per_year_in_place, the total number of pictures taken during $year in each $place. Lines 54, 64, and 71 all show how to sort the output of a loop over hash keys: You feed it to the sort command, purging and filtering with sed and grep if needed. Please see sort's man page [8] to see how the -k option does the numeric sorting.
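That loop-into-sort pattern is easy to try in isolation (a minimal sketch with made-up place names and counts):

```shell
#!/bin/bash
# Made-up counts, just to exercise the pattern from lines 54, 64, and 71
declare -A photos_per_place=( [home]=197 [beach]=88 [office]=12 )

# Piping the whole loop into sort orders the lines numerically and in
# reverse (-rn) on the second colon-separated field (-t ':' -k2,2):
sorted=$(for place in "${!photos_per_place[@]}"; do
    printf "%s:%s\n" "$place" "${photos_per_place[$place]}"
done | sort -t ':' -k2,2rn)

echo "$sorted"
```

Since the hash itself has no usable order, sorting its formatted output is the simplest workaround.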

Last, but not least, the arguments of the printf command in lines 58 and 59: The problem here is to find the smallest and biggest indexes of the sparse array $photos_per_year. You already know by now that writing ${!photos_per_year[*]} returns one string like "1989 1990 2001", containing ordered but not consecutive numbers. Feeding that string to sed removes everything after the first space (line 58) or everything before the last space (line 59), and that is how I generate the "from 1989 to 2001" part of the script output (Listing 1). Wasn't that fun? Happy hacking!
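The same sed trick can be tested on its own (a minimal sketch; note that \s is GNU sed syntax):

```shell
#!/bin/bash
# A stand-in for the ordered, sparse index list of $photos_per_year
years="1989 1990 1998 2001"

first=$(echo $years | sed 's/\s.*$//')   # drop everything after the first space
last=$(echo $years | sed 's/^.*\s//')    # drop everything up to the last space

echo "from $first to $last"
```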

The Author

Marco Fioretti (http://mfioretti.com) is a freelance author, trainer, and researcher based in Rome, Italy. He has been working with free/open source software since 1995 and on open digital standards since 2005. Marco also is a Board Member of the Free Knowledge Institute (http://freeknowledge.eu).
