## Hashes, salt, and pepper

# Salt and Pepper

Cryptographic hash functions help you protect your passwords, but hashing is only secure if properly understood.

Hash functions are an integral part of computer science – and not just with databases and checksums. Hashes were originally intended for storing data efficiently in memory, but the hashing concept has evolved into a technique for securely storing passwords.

Linux writes the password hash values to the `/etc/shadow`

file, which you can only read if you have root privileges. But even if you have the root password, you'll find it difficult to learn any useful access information. The function used to store the password hash values in `etc/shadow`

is a one-way function, which means you can't work backward from the hash value to create the original password – at least in theory. As you'll learn in this article, attackers still sometimes manage to crack these supposedly irreversible hash functions.

#### What is a Hash?

The idea of a hash is simple: An address is calculated from the value that is to be stored. Suppose, for example, you need to store the four user names *Fritz*, *Laempel*, *Max*, and *Moritz*. A hash function would calculate a numeric value from these names.

The simplest option is to use letters in the alphabet (A=1, B=2, etc.), which the program then uses to form a sum of the digits from the letter values in the name (Table 1). The hash value must be in this range because only a limited number of memory locations are available.

Table 1

Simple Hash Function

Value | Sum | Hash |
---|---|---|

Fritz |
79 |
2 |

Laempel |
64 |
1 |

Max |
38 |
3 |

Moritz |
101 |
3 |

The example in Table 1 has seven available memory locations (from 0 to 6). The hash value is therefore determined by the sum modulo 7. Anyone who doesn't fancy computing the numerical values for *Fritz*, *Laempel*, *Max*, and *Moritz* manually can use Bash with the help of Perl, as shown in Listing 1.

Listing 1

Bash Hash

Each name first ends up in the `name`

variable via the `for`

loop; `echo`

sends the name through a pipe to `tr`

, which replaces all lowercase and uppercase letters and passes the result to a Perl script. The Perl script then splits the character string into individual letters and outputs its ASCII value separated by the `- 64 +`

character string. This gives A the value 1, B the value 2, and so on. Anyone who wants to see this happening live should enter `set -x`

. To terminate the function again use `set +x`

(Figure 1).

In this example, Max and Moritz have the same hash value. If you imagine memory as an array, you can only accommodate one entry in each slot and will be unable to handle the task at hand. Theoretical computer science therefore doesn't consider the hash value to be a physical slot, but rather a number of a bucket that can accommodate several entries.

With open hashing (Figure 2), Max and Moritz could live in one bucket because each bucket can contain any number of entries. However, the number of entries is limited with closed hashing.

An array of pointers, which can (in principle) contain an unlimited number of elements, is a suitable data structure for open hashing. At each index of the array a linked list to the elements of this bucket is kept. Open hashing is always useful if the size of the array is comparable to the number of elements, *n*, for all the values you want to store, because then, on average, only one entry per index is stored, allowing for quick access.

The hash function requires a constant time O(1) to calculate the hash value. However, the time it takes to wade through the list in each bucket increases as a linear function of the number of elements. The worst case scenario is that all *n* elements end up in one bucket and the required access time is now O(*n*). In that case, this method offers no advantage over a singly linked list.

The hash function will ideally distribute the number of values equally across all *m* indices [O(*n*/*m*)], which makes finding elements faster than with a singly linked list. Such a structure is particularly useful if you're looking for a specific value from a large amount of data. Closed hashing limits the number of entries in a bucket and thereby prevents long chains.

#### Keep a Lid on It

Many real implementations draw the line for the maximum number of entries per bucket at 1; however, higher values are also theoretically possible. The simplest implementation is a two-dimensional array with *n* number of rows (=buckets) and *m* number of columns for the maximum number of elements per bucket. However, this solution isn't ideal for memory usage.

Collisions are now no longer irrelevant; you still have to check whether the bucket has room for another value.

Some methods provide a new hashing (called rehashing) with another hash function as the solution. This rehashing is repeated until no more collisions occur. The worst case scenario is that you'll need a lot of new hash functions, which give rise to different results to find a free space.

The simplest hash functions always add 1 to the previous hash value; *h*[ ] (*h* subscript 0) is the original function, *n* is the number of buckets, and *i* is the *i*th rehashing attempt:

This process is called linear probing and is a bit like a visit to an Oktoberfest tent: People go from table to table until they find an empty seat. This process causes block formations. In places where everything is already occupied, the next table is likely to be occupied too.

Therefore refinements alternate between looking above and below. Square probing using *i*^2 (*i* squared) instead of *i* is a possible alternative. Collisions are equally likely with the first hash, but the likelihood decreases with rehashing.

Rehashing leads to another problem: Anyone looking for an entry in memory later (Max for example) just needs to apply the function to the name and will then receive the memory address. However, if nothing is stored at this address, you need to keep rehashing until you find Max or an empty or only partially filled bucket, which would mean that Max wasn't saved.

If an entry has been deleted, you need to set a marker at this position so the search is terminated prematurely. The memory space should be less than 80 percent full to keep the probability of collisions low enough.

#### Adding Security

Cryptography places higher demands on hash functions. A hash function is considered secure if it isn't possible to recreate the output value from the hash value – at least not in an acceptable time. In other words, the function must not be reversible. The probability of finding a second output value for the given hash and the probability that any two output values lead to the same hash value should be very low.

The last two stipulations sound similar, but they differ in a way that is best illustrated with the birthday paradox: Suppose Max is at a party and asks the other guests if anyone's birthday is on the same day as his. An affirmative response is relatively unlikely. If, on the other hand, he wanted to know if any two arbitrary partygoers' birthdays are on the same day, the probability is more than 50 percent, even with as few as 23 people at the party. For the first scenario, 183 people would be needed to get such a value. So, from a cryptographic point of view, birthdays are a bad hash value, because a very large number of people (7 billion) map to a very small range of values (365 days).

The range of values for a cryptographically secure hash function is a fundamental part of the design. MD5, for example, uses 128 bits (2^128, i.e., about 3 x 10^38 different values). SHA-1 uses 160 bits (10^48 different values), SHA-256 as many as 256 bits (10^77). A function is considered to be collision-resistant if the probability of a collision is less than 2^(*n*/2) with a hash value of *n* bits.

Like all hash processes, the MD5 algorithm first converts a message to the appropriate length. An integer multiple of 512 bits is required. Padding works in such a way that 1 is appended to the message as the first bit, followed by zeros, until 64 bits remain free at the end. The message length is therefore divisible by 512. A 64-bit value appended to the end (in binary-coded form) specifies how long the original message was [1].

The algorithm then splits the message up into 512-bit blocks and splits these blocks, in turn, into 32-bit blocks. The hash values are then calculated for these blocks, also known as words. Four words are always linked to each other logically four times in different ways, moved bit-by-bit or added. The result is four 32-bit words: A, B, C, and D. The algorithm adds them to the previously calculated values A, B, C, and D. This combination of A, B, C, and D after each round is called a *status vector*.

This continues until the entire message is processed. The MD5 value is then A, B, C, and D in a row. A, B, C, and D receive a fixed output value – the initialization vector – before the calculation.

## Buy this article as PDF

(incl. VAT)