Hashing And Hash Tables

Table of Contents

Hashing and Hash Functions

Suppose you wanted a short but unique identifier for a file or dataset. That's the idea behind hash…

A hash function takes data (like a string, or a file's contents) and outputs a hash, a fixed-size string number.

For example, here's the MD5 hash (MD5 is a common hash function) for a file simply containing "cake":

DF7CE038E2FA96EDF39206F898DF134D

And here's the hash for the same file after it was edited to be "cakes":

0E9091167610558FDAE6F69BD6716771

Notice the hash is completely different, even though the files were similar. Here's the hash for a long film I have on my hard drive:

664f67364296d08f31aec6fea4e9b83f

The hash is the same length as my other hashes, but thistime it represents a much bigger file – 461Mb.

We can think of a hash as a "fingerprint." We can trust that a given file will always have the same hash, but we can't go from the hash back to the original file. Sometimes we have to worry about multiple files having the same hash value, which is called a hash collision .

Some uses for hashing:

  1. Dictionaries. Suppose we want a list-like data structure with constant-time lookups, but we want to look up values based on arbitrary "keys," not just sequential "indices." We could allocate a list, and use a hash function to translate keys into list indices. That's the basic idea behind a dictionary!
  2. Preventing man-in-the-middle attacks. Ever notice those things that say "hash" or "md5" or "sha1" on download sites? The site is telling you, "We hashed this file on our end and got this result. When you finish the download, try hashing the file and confirming you get the same result. If not, your internet service provider or someone else might have injected malware or tracking sofrware into your download!"

Hash Table

Quick reference

A hash table organizes data so you can quickly look up values for a given key.

Strengths:

  • Fast lookups. Lookups take O(1) time on average.
  • Flexible keys. Most data types can be used for keys, as long as they're hashable.

Weaknesses

  • Slow worst-case lookups. Lookups take O(n) time in the worst case.
  • Unordered. Keys aren't stored in a special order. If you're looking for the smallest key, the largest key, or all the keys in a range, you'll need to look through every key to find it.
  • Single-directional lookups. While you can look up the value for a given key in O(1) time, looking up the keys for a given value requires looping through the whole dataset – O(n) time.
  • Not cache-friendly. Many hash table implementations use linked lists, which don't put data next to each other in memory.
  Average Worst Case
space O(n) O(n)
insert O(1) O(n)
lookup O(1) O(n)
delete O(1) O(n)

In Python 3.6

In Python 3.6, hash tables are called dictionaries.

light_bulb_to_hours_of_light - {
    'incandescent': 1200,
    'compact fluorescent': 10000,
    'LED': 50000,
}

Hash maps are built on arrays

Arrays are pretty similar to hash maps already. Arrays let you quickly look up the value for a given "keys"…except the keys are called "indices," and we don't get to pick them – they're always sequential integers (0, 1, 2, 3, etc).

Think of a hash map as a "hack" on top of an array to let us use flexible keys instead of being stuck with sequential integer "indices."

All we need is a function to convert a key into an array index (an integer). That function is called a hashing function.

To look up the value for a given key, we just run the key through our hashing function to get the index to go to in our underlying array to grab the value.

How does that hashing function work? There are a few different approaches, and they can get pretty complicated. But here's a simple proof of concept:

Grab the number value for each character and add those up.

The result is 429. But what if we only have 30 slots in our array? we'll use a common trick for forcing a number into a specific range: the modulus operator (%). Modding our sum by 30 ensures we get a whole number that's less than 30 (and at least 0):

429 % 30 = 9

The hashing functions used in modern systems get pretty complicated – the one we used here is a simplified example.

Hash collisions

What if two keys hash to the same index in our array? In our example above, look at "lies" and "foes":

They both sum up to 429! So of course they'll have the same answer when we mod by 30:

429 % 30 = 9

This is called a hash collision . There are a few different strategies for dealing with them.

Here's a common one: instead of storing the actual values in our array, let's have each array slot hold a pointer to a linked list holding the values for all the keys that hash to that index:

Notice that we included the keys as well as the values in each linked list node. Otherwise we wouldn't know which key was for which value!

There are other ways to deal with hash collisions. This is just one of them.

When hash table operations cost O(n) time

1. Hash collisions

If all our keys caused hash collisions, we'd be at risk of having to walk through all of our values for a single lookup (in the example above, we'd have one big linked list). This is unlikely, but it could happen. That's the worst case.

2. Dynamic array resizing

Suppose we keep adding more items to our hash map. As the number of keys and values in our hash map exceeds the number of indices in the underlying array, hash collisions become inevitable.

To mitigate this, we could expand our underlying array whenever things start to get crowded. That requires allocating a larger array and rehashing all of our existing keys to figure out their new position – O(n) time.

Sets

A set is like a hash map except it only stores keys, without values.

Sets often come up when we're tracking groups of items – nodes we've visited in a graph, characters we've seen in a string, or colors used by neighboring nodes. Usually, we're interested in whether something is in a set or not.

Sets are usually implemented very similarly to hash maps – using hashing to index into an array – but they don't have to worry about storing values alongside keys. In Python, the set implementation is largely copied from the dictionary implementation.

light_bulbs = set()

light_bulbs.add('incandescent')
light_bulbs.add('compact fluorescent')
light_bulbs.add('LED')

'LED' in light_bulbs            # True
'halogen' in light_bulbs        # False

Date: 2020-01-30 Thu 17:09

Author: Jack Liu

Created: 2020-02-08 Sat 21:26

Validate