Add first draft
jamesroutley committed Aug 23, 2017
1 parent b75c360 commit a9dfe15
Showing 8 changed files with 796 additions and 0 deletions.
76 changes: 76 additions & 0 deletions 01-introduction/README.md
@@ -0,0 +1,76 @@
# Introduction

A hash table is a data structure which offers a fast implementation of the
associative array [API](#api). As the terminology around hash tables can be
confusing, I've added a summary [below](#terminology).

A hash table consists of an array of 'buckets', each of which stores a key-value
pair. In order to locate the bucket where a key-value pair should be stored, the
key is passed through a hashing function. This function returns an integer which
is used as the pair's index in the array of buckets. When we want to retrieve a
key-value pair, we supply the key to the same hashing function, receive its
index, and use the index to find it in the array.

Array indexing has algorithmic complexity `O(1)`, making hash tables fast at
storing and retrieving data.
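
As a rough illustration of the lookup path described above, here is a minimal sketch. The names `toy_hash` and `bucket_index` are hypothetical, and summing character codes is only a stand-in for the real hash function developed later in this tutorial.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in hash: sums the character codes of the key.
// This is NOT the hash function used later; it only illustrates the idea.
static unsigned long toy_hash(const char* key) {
    unsigned long sum = 0;
    for (size_t i = 0; key[i] != '\0'; i++) {
        sum += (unsigned char)key[i];
    }
    return sum;
}

// Maps a key to a bucket index in the range 0 .. num_buckets - 1.
int bucket_index(const char* key, int num_buckets) {
    assert(num_buckets > 0);
    return (int)(toy_hash(key) % (unsigned long)num_buckets);
}
```

With 53 buckets, `bucket_index("cat", 53)` evaluates to `(99 + 97 + 116) % 53 = 47`, so the pair for `"cat"` would live at index 47. Storing and retrieving both go straight to that slot instead of scanning the array.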

Our hash table will map string keys to string values, but the principles
given here are applicable to hash tables which map arbitrary key types to
arbitrary value types. Only ASCII strings will be supported, as supporting
Unicode is non-trivial and out of scope of this tutorial.

## API

Associative arrays are a collection of unordered key-value pairs. Duplicate keys
are not permitted. The following operations are supported:

- `search(a, k)`: return the value `v` associated with key `k` from the
associative array `a`, or `NULL` if the key does not exist.
- `insert(a, k, v)`: store the pair `k:v` in the associative array `a`.
- `delete(a, k)`: delete the `k:v` pair associated with `k`, or do nothing if
`k` does not exist.

## Setup

To set up C on your computer, please consult [Daniel Holden's](@orangeduck)
guide in the [Build Your Own
Lisp](http://www.buildyourownlisp.com/chapter2_installation) book. Build Your
Own Lisp is a great book, and I recommend working through it.

## Code structure

Code should be laid out in the following directory structure.

```
.
├── build
└── src
    ├── hash_table.c
    ├── hash_table.h
    ├── prime.c
    └── prime.h
```

`src` will contain our code, and `build` will contain our compiled binaries.

## Terminology

There are lots of names which are used interchangeably. In this article, we'll
use the following:

- Associative array: an abstract data structure which implements the
[API](#api) described above. Also called a map, symbol table or
dictionary.

- Hash table: a fast implementation of the associative array API which makes
use of a hash function. Also called a hash map, map, hash or
dictionary.

Associative arrays can be implemented with many different underlying data
structures. A (non-performant) one can be implemented by simply storing items in
an array, and iterating through the array when searching. Associative arrays and
hash tables are often confused because associative arrays are so often
implemented as hash tables.
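
To make the non-performant approach concrete, here is a minimal sketch of an associative array backed by a plain array with linear search. Everything in it is hypothetical and not part of the hash table built in this tutorial: the `la_` names, the fixed `LA_CAPACITY`, and the choice to overwrite the value when an existing key is inserted again. It stores pointers to the caller's strings rather than copying them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LA_CAPACITY 16  // hypothetical fixed capacity for this sketch

typedef struct {
    const char* key;
    const char* value;
} la_pair;

typedef struct {
    la_pair pairs[LA_CAPACITY];
    int count;
} la_array;

// Returns the position of key k, or -1 if it is absent. This is an O(n) scan.
static int la_find(const la_array* a, const char* k) {
    assert(a != NULL && k != NULL);
    for (int i = 0; i < a->count; i++) {
        if (strcmp(a->pairs[i].key, k) == 0) {
            return i;
        }
    }
    return -1;
}

// search(a, k): the value stored for k, or NULL if k is absent.
const char* la_search(const la_array* a, const char* k) {
    const int i = la_find(a, k);
    return i >= 0 ? a->pairs[i].value : NULL;
}

// insert(a, k, v): stores k:v, overwriting any existing value for k.
// Returns 0 on success, or -1 if the array is full.
int la_insert(la_array* a, const char* k, const char* v) {
    const int i = la_find(a, k);
    if (i >= 0) {
        a->pairs[i].value = v;
        return 0;
    }
    if (a->count == LA_CAPACITY) {
        return -1;
    }
    a->pairs[a->count].key = k;
    a->pairs[a->count].value = v;
    a->count++;
    return 0;
}

// delete(a, k): removes k if present by moving the last pair into its slot.
void la_delete(la_array* a, const char* k) {
    const int i = la_find(a, k);
    if (i >= 0) {
        a->pairs[i] = a->pairs[a->count - 1];
        a->count--;
    }
}
```

Every operation here calls `la_find`, so each one costs `O(n)` in the number of stored pairs. That scan is the cost a hash table avoids by computing the index directly from the key.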

Next section: [Hash table structure](/hash-table)
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)
104 changes: 104 additions & 0 deletions 02-hash-table/README.md
@@ -0,0 +1,104 @@
# Hash table structure

Our key-value pairs (items) will each be stored in a `struct`:

```c
// hash_table.h
typedef struct ht_item {
    char* key;
    char* value;
} ht_item;
```

Our hash table stores an array of pointers to items, and some details about its
size and how full it is:

```c
// hash_table.h
typedef struct {
    int size;
    int count;
    ht_item** items;
} ht_hash_table;
```

## Initialising and deleting

We need to define an initialisation function for `ht_item`s. This function
allocates a chunk of memory the size of an `ht_item`, and saves a copy of the
strings `k` and `v` in the new chunk of memory. The function is marked as
`static` because it will only ever be called by code internal to the hash table.

```c
// hash_table.c
#include <stdlib.h>
#include <string.h>

#include "hash_table.h"

static ht_item* ht_new_item(const char* k, const char* v) {
    ht_item* i = malloc(sizeof(ht_item));
    i->key = strdup(k);
    i->value = strdup(v);
    return i;
}
```

`ht_new` initialises a new hash table. `size` defines how many items we can
store. This is fixed at 53 for now. We'll expand this in the section on
[resizing](/resizing). We initialise the array of items with `calloc`, which
fills the allocated memory with `NULL` bytes. A `NULL` entry in the array
indicates that the bucket is empty.

```c
// hash_table.c
ht_hash_table* ht_new() {
    ht_hash_table* ht = malloc(sizeof(ht_hash_table));
    ht->size = 53;
    ht->count = 0;
    ht->items = calloc((size_t)ht->size, sizeof(ht_item*));
    return ht;
}
```
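
The version above assumes that `malloc` and `calloc` always succeed. As a hedged sketch, a more defensive variant could return `NULL` when an allocation fails. `ht_new_checked` is a hypothetical name, not part of the tutorial's code, and the struct definitions are repeated only so the example stands on its own.

```c
#include <stdlib.h>

typedef struct ht_item {
    char* key;
    char* value;
} ht_item;

typedef struct {
    int size;
    int count;
    ht_item** items;
} ht_hash_table;

// Hypothetical variant of ht_new that reports allocation failure by
// returning NULL instead of dereferencing a failed allocation.
ht_hash_table* ht_new_checked(void) {
    ht_hash_table* ht = malloc(sizeof(ht_hash_table));
    if (ht == NULL) {
        return NULL;
    }
    ht->size = 53;
    ht->count = 0;
    // calloc zero-fills the array, so every bucket starts out empty.
    ht->items = calloc((size_t)ht->size, sizeof(ht_item*));
    if (ht->items == NULL) {
        free(ht);  // avoid leaking the table if the bucket array fails
        return NULL;
    }
    return ht;
}
```

A caller would then check the returned pointer before using the table.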

We also need functions for deleting `ht_item`s and `ht_hash_table`s, which
`free` the memory we've allocated, so we don't cause [memory
leaks](https://en.wikipedia.org/wiki/Memory_leak).

```c
// hash_table.c
static void ht_del_item(ht_item* i) {
    free(i->key);
    free(i->value);
    free(i);
}


void ht_del_hash_table(ht_hash_table* ht) {
    for (int i = 0; i < ht->size; i++) {
        ht_item* item = ht->items[i];
        if (item != NULL) {
            ht_del_item(item);
        }
    }
    free(ht->items);
    free(ht);
}
```

We have written code which defines a hash table, and lets us create and destroy
one. Although it doesn't do much at this point, we can still try it out.

```c
// main.c
#include "hash_table.h"

int main() {
    ht_hash_table* ht = ht_new();
    ht_del_hash_table(ht);
}
```

Next section: [Hash functions](/hashing)
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)
93 changes: 93 additions & 0 deletions 03-hashing/README.md
@@ -0,0 +1,93 @@
# Hash function

In this section, we'll write our hash function.

The hash function we choose should:

- Take a string as its input and return a number between `0` and `m - 1`, where
  `m` is our desired bucket array length.
- Return an even distribution of bucket indexes for an average set of inputs. If
our hash function is unevenly distributed, it will put more items in some
buckets than others. This will lead to a higher rate of
[collisions](#collisions). Collisions reduce the efficiency of our hash table.

## Algorithm

We'll make use of a generic string hashing function, expressed below in
pseudocode.

```
function hash(string, a, num_buckets):
    hash = 0
    string_len = length(string)
    for i = 0, 1, ..., string_len - 1:
        hash += (a ** (string_len - (i+1))) * char_code(string[i])
    hash = hash % num_buckets
    return hash
```

This hash function has two steps:

1. Convert the string to a large integer
2. Reduce the size of the integer to a fixed range by taking its remainder `mod`
`m`

The variable `a` should be a prime number larger than the size of the alphabet.
We're hashing ASCII strings, whose alphabet has 128 characters, so we should
choose a prime larger than that.

`char_code` is a function which returns an integer which represents the
character. We'll use ASCII character codes for this.

Let's try the hash function out:

```
hash("cat", 151, 53)
hash = (151**2 * 99 + 151**1 * 97 + 151**0 * 116) % 53
hash = (2257299 + 14647 + 116) % 53
hash = 2272062 % 53
hash = 5
```

Changing the value of `a` gives us a different hash function.

```
hash("cat", 163, 53) = 21
```

## Implementation

```c
// hash_table.c
#include <math.h>

static int ht_hash(const char* s, const int a, const int m) {
    long hash = 0;
    const int len_s = (int)strlen(s);
    for (int i = 0; i < len_s; i++) {
        hash += (long)pow(a, len_s - (i+1)) * s[i];
        // Reducing inside the loop is mathematically equivalent to a single
        // reduction after the loop, and keeps the running total small.
        hash = hash % m;
    }
    return (int)hash;
}
```

Review comment from @melko (Aug 23, 2017) on the `hash = hash % m;` line:
From the pseudocode written before, this should be outside the for loop

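
As an alternative sketch, the same polynomial can be evaluated using integer arithmetic only, via Horner's rule. This is a swapped-in technique, not the tutorial's implementation, and `ht_hash_horner` is a hypothetical name. It avoids the floating-point `pow` call and reduces modulo `m` on every step, which shows that taking the remainder inside the loop gives the same result as taking it once at the end.

```c
#include <assert.h>

// Hypothetical integer-only variant of ht_hash using Horner's rule.
// Computes (s[0]*a^(n-1) + ... + s[n-1]*a^0) mod m without calling pow.
// Assumes a * m fits in a long.
int ht_hash_horner(const char* s, const int a, const int m) {
    assert(a > 0 && m > 0);
    long hash = 0;
    for (int i = 0; s[i] != '\0'; i++) {
        // Multiply the running total by a, add the next character code,
        // then reduce so the value stays below m.
        hash = (hash * a + (unsigned char)s[i]) % m;
    }
    return (int)hash;
}
```

For the worked example above, `ht_hash_horner("cat", 151, 53)` evaluates to `5`: the running total goes `46`, then `47`, then `5`.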
## Pathological data

An ideal hash function would always return an even distribution. However, for
any hash function, there is a 'pathological' set of inputs, which all hash to
the same value. To find this set of inputs, run a large set of inputs through
the function. All inputs which hash to a particular bucket form a pathological
set.

The existence of pathological input sets means there are no perfect hash
functions for all inputs. The best we can do is to create a function which
performs well for the expected data set.

Pathological inputs also pose a security issue. If a hash table is fed a set of
colliding keys by some malicious user, then searches for those keys will take
much longer (`O(n)`) than normal (`O(1)`). This can be used as a denial of
service attack against systems which are underpinned by hash tables, such as DNS
and certain web services.
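
The search for a pathological set can be sketched as follows: hash every candidate key and count those that land in one chosen bucket. The names `sketch_hash` and `count_in_bucket` are hypothetical. `sketch_hash` evaluates the same polynomial string hash as this section, but with integer arithmetic, and any other hash function could be substituted.

```c
#include <assert.h>
#include <stddef.h>

// Polynomial string hash, reduced modulo m at every step.
// (Hypothetical helper for this sketch.)
static int sketch_hash(const char* s, int a, int m) {
    long hash = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        hash = (hash * a + (unsigned char)s[i]) % m;
    }
    return (int)hash;
}

// Counts how many of the n keys hash to the given bucket.
// Keys that share a bucket form a colliding (pathological) set.
int count_in_bucket(const char* const* keys, size_t n,
                    int a, int m, int bucket) {
    assert(a > 0 && m > 0);
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        if (sketch_hash(keys[i], a, m) == bucket) {
            count++;
        }
    }
    return count;
}
```

With `a = 151` and `m = 53`, the single-character keys `"a"` (code 97) and `","` (code 44) both land in bucket 44, because `97 % 53 = 44`. Two keys are enough to see the effect, and an attacker would collect many more.
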
Next section: [Handling collisions](/collisions)
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)
50 changes: 50 additions & 0 deletions 04-collisions/README.md
@@ -0,0 +1,50 @@
# Handling collisions

Hash functions map an infinitely large number of inputs to a finite number of
outputs. Different input keys will map to the same array index, causing
bucket collisions. Hash tables must implement some method of dealing with
collisions.

Our hash table will handle collisions using a technique called open addressing
with double hashing. Double hashing makes use of two hash functions to
calculate the index an item should be stored at after `i` collisions.

For an overview of other types of collision resolution, see the
[appendix](/07-appendix).

## Double hashing

The index that should be used after `i` collisions is given by:

```
index = (hash_a(string) + i * hash_b(string)) % num_buckets
```

We see that if no collisions have occurred, `i = 0`, so the index is just
`hash_a` of the string. If a collision happens, the index is offset by a
multiple of `hash_b`.

It is possible that `hash_b` will return 0, reducing the second term to 0. This
will cause the hash table to try to insert the item into the same bucket over
and over. We can mitigate this by adding 1 to the result of the second hash,
making sure it's never 0.

```
index = (hash_a(string) + i * (hash_b(string) + 1)) % num_buckets
```

## Implementation

```c
// hash_table.c
// HT_PRIME_1 and HT_PRIME_2 are assumed to be two different primes, each
// larger than the alphabet size, defined elsewhere in hash_table.c.
static int ht_get_hash(
    const char* s, const int num_buckets, const int attempt
) {
    const int hash_a = ht_hash(s, HT_PRIME_1, num_buckets);
    const int hash_b = ht_hash(s, HT_PRIME_2, num_buckets);
    return (hash_a + (attempt * (hash_b + 1))) % num_buckets;
}
```
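
To see the probe sequence this formula produces, here is a minimal sketch that takes the two hash values as plain integers. `probe_index` and `find_free_bucket` are hypothetical helper names, and the `occupied` array is a stand-in for checking whether a bucket in the real table is `NULL`.

```c
#include <assert.h>

// Index to try after `attempt` collisions, given the two hash values.
// Mirrors the double hashing formula above.
int probe_index(int hash_a, int hash_b, int attempt, int num_buckets) {
    assert(num_buckets > 0 && attempt >= 0);
    return (hash_a + attempt * (hash_b + 1)) % num_buckets;
}

// Walks the probe sequence until it finds a free bucket.
// occupied[i] is non-zero when bucket i is already taken.
// Returns the free index, or -1 if none is found in num_buckets attempts.
int find_free_bucket(const int* occupied, int hash_a, int hash_b,
                     int num_buckets) {
    for (int attempt = 0; attempt < num_buckets; attempt++) {
        const int index = probe_index(hash_a, hash_b, attempt, num_buckets);
        if (!occupied[index]) {
            return index;
        }
    }
    return -1;
}
```

For example, with `hash_a = 5`, `hash_b = 21` and 53 buckets, the first four indexes tried are 5, 27, 49 and 18. One edge case this sketch does not address: if `hash_b + 1` equals `num_buckets`, the step wraps to 0 and every attempt lands on the same bucket.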

Next section: [Hash table methods](/methods)
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)