Add first draft

jamesroutley · Aug 23, 2017 · a9dfe15 · melko · Aug 23, 2017 · a9dfe15
1 parent b75c360
commit a9dfe15
Showing 8 changed files with 796 additions and 0 deletions.
diff --git a/01-introduction/README.md b/01-introduction/README.md
@@ -0,0 +1,76 @@
+# Introduction
+
+A hash table is a data structure which offers a fast implementation of the
+associative array [API](#api). As the terminology around hash tables can be
+confusing, I've added a summary [below](#terminology).
+
+A hash table consists of an array of 'buckets', each of which stores a key-value
+pair. In order to locate the bucket where a key-value pair should be stored, the
+key is passed through a hashing function. This function returns an integer which
+is used as the pair's index in the array of buckets. When we want to retrieve a
+key-value pair, we supply the key to the same hashing function, receive its
+index, and use the index to find it in the array.
+
+Array indexing has algorithmic complexity `O(1)`, making hash tables fast at
+storing and retrieving data.
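
As a rough illustration of the path described above (key, then hash, then index, then bucket), here is a minimal C sketch. The names `toy_bucket`, `toy_hash`, `toy_put` and `toy_get`, and the fixed `TOY_NUM_BUCKETS`, are hypothetical and not part of the tutorial's code. The hash is a deliberately simple placeholder, and collisions are ignored: a later key simply overwrites an earlier one that lands in the same bucket.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NUM_BUCKETS 53

/* One bucket holds at most one key-value pair; key == NULL means empty. */
typedef struct {
    const char* key;
    const char* value;
} toy_bucket;

/* File-scope arrays are zero-initialised, so every bucket starts empty. */
toy_bucket toy_buckets[TOY_NUM_BUCKETS];

/* Placeholder hash: sum of character codes, reduced to a bucket index. */
int toy_hash(const char* key) {
    assert(key != NULL);
    unsigned long sum = 0;
    for (size_t i = 0; key[i] != '\0'; i++) {
        sum += (unsigned char)key[i];
    }
    return (int)(sum % TOY_NUM_BUCKETS);
}

/* Store: hash the key, then write straight into that array slot. */
void toy_put(const char* key, const char* value) {
    toy_bucket* b = &toy_buckets[toy_hash(key)];
    b->key = key;
    b->value = value;
}

/* Retrieve: hash the same key again and read the same slot. */
const char* toy_get(const char* key) {
    const toy_bucket* b = &toy_buckets[toy_hash(key)];
    if (b->key != NULL && strcmp(b->key, key) == 0) {
        return b->value;
    }
    return NULL;
}
```

Neither `toy_put` nor `toy_get` loops over the buckets; the only work that grows with input size is hashing the key itself.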
+
+Our hash table will map string keys to string values, but the principles
+given here are applicable to hash tables which map arbitrary key types to
+arbitrary value types. Only ASCII strings will be supported, as supporting
+Unicode is non-trivial and outside the scope of this tutorial.
+
+## API
+
+Associative arrays are a collection of unordered key-value pairs. Duplicate keys
+are not permitted. The following operations are supported:
+
+- `search(a, k)`: return the value `v` associated with key `k` from the
+  associative array `a`, or `NULL` if the key does not exist.
+- `insert(a, k, v)`: store the pair `k:v` in the associative array `a`.
+- `delete(a, k)`: delete the `k:v` pair associated with `k`, or do nothing if
+  `k` does not exist.
+
+## Setup
+
+To set up C on your computer, please consult [Daniel Holden's](@orangeduck)
+guide in the [Build Your Own
+Lisp](http://www.buildyourownlisp.com/chapter2_installation) book.  Build Your
+Own Lisp is a great book, and I recommend working through it.
+
+## Code structure
+
+Code should be laid out in the following directory structure.
+
+```
+.
+├── build
+└── src
+    ├── hash_table.c
+    ├── hash_table.h
+    ├── prime.c
+    └── prime.h
+```
+
+`src` will contain our code, and `build` will contain our compiled binaries.
+
+## Terminology
+
+There are lots of names which are used interchangeably. In this article, we'll
+use the following:
+
+- Associative array: an abstract data structure which implements the
+  [API](#api) described above. Also called a map, symbol table or
+  dictionary.
+
+- Hash table: a fast implementation of the associative array API which makes
+  use of a hash function. Also called a hash map, map, hash or
+  dictionary.
+
+Associative arrays can be implemented with many different underlying data
+structures. A (non-performant) one can be implemented by simply storing items in
+an array, and iterating through the array when searching. Associative arrays and
+hash tables are often confused because associative arrays are so often
+implemented as hash tables.
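
To make the distinction concrete, here is a minimal sketch of the (non-performant) array-backed associative array mentioned above, where every search is a linear scan. All names (`aa_array`, `aa_find`, `aa_search`, `aa_insert`, `aa_delete`) and the fixed `AA_CAPACITY` are hypothetical, and the sketch stores the caller's string pointers rather than copies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AA_CAPACITY 64

typedef struct {
    const char* keys[AA_CAPACITY];
    const char* values[AA_CAPACITY];
    int count;
} aa_array;

/* Linear scan for k: returns its position, or -1 if it is absent. */
int aa_find(const aa_array* a, const char* k) {
    assert(a != NULL && k != NULL);
    for (int i = 0; i < a->count; i++) {
        if (strcmp(a->keys[i], k) == 0) {
            return i;
        }
    }
    return -1;
}

/* search(a, k): the value stored for k, or NULL if k does not exist. */
const char* aa_search(const aa_array* a, const char* k) {
    const int i = aa_find(a, k);
    return i >= 0 ? a->values[i] : NULL;
}

/* insert(a, k, v): overwrite an existing key, otherwise append.
   Returns 0 on success, or -1 if the array is full. */
int aa_insert(aa_array* a, const char* k, const char* v) {
    const int i = aa_find(a, k);
    if (i >= 0) {
        a->values[i] = v;
        return 0;
    }
    if (a->count == AA_CAPACITY) {
        return -1;
    }
    a->keys[a->count] = k;
    a->values[a->count] = v;
    a->count++;
    return 0;
}

/* delete(a, k): remove k by moving the last pair into its slot. */
void aa_delete(aa_array* a, const char* k) {
    const int i = aa_find(a, k);
    if (i < 0) {
        return;
    }
    a->count--;
    a->keys[i] = a->keys[a->count];
    a->values[i] = a->values[a->count];
}
```

This satisfies the same API as a hash table, but each operation costs `O(n)` because of the scan in `aa_find`.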
+
+Next section: [Hash table structure](/hash-table)
+[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)
diff --git a/02-hash-table/README.md b/02-hash-table/README.md
@@ -0,0 +1,104 @@
+# Hash table structure
+
+Our key-value pairs (items) will each be stored in a `struct`:
+
+```c
+// hash_table.h
+typedef struct ht_item {
+    char* key;
+    char* value;
+} ht_item;
+```
+
+Our hash table stores an array of pointers to items, and some details about its
+size and how full it is:
+
+```c
+// hash_table.h
+typedef struct {
+    int size;
+    int count;
+    ht_item** items;
+} ht_hash_table;
+```
+
+## Initialising and deleting
+
+We need to define an initialisation function for `ht_item`s. This function
+allocates a chunk of memory the size of an `ht_item`, and saves a copy of the
+strings `k` and `v` in the new chunk of memory. The function is marked as
+`static` because it will only ever be called by code internal to the hash table.
+
+```c
+// hash_table.c
+#include <stdlib.h>
+#include <string.h>
+
+#include "hash_table.h"
+
+static ht_item* ht_new_item(const char* k, const char* v) {
+    ht_item* i = malloc(sizeof(ht_item));
+    i->key = strdup(k);
+    i->value = strdup(v);
+    return i;
+}
+```
+
+`ht_new` initialises a new hash table. `size` defines how many items we can
+store. This is fixed at 53 for now. We'll expand this in the section on
+[resizing](/resizing). We initialise the array of items with `calloc`, which
+fills the allocated memory with `NULL` bytes. A `NULL` entry in the array
+indicates that the bucket is empty.
+
+```c
+// hash_table.c
+ht_hash_table* ht_new() {
+    ht_hash_table* ht = malloc(sizeof(ht_hash_table));
+
+    ht->size = 53;
+    ht->count = 0;
+    ht->items = calloc((size_t)ht->size, sizeof(ht_item*));
+    return ht;
+}
+```
+
+We also need functions for deleting `ht_item`s and `ht_hash_table`s, which
+`free` the memory we've allocated, so we don't cause [memory
+leaks](https://en.wikipedia.org/wiki/Memory_leak).
+
+```c
+// hash_table.c
+static void ht_del_item(ht_item* i) {
+    free(i->key);
+    free(i->value);
+    free(i);
+}
+
+
+void ht_del_hash_table(ht_hash_table* ht) {
+    for (int i = 0; i < ht->size; i++) {
+        ht_item* item = ht->items[i];
+        if (item != NULL) {
+            ht_del_item(item);
+        }
+    }
+    free(ht->items);
+    free(ht);
+}
+```
+
+We have written code which defines a hash table, and lets us create and destroy
+one. Although it doesn't do much at this point, we can still try it out.
+
+```c
+// main.c
+#include "hash_table.h"
+
+int main() {
+    ht_hash_table* ht = ht_new();
+    ht_del_hash_table(ht);
+}
+```
+
+Next section: [Hash functions](/hashing)
+[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)
diff --git a/03-hashing/README.md b/03-hashing/README.md
@@ -0,0 +1,93 @@
+# Hash function
+
+In this section, we'll write our hash function. 
+
+The hash function we choose should:
+
+- Take a string as its input and return a number between `0` and `m - 1`, where
+  `m` is our desired bucket array length.
+- Return an even distribution of bucket indexes for an average set of inputs. If
+  our hash function is unevenly distributed, it will put more items in some
+  buckets than others. This will lead to a higher rate of
+  [collisions](#collisions). Collisions reduce the efficiency of our hash table.
+
+## Algorithm
+
+We'll make use of a generic string hashing function, expressed below in
+pseudocode.
+
+```
+function hash(string, a, num_buckets):
+    hash = 0
+    string_len = length(string)
+    for i = 0, 1, ..., string_len - 1:
+        hash += (a ** (string_len - (i+1))) * char_code(string[i])
+    hash = hash % num_buckets
+    return hash
+```
+
+This hash function has two steps:
+
+1. Convert the string to a large integer
+2. Reduce the size of the integer to a fixed range by taking its remainder `mod`
+   `m`
+
+The variable `a` should be a prime number larger than the size of the alphabet.
+We're hashing ASCII strings, which have an alphabet size of 128, so we should
+choose a prime larger than that. 
+
+`char_code` is a function which returns an integer which represents the
+character. We'll use ASCII character codes for this.
+
+Let's try the hash function out:
+
+```
+hash("cat", 151, 53)
+
+hash = (151**2 * 99 + 151**1 * 97 + 151**0 * 116) % 53
+hash = (2257299 + 14647 + 116) % 53
+hash = 2272062 % 53
+hash = 5
+```
+
+Changing the value of `a` gives us a different hash function.
+
+```
+hash("cat", 163, 53) = 21
+```
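
The arithmetic above can be checked with a short sketch. It assumes the same polynomial hash, but evaluates it with Horner's rule and reduces modulo `m` at every step, so no floating-point `pow` call is needed and the running value stays small. The function name `demo_hash` is hypothetical and is not the tutorial's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Computes (a^(n-1)*s[0] + a^(n-2)*s[1] + ... + a^0*s[n-1]) mod m, where n
   is the string length. Horner's rule rewrites the sum as repeated
   "multiply by a, add the next character", reducing mod m each time. */
int demo_hash(const char* s, int a, int m) {
    assert(s != NULL && a > 0 && m > 0);
    long hash = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        hash = (hash * a + (unsigned char)s[i]) % m;
    }
    return (int)hash;
}
```

With this sketch, `demo_hash("cat", 151, 53)` evaluates to `5`, matching the worked example for `a = 151`.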
+
+## Implementation
+
+```c
+// hash_table.c
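+// Note: pow() is declared in <math.h>, which must be included at the top of
+// hash_table.c (some toolchains also need -lm when linking).
+static int ht_hash(const char* s, const int a, const int m) {
+    long hash = 0;
+    const int len_s = (int)strlen(s);
+    for (int i = 0; i < len_s; i++) {
+        // Add this character's term, a^(len_s - (i+1)) * s[i], then reduce
+        // modulo m so the running total stays below m.
+        hash += (long)pow(a, len_s - (i+1)) * s[i];
+        hash = hash % m;
+    }
+    return (int)hash;
+}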
+```
+
+## Pathological data
+
+An ideal hash function would always return an even distribution. However, for
+any hash function, there is a 'pathological' set of inputs, which all hash to
+the same value. To find this set of inputs, run a large set of inputs through
+the function. All inputs which hash to a particular bucket form a pathological
+set.
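
A brute-force search for such a set can be sketched as follows. The function names and the choice of three-letter lowercase keys are illustrative assumptions; the hash is the same polynomial string hash, evaluated here with integer arithmetic only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Polynomial string hash, reduced modulo m at every step. */
int sketch_hash(const char* s, int a, int m) {
    long hash = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        hash = (hash * a + (unsigned char)s[i]) % m;
    }
    return (int)hash;
}

/* Enumerates "aaa" to "zzz" and copies up to max_keys keys whose hash equals
   target into out. Returns the number of keys stored. */
int find_colliding_keys(int a, int m, int target, char out[][4], int max_keys) {
    assert(m > 0 && target >= 0 && target < m);
    int found = 0;
    char key[4] = "aaa";
    for (char x = 'a'; x <= 'z'; x++) {
        for (char y = 'a'; y <= 'z'; y++) {
            for (char z = 'a'; z <= 'z'; z++) {
                key[0] = x;
                key[1] = y;
                key[2] = z;
                if (found < max_keys && sketch_hash(key, a, m) == target) {
                    memcpy(out[found], key, sizeof key);
                    found++;
                }
            }
        }
    }
    return found;
}
```

Every key written to `out` lands in the same bucket, so inserting them all into a table that uses this hash with the same `a` and `m` would make each one collide with the others.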
+
+The existence of pathological input sets means there are no perfect hash
+functions for all inputs. The best we can do is to create a function which
+performs well for the expected data set.
+
+Pathological inputs also pose a security issue. If a hash table is fed a set of
+colliding keys by some malicious user, then searches for those keys will take
+much longer (`O(n)`) than normal (`O(1)`). This can be used as a denial of
+service attack against systems which are underpinned by hash tables, such as DNS
+and certain web services.
+
+Next section: [Handling collisions](/collisions)
+[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)
diff --git a/04-collisions/README.md b/04-collisions/README.md
@@ -0,0 +1,50 @@
+# Handling collisions
+
+Hash functions map an infinitely large number of inputs to a finite number of
+outputs. Different input keys will map to the same array index, causing
+bucket collisions. Hash tables must implement some method of dealing with
+collisions. 
+
+Our hash table will handle collisions using a technique called open addressing
+with double hashing. Double hashing makes use of two hash functions to
+calculate the index an item should be stored at after `i` collisions.
+
+For an overview of other types of collision resolution, see the
+[appendix](/07-appendix).
+
+## Double hashing
+
+The index that should be used after `i` collisions is given by:
+
+```
+index = (hash_a(string) + i * hash_b(string)) % num_buckets
+```
+
+We see that if no collisions have occurred, `i = 0`, so the index is just
+`hash_a` of the string. After each collision, the index is offset by a further
+multiple of `hash_b`.
+
+It is possible that `hash_b` will return 0, reducing the second term to 0. This
+will cause the hash table to try to insert the item into the same bucket over
+and over. We can mitigate this by adding 1 to the result of the second hash,
+making sure it's never 0.
+
+```
+index = (hash_a(string) + i * (hash_b(string) + 1)) % num_buckets
+```
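
A minimal sketch of the resulting probe sequence, under stated assumptions: the two hash values are passed in directly rather than computed, and the names `probe_index` and `count_distinct_probes` are hypothetical. The second function counts how many different buckets the first `num_buckets` attempts reach.

```c
#include <assert.h>
#include <stdlib.h>

/* Index to try after `attempt` collisions, given the two hash values. */
int probe_index(int hash_a, int hash_b, int attempt, int num_buckets) {
    assert(num_buckets > 0 && hash_a >= 0 && hash_b >= 0 && attempt >= 0);
    return (hash_a + attempt * (hash_b + 1)) % num_buckets;
}

/* Counts the distinct buckets visited by attempts 0 .. num_buckets - 1.
   Returns -1 if the scratch array cannot be allocated. */
int count_distinct_probes(int hash_a, int hash_b, int num_buckets) {
    unsigned char* visited = calloc((size_t)num_buckets, 1);
    if (visited == NULL) {
        return -1;
    }
    int distinct = 0;
    for (int attempt = 0; attempt < num_buckets; attempt++) {
        const int index = probe_index(hash_a, hash_b, attempt, num_buckets);
        if (!visited[index]) {
            visited[index] = 1;
            distinct++;
        }
    }
    free(visited);
    return distinct;
}
```

For example, `probe_index(5, 20, 1, 53)` is `(5 + 21) % 53 = 26`, and because 53 is prime, `count_distinct_probes(5, 20, 53)` reaches all 53 buckets. One edge case is worth keeping in mind: if `hash_b` equals `num_buckets - 1`, the step becomes `num_buckets`, which is 0 modulo `num_buckets`, so `count_distinct_probes(5, 52, 53)` returns 1.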
+
+## Implementation
+
+```c
+// hash_table.c
+// HT_PRIME_1 and HT_PRIME_2 are two distinct primes larger than 128, one for
+// each hash function (their definitions are not shown in this section).
+static int ht_get_hash(
+    const char* s, const int num_buckets, const int attempt
+) {
+    const int hash_a = ht_hash(s, HT_PRIME_1, num_buckets);
+    const int hash_b = ht_hash(s, HT_PRIME_2, num_buckets);
+    return (hash_a + (attempt * (hash_b + 1))) % num_buckets;
+}
+```
+
+Next section: [Hash table methods](/methods)
+[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)