go-bbhash - Fast, Scalable Minimal Perfect Hash for Large Sets

What is it?

A library to create, query and serialize/de-serialize minimal perfect hash functions over very large key sets.

This is an implementation of this paper. It is in part inspired by Damien Gryski's Boomphf - this implementation differs from Boomphf in one significant way - this library adds an efficient serialization & deserialization API.

The library exposes the following types:

  • BBHash: Represents an instance of a minimal perfect hash function as described in the paper above.
  • DBWriter: Used to construct a constant database of key-value pairs - where the lookup of a given key is done in constant time using BBHash. Essentially, this type serializes a collection of key-value pairs using BBHash as the underlying index.
  • DBReader: Used for looking up key-values from a previously constructed (serialized) database.

NOTE Minimal Perfect Hash functions are built over a fixed input set and produce a mapping that looks up those items in constant time. In particular, they are NOT a replacement for a traditional hash table: a lookup may yield false positives when queried with keys that were not present during construction. In concrete terms:

Let S = {k0, k1, ... kn} be your input key set.

If H: S -> {0, .. n} is a minimal perfect hash function, then H(kx) for kx NOT in S may yield an integer result (indicating that kx was successfully "looked up").

Thus, if users of BBHash cannot guarantee that the keys passed to such a lookup function were part of the original set, they should add a comparison against the actual key to verify the result. Look at dbreader.go:Find() for an example, or see the sketch below.
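
For illustration, here is a minimal sketch of such a verification step. It assumes the caller keeps the original keys around and records each one at its MPH slot after construction; the bb, keys and candidate names are illustrative (dbreader.go:Find() performs the equivalent comparison against the key stored in the DB):

    // bb is a *bbhash.BBHash built from keys (see "Basic Usage" below).
    // Record each original key at its MPH slot so lookups can be verified.
    table := make([]uint64, len(keys))
    for _, k := range keys {
        table[bb.Find(k)-1] = k
    }

    // Looking up a key that may not have been in the original set:
    i := bb.Find(candidate)
    if i == 0 || i > uint64(len(keys)) || table[i-1] != candidate {
        fmt.Printf("%#x was not part of the original key set\n", candidate)
    }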

How do I use it?

Like any other Go library: go get github.com/opencoff/go-bbhash.

Example Program

There is a working example of the DBWriter and DBReader interfaces in the file example/mphdb.go. This example demonstrates the following functionality:

  • add one or more space-delimited key/value files (the first field is the key, the second field is the value)
  • add one or more CSV files (the first field is the key, the second field is the value)
  • write the resulting MPH DB to disk
  • read the DB back and verify its integrity

First, let's run the tests and make sure bbhash is working fine:

  $ git clone https://github.com/opencoff/go-bbhash
  $ cd go-bbhash
  $ make test

Now, let's build and run the example program:

  $ make
  $ ./mphdb -h

There is a helper Python script, genhosts.py, to generate a very large text file of hostnames and IP addresses. You can run it like so:

  $ python ./example/genhosts.py 192.168.0.0/16 > a.txt

The above example generates 65535 hostnames and their corresponding IP addresses; each IP address is drawn sequentially from the subnet.

NOTE If you use a "/8" subnet mask you will generate a lot of data (~430MB in size).

Once you have the input generated, you can feed it to the example program above to generate a MPH DB:

  $ ./mphdb foo.db a.txt
  $ ./mphdb -V foo.db

It is possible that mphdb fails to construct a DB and complains that gamma is too small. In that case, try increasing it via the "-g" flag like so:

  $ ./mphdb -g 2.75 foo.db a.txt

Basic Usage of BBHash

Assuming you have read your keys and hashed them into uint64 values, this is how you use the library:

        bb, err := bbhash.New(2.0, keys)
        if err != nil { panic(err) }

        // Now, call Find() with each key to get its unique mapping.
        // Note: Find() returns values in the closed interval [1, len(keys)]
        for i, k := range keys {
                j := bb.Find(k)
                fmt.Printf("%d: %#x maps to %d\n", i, k, j)
        }
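
If your keys start out as byte strings rather than uint64 values, you need to hash them first. Below is a minimal sketch using the standard library's hash/fnv; the choice of hash here is only an assumption for illustration (DBWriter uses a different hash internally, as noted under Implementation Notes):

    import "hash/fnv"

    // hashKey maps an arbitrary string to a uint64 suitable for bbhash.New().
    func hashKey(s string) uint64 {
        h := fnv.New64a()
        h.Write([]byte(s))
        return h.Sum64()
    }

    // words is an illustrative []string holding the raw keys.
    keys := make([]uint64, 0, len(words))
    for _, w := range words {
        keys = append(keys, hashKey(w))
    }
    bb, err := bbhash.New(2.0, keys)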

Writing a DB Once, Looking Up Many Times

One can construct an on-disk, constant-time lookup DB using BBHash as the underlying indexing mechanism. Such a DB is useful when the key/value pairs change infrequently, i.e., for read-dominant workloads. The typical pattern in such situations is to build the constant DB once and then do lookups many times.

Step-1: Construct the DB from multiple sources

For example, let us suppose that the files a.txt and b.csv contain many key/value pairs. We will build a constant DB from them.

    wr, err := bbhash.NewDBWriter("file.db")
    if err != nil { panic(err) }

    // add a.txt and b.csv to this db

    // txt file delimited by white space;
    // first token is the key, second token is the value
    n, err := wr.AddTextFile("a.txt", " \t")
    if err != nil { panic(err) }
    fmt.Printf("a.txt: %d records added\n", n)

    // CSV file - comma delimited
    // lines starting with '#' are considered comments
    // field 0 is the key; and field 1 is the value.
    // The first line is assumed to be a header and ignored.
    n, err = wr.AddCSVFile("b.csv", ',', '#', 0, 1)
    if err != nil { panic(err) }
    fmt.Printf("b.csv: %d records added\n", n)

    // Now, freeze the DB and write to disk.
    // We will use a larger "gamma" value to increase chances of
    // finding a minimal perfect hash function.
    err = wr.Freeze(3.0)
    if err != nil { panic(err) }

Now, file.db has the key/value pairs from the two input files stored in an efficient format for constant-time retrieval.

Step-2: Looking up Key in the DB

Continuing the above example, suppose that you want to use the constructed DB for repeated lookups of various keys and retrieve their corresponding values:

    // read 'file.db' and cache up to 10,000
    // records in memory.
    rd, err := bbhash.NewDBReader("file.db", 10000)
    if err != nil { panic(err) }

Now, given a key k, we can use rd to lookup the corresponding value:

    val, err := rd.Find(k)

    if err != nil {
        if err == bbhash.ErrNoKey {
            fmt.Printf("Key %x is not in the DB\n", k)
        } else {
            fmt.Printf("Error: %s\n", err)
        }
    } else {
        fmt.Printf("Key %x => Value %x\n", k, val)
    }

Implementation Notes

  • For constructing the BBHash, keys are uint64; the DBWriter implementation uses Zi Long Tan's superfast hash function to transform arbitrary bytes to uint64.

  • The perfect-hash index for each key is "1" based (i.e., it is in the closed interval [1, len(keys)]); see the sketch below for converting it to a 0-based slice index.
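
For example, when the values live in an ordinary Go slice, subtract one from the index returned by Find() before using it (a small sketch; vals is an illustrative slice built by the caller):

    // Find() returns a 1-based index; Go slices are 0-based.
    j := bb.Find(k)
    value := vals[j-1]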

License

GPL v2.0