
Sharding IPLD objects #76

Open

jbenet opened this issue Dec 2, 2015 · 24 comments

@jbenet
Member

jbenet commented Dec 2, 2015

This note gathers the constraints for, and will drive toward a design of, object sharding in IPFS and IPLD. Object sharding covers the algorithms and formats used to represent a single (virtual) large object out of many smaller ones. Think of it like the way large directories are represented in modern filesystems (RB-Trees, B-Trees, HTrees, etc).

Sharding IPLD objects in general is useful. Instead of implementing it for unixfs and other datastructs each time, we could implement it once. It could be a datastruct the others employ, or maybe -- if it is simple enough -- it belongs in IPLD itself.

Constraints to support:

  • efficient in the small case (1 to 5 nodes)
  • allows user-chosen sharding (eg for small numbers of nodes, may want specific construction)
  • large fanouts (millions or billions)
  • efficient access
  • minimize insertion re-writes (shadowing/cloning)
  • upgradeable algorithms (can signal which sharding algo via version, or even with a key/val)
  • union style fanouts
  • hierarchical style fanouts (patricia tries)

For large fanouts, look at


case for supporting it on top of IPLD

  • It is nice that the IPLD spec is very simple. Finding a nice way to support this without complicating it much will be hard -- the constraints above do not bode well for this.
  • we can define it as a separate datastruct; it should not be hard for other datastructs to extend it
  • flexible algorithms for sharding may complicate IPLD

case for supporting it in IPLD

  • we could have a very powerful datastructure if sharding were available everywhere
  • merkle-linking in IPLD is already like hierarchical fanout sharding of a single massive tree; this is just sharding within a single level.
  • IPLD already has flexible algos in multicodec
  • could use a directive like @shard or something
  • could be an IPLD extension if not properly in core spec.

cc @whyrusleeping @lgierth @diasdavid @cryptix @ion1 @mildred @tv42 @wking

@ion1

ion1 commented Dec 3, 2015

It might be useful to take advantage of rolling-hash chunking for large flat data structures that receive small arbitrary changes, such as a directory.

@davidar
Member

davidar commented Dec 17, 2015

Radix trees have the benefit of being deterministic (invariant to the order of insertions), which helps with deduplication.

Edit: also see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452

@jbenet
Member Author

jbenet commented Jan 19, 2016

cc @whyrusleeping comment here.

@jbenet
Member Author

jbenet commented Jan 19, 2016

Starting from the IPLD spec, one very trivial but extensible thing to do across all IPLD objects is something like this:

{
  "@extends": [
    "/ipfs/<hash1>",
    "/ipfs/<hash2>",
    "/ipfs/<hash3>"
  ],
  ... // more properties
}

which would merge the objects like so:

o1 = get(/ipfs/<hash1>)
o2 = get(/ipfs/<hash2>)
o3 = get(/ipfs/<hash3>)
o4 = get(/ipfs/<hash4>) // our object above

merged = _.extends(o1, o2, o3, o4)

meaning that we start with the first object and patch in order, so a property in o4 replaces the same property in any of the preceding objects. This is very simple and can likely be used to support the needs mentioned earlier in this issue (more complicated structures).
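For illustration, a minimal sketch of these merge semantics in JS (the resolveExtends helper and the get function are hypothetical; Object.assign has the same last-one-wins behavior as the lodash-style extend above):

// hypothetical helper: fetch the objects listed in "@extends" and merge them
// in order, with the extending object's own properties applied last
async function resolveExtends (obj, get) {     // `get` resolves an /ipfs/... path to an object
  const bases = await Promise.all((obj['@extends'] || []).map(get))
  return Object.assign({}, ...bases, obj)      // later properties overwrite earlier ones
}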


It would be useful to have examples of directory sharding with:

  • B-trees + shadowing + clones
  • a Radix tree

expressed as the corresponding JSON or YML syntax for IPLD (i use JSON above because it tends to be clearer, but i prefer YML). That way we can see what it takes to support it natively in IPLD.

@davidar
Member

davidar commented Jan 19, 2016

Personally I'd like it if this was a core part of IPLD, so that large objects were transparently sharded without the user having to explicitly do anything (much like the transparent merkle-linking). Here's an example of a manual radix tree though:

r:
  om:
    an:
      e: 1
      us: 2
    ulus: 3
  ub:
    e:
      ns: 4
      r: 5
    ic:
      on: 6
      undus: 7

Sub-trees can be split into separate (merkle-linked) objects, analogously to what we already do with chunking linear streams.
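For illustration, a sketch of the same tree with one sub-tree split out into its own object, using the {"/": <hash>} link notation that appears later in this thread (the hash is just a placeholder):

r:
  om:
    an:
      e: 1
      us: 2
    ulus: 3
  ub: {"/": "Qm..."}  # merkle link to a separate object holding the "ub" sub-tree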

@jbenet
Member Author

jbenet commented Jan 19, 2016

@mildred

mildred commented Jan 19, 2016

I think this should not be implemented below the basic IPLD data model: you want, depending on the application, to control where to split objects. For example, metadata should be easily accessible while the complete file data could take a little more time to load.

I think this should be an extension of IPLD. And you could have the choice to:

  • fetch only the requested object
  • fetch the logical record composed of multiple objects reassembled into a bigger one

It might be necessary to think about exactly how we are going to reassemble the objects, and with which algorithms in particular, especially considering that it's not just about tree merging but also about data-string merging.

For example, an object might be composed of a huge binary data string (for example the object representing a huge file). This data string should be able to be split across different physical IPLD objects.

The way I would implement it is to have index objects that link to sub-parts, and to specify with each link the way they are merged. Something like what jbenet suggested.
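For illustration, a rough sketch of what such an index object could look like (all field names here are hypothetical, not a proposal for the actual format):

{
  "parts": [
    {"link": {"/": "<hash1>"}, "merge": "concat"},  // raw data string, appended to the record
    {"link": {"/": "<hash2>"}, "merge": "extend"}   // tree/object, patched over the base
  ]
}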

@davidar I disagree with you when you say that it should be completely transparent. The application should have control over the splitting of files. We could have a library that does a good job of splitting when the application author doesn't want to bother, but we should be able to have full control.

@jbenet
Member Author

jbenet commented Jan 24, 2016

Agree in full with @mildred

@jbenet
Member Author

jbenet commented Jun 22, 2016

Some links for references.

A tricky problem here is tuning the trade-off between "good datastructures that create very small nodes" and "good-sized objects that minimize hashing costs, cache misses, and network ops". Basically, there's sharding, which represents one object with many small ones, and then there's aggregation, which represents many small objects with a big one. Most good approaches to sharding from the data structure literature yield excellent datastructs with many small objects, which we'd want to aggregate. Aggregation is tricky, because making it stable (convergent, for hashing) and efficient (minimizing update costs) is challenging. It is not straightforward, because "bundling many objects in the underlying transport instead of in the data model" is not enough, as that does not minimize hashing and random access advertisements.

@jbenet
Member Author

jbenet commented Jun 22, 2016

More on HAMTs

@whyrusleeping
Member

@jbenet patricia trees would be cool. But figuring out how to aggregate them in a way that is stable after random insertion orders is going to be very tricky.

The HAMT sounds a lot like what you were hacking on in SF, except for the part about counting bits for non-nil pointers. With that trick, we avoid having hugely bloated nodes, which was my primary complaint with your previous idea.
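For reference, a sketch of that bitmap + popcount trick in JS (not taken from any particular implementation): each node stores a bitmap of which of the 32 possible slots are occupied, plus a dense array of only the occupied children, and a child's position in that array is the number of set bits below its slot.

// population count: number of set bits in a 32-bit integer
function popcount (x) {
  x = x - ((x >>> 1) & 0x55555555)
  x = (x & 0x33333333) + ((x >>> 2) & 0x33333333)
  x = (x + (x >>> 4)) & 0x0f0f0f0f
  return (x * 0x01010101) >>> 24
}

// index into the node's dense children array for a 5-bit hash fragment,
// or -1 if that slot is empty
function childIndex (bitmap, slot) {
  const bit = 1 << slot
  if ((bitmap & bit) === 0) return -1
  return popcount(bitmap & (bit - 1))  // number of occupied slots before this one
}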

@whyrusleeping
Member

Looking at implementation details for the HAMT, I found this absurd 'population count' implementation: https://play.golang.org/p/U7SogJ7psJ

@jbenet
Member Author

jbenet commented Jun 29, 2016

Yeah, HAMT is the way to go. I love it when we find the trails of other researchers and realize they've been paved pretty well, and that they were after exactly the same thing.

@jbenet
Member Author

jbenet commented Jun 29, 2016

btw i pushed the js thing i was making: https://github.com/jbenet/ipld-ht-experiment

Here are some interesting questions that were raised:

Parameters

  • the max bucket size (fanout) should be configurable, and stored in the HT root node.
  • likely HAMT params should be configurable

Paging

  • keeping nodes in memory is a tough choice: we will want to amortize loading things, and will also want to "page things out".
  • we're going to run into this everywhere.
  • ex: mfs pages out files, but maybe keeps dirs in memory for editing. What happens when the dir hierarchy is too big? It will need to page out too.
  • we should be able to use these algos while restricting their memory usage, and have them respond to memory pressure (on the proc, and just programmatically) -- this suggests a sub-proc abstraction.
  • at the very least, the algo specs should address how to deal with memory pressure, i.e. how to:
    • page in (load)
    • page out (should just be release for gc)

Writes

  • writing for all persistent datastructures involves:
    • bubbling writes (bubbling up to the root)
    • creating many intermediate objects
    • many of which are never pointed to by any externalized root. (meaning the objects are never accessible, so writing them is strictly a waste of resources.)
  • algo specs should describe how to do:
    • sync writes (that write all new objects to ipfs)
    • in memory only writes (avoid writing to ipfs until datastruct is finalized)
    • and coalesced writes (write, but defer writing when under heavy editing pressure)

Hashing

use non-cryptographic hash functions

  • other latencies in the read/write pipeline may obviate this
  • however, once a datastructure is paged in, reads should be able to be very fast

use seeds in children/intermediate HT nodes that are deterministic with respect to their position in the tree/branch.

  • HAMT does not need this because it uses the same hash from the root.

  • but note the HAMT papers say "when you run out of hash bits, get more hash bits".

  • this just means extending the hash function for the given seed in a deterministic way.

  • example 1:

    • good: hash(name, n) => hashfn(n + name) (start this way from 0).
    • bad: hash(name) => hashfn(name), then later, n > 0: hash(n + seed).
  • example 2 (good):

    // extend the hash deterministically when we run out of bits:
    // hash(name, n) applies the base hashfn n+1 times
    function hash (name, n) {
      if (n > 0)
        name = hash(name, n - 1)
      return hashfn(name)
    }

    // hash(name, 0) = hashfn(name)
    // hash(name, 1) = hashfn(hashfn(name))
    // hash(name, 2) = hashfn(hashfn(hashfn(name)))
    
  • @whyrusleeping confirm all this made sense? o/

Differentiating Leaves

  • algos should specify how to differentiate leaves. (i think HAMT does)

  • one easy way in IPLD objects though is this https://github.com/jbenet/ipld-ht-experiment/blob/master/rht.js#L137-L150 -- assume nodes that are just IPLD links are always intermediate nodes, and leaves are nested objects (which carry the name/key being inserted), e.g. the following (a check for this is sketched after the example):

    {
      // an intermediate datastructure node.
      "children": [
        {"/": "Qmaaa..."}, // just a link. intermediate
        {"/": "Qmbbb..."}, // just a link. intermediate
        {"/": "Qmccc..."}, // just a link. intermediate
        {"n": "key", "v": {"/": "Qmvalue"}}, // leaf node (value entry)
        {"/": "Qmeee..."} // intermediate
      ]
    }
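A sketch of the corresponding check, assuming entries are shaped like the example above (a bare {"/": ...} link vs. a {"n", "v"} leaf):

// an entry is an intermediate node iff it is a bare IPLD link object
function isIntermediate (entry) {
  const keys = Object.keys(entry)
  return keys.length === 1 && keys[0] === '/'
}

// leaves carry the inserted key ("n") and a link to the value ("v")
function isLeaf (entry) {
  return !isIntermediate(entry)
}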

@whyrusleeping
Member

@jbenet should the names on links pointing to the intermediate nodes just be their index? I think that makes the most sense.

@jbenet
Member Author

jbenet commented Jul 4, 2016

@whyrusleeping not sure what the index buys you, but sure, that works for the protobuf version, but not the IPLD version.

I actually think we should make ONE version (IPLD, naturally) and project it into protobuf, so that things are interoperable with IPLD from the get-go and we don't have two different HAMT implementations.

@jbenet
Member Author

jbenet commented Jul 8, 2016

Great article on map data structures for external memory — i.e. ways of storing indexed data that are optimized for storage on disk (or networks :) ).

@whyrusleeping
Member

@jbenet How does a sharded directory work with paths? Should paths be resolved in the context of unixfs? or strictly with merkledag links?

i.e. a sharded node has links:

{
  "Links": {
    "AB": { 'child shard' },
    "FE": { 'child shard' },
    "12": { // this is a child shard too, just showing it for the example
      "Links": {
        "D4ActualDirName": "dirhash"
      }
    }
  }
}

This would be a sharded directory that contains (among other things) a directory named "ActualDirName". To access it, I'd expect to be able to do /ipfs/<shardhash>/ActualDirName, but /ipfs/<shardhash>/12/D4ActualDirName is the actual path through the raw DAG.

@jbenet
Member Author

jbenet commented Aug 6, 2016

unixfs should allow resolving the virtual dag transparently.
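For illustration, a sketch of what that transparent resolution could look like under the naming scheme in the example above (prefix-only link names are child shards, prefix+name entries are the real children; hashPrefix, get and the per-level prefix scheme are assumptions for the sketch, not a spec):

// resolve `name` inside a sharded directory node, hiding the shard levels,
// so /ipfs/<shardhash>/ActualDirName works without spelling out /12/D4ActualDirName
async function resolveInShard (node, name, depth = 0) {
  const prefix = hashPrefix(name, depth)        // e.g. two hex chars of the name's hash at this depth
  for (const [label, child] of Object.entries(node.Links)) {
    if (label === prefix) {                     // prefix only: descend into the child shard
      return resolveInShard(await get(child), name, depth + 1)
    }
    if (label === prefix + name) {              // prefix + name: the actual entry
      return child
    }
  }
  throw new Error(name + ' not found in shard')
}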

@jbenet
Member Author

jbenet commented Aug 6, 2016

@daviddias
Member

An update on efficient immutable collections that covers HAMT and CHAMT:

https://michael.steindorfer.name/publications/phd-thesis-efficient-immutable-collections.pdf

HN discussion

@Kubuxu
Member

Kubuxu commented Aug 15, 2017

Getting HAMT into IPLD would be a major step in increasing its usability. It can be done in two ways: by introducing it at the resolver layer (where it would be transparent), or by writing out a clear spec and a few reference implementations at the application layer.

I was helping @magik6k work on Wikipedia Search using IPLD; he had to build his own hashmap structure (which was static, needed a full rebuild of everything, and had parameters that weren't autotuned).
This increased the time needed for that project significantly.

@Kubuxu
Member

Kubuxu commented Aug 15, 2017

Also, AFAIK we already partially use the principles of CHAMT, as we don't allocate the full array and we use bitmap masking and hashing.

@daviddias
Member

@Kubuxu agreed. I think the direction everyone wants to go in is to have two layers of IPLD: the first being the one we implemented (the resolver), and the second one that understands things like sharding and other data structs for custom path handling. I know that a few people put a lot of thought into this and we ended up deciding to implement the base layer first; now seems the right time to start figuring out that second layer.
