
Default 5 digit prefix for flatfs is no longer sufficient for CidV1 #3463

Closed
kevina opened this issue Dec 3, 2016 · 55 comments
Assignees
Labels
need/community-input Needs input from the wider community
Milestone

Comments

@kevina
Contributor

kevina commented Dec 3, 2016

When we switch to CidV1, the assumptions we built into the 5 (base32) digit prefix:

	// 5 bytes of prefix gives us 25 bits of freedom, 16 of which are taken
	// by the Qm prefix. Leaving us with 9 bits, or 512 way sharding

are no longer correct; in fact, when raw leaves are used, most of the blocks will be clustered in the AFZBE directory. See #3462. There are several ways to fix this:

  1. One way to fix this is to increase the prefix length, but then we would be back to the problem we had before where we will have lots of directories with just a few files in them (when CidV0 is used).
  2. Another alternative is proposed in go-ds-flatfs#5 (Consider using the last N digits instead of the first N for sharding): use the last N digits, most likely a 2 digit suffix, which will give us 1024 way sharding.
  3. Finally, and the most work, is to stick with using a prefix but implement a dynamic prefix width.

I think (2) would be the best alternative until (3) can be implemented.
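For illustration, here is a minimal sketch (not the actual flatfs code) of prefix sharding versus the suffix sharding of option (2) on a base32 key; the function names are hypothetical, and the example key is the raw-leaf key discussed later in this thread:

    package main

    import "fmt"

    // shardPrefix is the current scheme: the first n characters of the
    // base32-encoded key name the shard directory.
    func shardPrefix(key string, n int) string { return key[:n] }

    // shardSuffix is option (2): the last n characters name the shard directory.
    func shardSuffix(key string, n int) string { return key[len(key)-n:] }

    func main() {
        key := "AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY"
        fmt.Println(shardPrefix(key, 5)) // "AFZBE" -- the directory raw-leaf blocks cluster in
        fmt.Println(shardSuffix(key, 2)) // "AY"
    }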

@kevina kevina added the need/community-input Needs input from the wider community label Dec 3, 2016
@Kubuxu Kubuxu self-assigned this Dec 3, 2016
@kevina kevina changed the title Default 5 digit prefix for flatfs is no longer suffent for CidV1 Default 5 digit prefix for flatfs is no longer sufficient for CidV1 Dec 3, 2016
@kevina kevina added the status/in-progress In progress label Dec 5, 2016
@kevina
Contributor Author

kevina commented Dec 5, 2016

@whyrusleeping I can likely implement the last-two-digits option quickly; the dynamic solution will take more work and I don't think it should be in 0.4.5.

@kevina kevina added this to the ipfs 0.4.5 milestone Dec 5, 2016
@kevina
Contributor Author

kevina commented Dec 5, 2016

Also, as a followup from the sprint meeting: the flatfs datastore doesn't even implement range queries, so I don't think this will be an issue at all. Once we figure out how to properly implement something dynamic, then we can revisit the issue of range queries.

@Kubuxu Kubuxu removed their assignment Dec 5, 2016
@Kubuxu
Member

Kubuxu commented Dec 5, 2016

I agree that the dynamic solution might be too complex to tackle for 0.4.5.

@kevina kevina self-assigned this Dec 5, 2016
@jbenet
Member

jbenet commented Dec 6, 2016

the problem we had before where we will have lots of directories with just a few files in them

How big of a problem is this? Has anyone quantified it with a graph of read latency overhead, particularly for a node under lots of activity?

I don't think we should do (2). You're shifting the problem to an unconventional scheme without telling users that it has shifted. I may be fine with it if we add a file called .ipfs/blocks/_README that says something like:

This is a repository of IPLD objects. Each IPLD object is in a single file,
named `<hash-of-object>.ipld`. All the object files are placed in a tree
of directories, based on a function on the object hashes. This is a form 
of sharding similar to the objects directory in git repositories. Previously, 
we used prefixes. Now, we use suffixes. The exact function is:

    func FilePath(objectHash string) string {
      // e.g. the last three characters become nested directories: "a/b/c/<hash>.ipld"
      n := len(objectHash)
      return path.Join(string(objectHash[n-3]), string(objectHash[n-2]), string(objectHash[n-1]), objectHash+".ipld")
    }

For example, an object that hashes to:

    <hash of object>

Will be placed at

    a/b/c/objhashabc.ipld
  • the example should be real
  • the function should be correct (i sketched it)

I think this file (above) would make me feel OK about changing the function that figures out the location of the content.

@jbenet
Member

jbenet commented Dec 6, 2016

  • We should add such a file to signal anyway.
    • Add a file that describes what's going on.
  • I think (1), then (3), but I'm fine with (2), then (3).

@kevina
Contributor Author

kevina commented Dec 6, 2016

@jbenet, the read latency overhead really depends on the filesystem. In addition, some file systems (such as ext4 when dir_index is enabled) start to have problems when a directory gets too large. I will get some numbers for you in a bit on why (1) would be a problem now (and wasn't before).

@jbenet
Member

jbenet commented Dec 6, 2016

It's fine -- just go with (2) and add that file.

@kevina
Contributor Author

kevina commented Dec 6, 2016

@jbenet, okay will do. Thanks!

@kevina
Contributor Author

kevina commented Dec 6, 2016

@jbenet @whyrusleeping should we just modify go-ds-flatfs to use a suffix, or create a new version and maybe call it go-ds-flatfs-suffix? (Modifying it to do both will likely create a lot of special-case code I would like to avoid.)

@whyrusleeping
Member

@kevina make the sharding selector a function so we can specify it on creation of the flatfs datastore. This will make migrations so much easier for me.
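A rough sketch of what that could look like; the ShardFunc type and New signature below are illustrative, not the actual go-ds-flatfs API:

    package flatfs

    // ShardFunc maps a block key to the shard directory it is stored under.
    type ShardFunc func(key string) string

    // Datastore is a trimmed-down stand-in for the flatfs datastore.
    type Datastore struct {
        path  string
        shard ShardFunc
    }

    // New opens a flatfs datastore rooted at path using the supplied sharding
    // function. Passing the function in at creation time lets a migration tool
    // open the same directory with either the old prefix function or a new
    // suffix function.
    func New(path string, shard ShardFunc) *Datastore {
        return &Datastore{path: path, shard: shard}
    }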

@kevina
Contributor Author

kevina commented Dec 6, 2016

@whyrusleeping All right I will see if I can make that work.

I am also willing to handle most of the migration.

@kevina
Contributor Author

kevina commented Dec 7, 2016

See ipfs/go-ds-flatfs#6 for a start. The README will be generated by go-ipfs not flatfs as it is go-ipfs specific.

@kevina
Contributor Author

kevina commented Dec 8, 2016

Because we use base32 encoding, the very last byte/digit is not quite as random as I would hope; the number of bits it carries also depends on the length of the binary representation:

Length of key in bits    Bits in last digit
 8                       3
16                       1
24                       4
32                       2
40                       5
...                      ...

When converting the repo created for #3462, which has a mixture of CidV0 and CidV1 raw-leaf keys, a 2 digit suffix got me 256 levels of sharding (instead of the expected 1024), and, although I didn't check, the distribution is unlikely to be even. With additional Cid types this number could go up to 1024 directories, depending on the length of the key.

Increasing the suffix to 3 digits will increase the number of directories to 8192, but that could go up to 32768 directories.

Another possibility (that will be easy for me to implement based on what I have now) is to disregard the very last digit and take the next-to-last 2 digits; then we get an evenly distributed 1024 way sharding. For example, if the key is AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY, the directory used will be CA (instead of AY or CAY).
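A minimal sketch of the next-to-last-2 extraction described above (the helper name is hypothetical, not the final flatfs code):

    package main

    import "fmt"

    // nextToLast returns the n characters immediately before the final
    // character of the key, skipping the low-entropy last base32 digit.
    func nextToLast(key string, n int) string {
        return key[len(key)-n-1 : len(key)-1]
    }

    func main() {
        key := "AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY"
        fmt.Println(nextToLast(key, 2)) // "CA"
    }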

@whyrusleeping what do you think?

Sorry for this oversight.

@kevina
Contributor Author

kevina commented Dec 8, 2016

Okay, I ran some simulations using the keys from the repo created in #3462. I confirmed that using a fixed prefix just won't work when using both CidV0 and CidV1 keys: to make it long enough to handle CidV1, you end up with lots of directories with only a few files in them when CidV0 is used. The following options could work:

How              Data Set                Num Dirs   Max Dirs
Suffix length 2  CidV0                        128       1024
Suffix length 2  CidV0 and Raw Leaves         256       1024
Next to Last 2   CidV0                       1024       1024
Next to Last 2   CidV0 and Raw Leaves        1024       1024
Suffix length 3  CidV0                       4096      32768
Suffix length 3  CidV0 and Raw Leaves        8092      32768

The distribution is reasonably even in all cases (I was wrong in my initial guess).

Here is the list of keys used and various results: https://gateway.ipfs.io/ipfs/QmNbugVGQDcMpDYezAELasskUWyPms9JAjFBRYuJcZtqFC

@kevina
Contributor Author

kevina commented Dec 8, 2016

I would go with either a suffix length of 2 or the next-to-last 2 digits. A suffix length of 3 will create too many directories for an average sized repo, especially if the keys have a length that is a multiple of 5 binary bytes (in which case there could be up to 32768 directories).

We could also make the suffix length a configurable option with the default value being 2. I am not sure what additional complications this will bring though.

@whyrusleeping
Member

Ah, so the last character in the base32 encoded keys doesn't contain an entire 5 bits of data. Given that's the case, let's go ahead and use a suffix length of three.

@kevina
Contributor Author

kevina commented Dec 8, 2016

@whyrusleeping are you sure? That can lead to a large number of directories with only a few files per directory for an average sized repo. If you don't like my next-to-last idea, I would stick with 2 or make it configurable.

@whyrusleeping
Member

yep

@Kubuxu
Member

Kubuxu commented Dec 14, 2016

Another idea: how about hashing the CID with some fast or extremely fast hashing algorithm?
Fast:

  • Murmur3
  • Blake2b/s

Extremely fast, http://www.cse.yorku.ca/~oz/hash.html:

  • djb2
  • sdbm

@kevina
Contributor Author

kevina commented Dec 14, 2016

I am not sure I like the idea of hashing again, @Kubuxu, as it would make it difficult for users to manually find where a key is stored. I would still prefer the next-to-last 2 digits, as that gives a perfect 1024 way fanout no matter what the key is and still makes it possible to manually find the key. I can live with the last 3 digits though.

@Kubuxu
Member

Kubuxu commented Dec 14, 2016

Using the last digits (or N from the last M) will already be quite hard to locate manually, so I don't think manual usability is the first priority.

@Kubuxu
Member

Kubuxu commented Dec 16, 2016

I am not against last X or next to last X, but if next to last X gives us better distribution I think we should go for it.

@daviddias
Member

daviddias commented Dec 16, 2016

Just for clarification, when the word digit is used, what we mean is char, correct?

Hashing again to increase the distribution would bite us in terms of inspection, and would probably force us to add a multihash in front of each block to say which second hash is used, therefore changing the blocks.

Dynamic prefix width would mean we would have to have a way to signal it, or to do several reads.

AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY
I can easily determine that the last 3 digits are CAY and the next-to-last 2 digits are CA; I don't see what is hard about that.

Why not just pick the last 3 if the last char is not as random? Does this create 'too much' sharding?

Update from @whyrusleeping:

The reason for not using 'last 3' is that it could result in having 32k directories under the blocks dir. It's not that the last char is not random; it's that it's not uniformly random.


I would really appreciate it if this PR followed a proposed change in the spec and, even better, provided in the spec repo some example repos with a bunch of blocks in the CIDv0 and CIDv1 space that you use for tests, so that other implementations can do the same.

@Kubuxu
Member

Kubuxu commented Dec 16, 2016

The second hashing would only choose which directory shard the key falls in; there is no need for a multihash there (it can be changed by migration or be flatfs-specific).

chooseShard(key) = djb2(key)[:2]

or something

Lengths and shards, see: #3463 (comment)

The flatfs sharding is not spec'ed out anywhere.
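For reference, djb2 is small enough to sketch; the chooseShard below only illustrates the idea above (it folds the hash into 1024 buckets rather than taking its first two bytes) and is not something flatfs actually does:

    package main

    import "fmt"

    // djb2 is Bernstein's classic string hash: h = h*33 + c, starting from 5381.
    func djb2(s string) uint32 {
        var h uint32 = 5381
        for i := 0; i < len(s); i++ {
            h = h*33 + uint32(s[i])
        }
        return h
    }

    // chooseShard picks one of 1024 shard directories from the hash of the key.
    func chooseShard(key string) string {
        return fmt.Sprintf("%03X", djb2(key)%1024)
    }

    func main() {
        fmt.Println(chooseShard("AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY"))
    }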

@kevina
Contributor Author

kevina commented Dec 16, 2016

@diasdavid yes, by digit I mean a char in the base32 representation; sorry for the confusing terminology. I was trying to avoid the word byte to distinguish it from a byte in the raw binary representation.

@daviddias
Member

daviddias commented Dec 16, 2016

The next-to-last scheme doesn't seem like an unreasonable thing to ask contributors to understand; it just needs to be in the spec :) Consider this my 👍

@kevina
Contributor Author

kevina commented Dec 16, 2016

Okay, it sounds like we are in agreement: use the next-to-last 2 characters to get a nice guaranteed 1024 way fanout no matter what type of Cid is used. @whyrusleeping, agree?

I will update the code and also update the spec.

@whyrusleeping
Member

@kevina sounds good to me, ship it

@jbenet
Member

jbenet commented Dec 19, 2016

  • Last characters still hurt inspection -- when working with the hashes and dirs manually. But fine by me. I'll deal.
  • BTW, I think we want a larger fanout; this may be a good time to do that too. I think very large repos see pain today. How big is the current fanout tuned for?

Separately, it would be nice to make sure the sharding is good for repos of any size. Having a constant amount of sharding is bad because repos of different sizes will perform poorly. I think the right solution starts small, then auto-scales up and "rearranges all the files" when necessary. This is a whole separate discussion though. For The Future.

@jbenet
Member

jbenet commented Dec 19, 2016

Wait, I didn't catch why "the next-to-last 2 chars" instead of "the last 2"-- what was the reason?

@Kubuxu
Member

Kubuxu commented Dec 19, 2016

base32 has a different number of entropy bits in its last character depending on the length of the input, since it converts 5 bytes (40 bits) of input into 8 output characters of 5 bits each. So if your hash is 101 bytes (808 bits) long, the last character carries only 3 bits of entropy, and the last two characters carry 8 bits instead of 10.

@kevina
Contributor Author

kevina commented Dec 19, 2016

@jbenet, so that we are on the same page: the current fanout should be 512; if we implement this as planned it will be 1024.

See #3463 (comment) and my followup comments for the justification for next-to-last. Without it we won't get consistent sharding.

I strongly disagree with having a very large fanout by default. For the average size repo that could lead to lots of directories with only one or two files per directory (like we had before we switched to base32 encoding). I have no problem making the fanout configurable for the uncommon case of huge repos.

@jbenet
Member

jbenet commented Dec 20, 2016

Ah right, nice.

I think the fanout should adjust automatically, but we can get there.

Another option is to make it 2 tiers by default, instead of one.

@kevina
Contributor Author

kevina commented Dec 20, 2016

@jbenet I'm confused. Are you in support of a configurable option? That should be fairly easy to implement.

@whyrusleeping
Member

@jbenet this discussion is about a fairly temporary change, to make sure that we don't end up with horrible performance between now and when we can come up with a better block storage solution.

I'm okay moving forward with the next to last two given the discussion in this thread so far.

@whyrusleeping
Member

Alright, let's move forward with 'next to last two'.

With that, I want two things. I want the README file @jbenet suggested to be generated by the codebase (filling in a template). If you like, I can write up roughly what I think this should look like, but the gist of it is: show an example `echo foobar | ipfs add` to get a hash, then show the series of operations it takes to go from that hash to a block on disk.

We should also add a file with a multicodec that describes the format, so in this case /repo/flatfs/shard-next-to-last/2 or something (@Kubuxu may have input here). This way you can open up a flatfs datastore without having to know what its sharding format is beforehand.

Then, the conversion tool should be able to handle the required changes to both of those files.
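One way the template-filled README could work is a small text/template render; the template text, field names, and values below are made up for illustration:

    package main

    import (
        "os"
        "text/template"
    )

    // readmeTmpl is a stand-in for the real README text discussed in this thread;
    // only the shard descriptor and example placement are templated.
    const readmeTmpl = "Blocks are sharded by the function {{.ShardSpec}}.\n" +
        "For example, key {{.ExampleKey}} is stored under directory {{.ExampleDir}}.\n"

    type readmeData struct {
        ShardSpec  string
        ExampleKey string
        ExampleDir string
    }

    func main() {
        t := template.Must(template.New("readme").Parse(readmeTmpl))
        _ = t.Execute(os.Stdout, readmeData{
            ShardSpec:  "/repo/flatfs/shard-next-to-last/2",
            ExampleKey: "AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY",
            ExampleDir: "CA",
        })
    }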

@kevina
Contributor Author

kevina commented Dec 21, 2016

@whyrusleeping I will create a README and work on updating the spec.

We should also add a file with a multicodec that describes the format

I am unclear what you are after here.

@whyrusleeping
Member

Make sure the README file gets autogenerated based on the flatfs sharding method being used.

For the multicodec, have a file called VERSION or SHARDING or something in the flatfs directory, next to the README, that contains a multicodec (a string) telling us what sharding function is being used there. That way, we can 'open' an existing flatfs datastore without having the user specify the correct sharding function.
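A sketch of how opening could then work; the SHARDING file name matches the suggestion above, but the helper itself is hypothetical:

    package flatfs

    import (
        "os"
        "path/filepath"
        "strings"
    )

    // readShardingFile reads the sharding descriptor stored next to the README,
    // so an existing flatfs directory can be opened without the caller knowing
    // its sharding function in advance.
    func readShardingFile(dir string) (string, error) {
        data, err := os.ReadFile(filepath.Join(dir, "SHARDING"))
        if err != nil {
            return "", err
        }
        return strings.TrimSpace(string(data)), nil
    }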

@kevina
Contributor Author

kevina commented Dec 22, 2016

@whyrusleeping, autogenerating the README was not what I had in mind and will take some additional work. In addition it may be difficult to include all the information @jbenet wants in an auto-generated README, mainly the code for the sharding function in use.

@kevina
Contributor Author

kevina commented Dec 22, 2016

Here is a draft of the README as I believe @jbenet wanted it: https://gist.github.com/kevina/e217dd4c763aaaafdab9657935920da5

@kevina
Contributor Author

kevina commented Dec 22, 2016

For the version file I am going to go with the file name SHARDING and use the string

v1/next-to-last/2

The v1 is to allow for upgrading when we do something more complicated than a simple function. If you want to prefix it with something, maybe:

/repo/flatfs/shard/v1/next-to-last/2
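A sketch of parsing the more verbose form into its parts (names are illustrative, not the final flatfs code):

    package flatfs

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // parseShardSpec splits a descriptor such as
    // "/repo/flatfs/shard/v1/next-to-last/2" (or the short "v1/next-to-last/2")
    // into its version, function name, and numeric parameter.
    func parseShardSpec(spec string) (version, fun string, param int, err error) {
        parts := strings.Split(strings.Trim(spec, "/"), "/")
        if len(parts) < 3 {
            return "", "", 0, fmt.Errorf("malformed shard spec: %q", spec)
        }
        n := len(parts) // the last three elements are <version>/<function>/<param>
        param, err = strconv.Atoi(parts[n-1])
        if err != nil {
            return "", "", 0, fmt.Errorf("bad shard parameter: %v", err)
        }
        return parts[n-3], parts[n-2], param, nil
    }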

@kevina
Contributor Author

kevina commented Dec 22, 2016

@whyrusleeping feel free to modify what I wrote into something that can more easily be autogenerated; I was going off of what @jbenet originally wrote.

@kevina
Contributor Author

kevina commented Dec 22, 2016

See ipfs/go-ds-flatfs#13

@Kubuxu
Member

Kubuxu commented Dec 22, 2016

I am for the more verbose version; it can't hurt and might be useful.

@whyrusleeping
Member

@kevina go ahead and move forward with the one you wrote (including my comments on the gist), but make it a string constant in the source that gets written on datastore creation.

@whyrusleeping
Member

closed by #3608
