The unified multicodecs theory #16

daviddias · 2016-09-25T07:46:39Z

The unified multicodecs theory.

The theory that unites all the self-described multiformats for a beautifully colored 🚲🏚

tl;dr; This PR updates the multicodecs table to incorporate all the multiformats binary packed tables (multihash and multiaddr) into one.

With the introduction of CIDv1, we needed a way to describe several types of data (multicodec-like) in a way that a program could parse (self-described) and that didn't increase the data size by much, and so multicodec-packed was born.

One of the requirements for multicodec-packed is that it needed to be as extensible as the normal multicodec, so that application developers and protocol engineers could add their own multicodec tables (dictionaries) for their own data structures.

Once we added this in, we realized how multihash, multiaddr or even multibase are just users of multicodec-packed with their own custom tables already.

Given this, we found that it is extremely convenient to have all of these tables converge into one, so that we can expand it over time and avoid code clashes in table extensions.

This PR fixes some spec errors and merges the multiaddr, multihash and multibase tables into a new table with a clearer format.

TODO:

Solve the clash issue (udp and sha1 share the same code)
Get the table reviewed by @jbenet

Future work:

~~Bring in all the mimetypes and give them a packed code~~ See The unified multicodecs theory #16 (comment)

daviddias · 2016-09-26T03:55:51Z

README.md

+|                               | /ip4/           |                         | 0x04        | n/a            |
+|                               | /ip6/           |                         | 0x29        | n/a            |
+|                               | /tcp/           |                         | 0x06        | n/a            |
+|                               | /udp/           |                         | 0x11        | n/a            |


clash with sha1

daviddias · 2016-09-26T07:24:54Z

Update

After much discussion (🚲🏚), we've understood that we can improve the clarity of how multicodec, multicodec-packed and multistream are designed, and of their purpose, if we adjust the names of these protocols. In simple terms this means that:

rename: multicodec-packed -> multicodec
rename: multicodec -> multistream
rename: multistream -> multistream-select

This way:

multicodec becomes the protocol that simply defines { codec: code } pairs (e.g: { ip4: 0x04 })
multistream becomes the protocol that gets used for describing streams of data in a human readable way (<varint>/<codec>\n)
multistream-select becomes the protocol for handshaking protocols between two endpoints, which uses multistream to describe the protocol the endpoints want to speak.
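To make the renamed roles concrete, here is a minimal sketch of the two framings. The function names and the small `CODES` dictionary are hypothetical; the `ip4` and `tcp` codes are taken from the table in this PR. The sketch assumes that the multistream length prefix counts the trailing newline, and it only handles values below 0x80 so that the varint fits in one byte.

```python
# Hypothetical excerpt of the table; ip4 = 0x04 and tcp = 0x06 come from this PR.
CODES = {"ip4": 0x04, "tcp": 0x06}


def multicodec_prefix(name: str, payload: bytes) -> bytes:
    """multicodec (formerly multicodec-packed): <code><payload>."""
    code = CODES[name]
    if code >= 0x80:
        raise ValueError("this sketch only handles single-byte codes")
    return bytes([code]) + payload


def multistream_header(path: str) -> bytes:
    """multistream (formerly multicodec): <varint-len><path>\\n.

    Assumption: the length counts the path plus the trailing newline.
    """
    body = path.encode("utf-8") + b"\n"
    if len(body) >= 0x80:
        raise ValueError("this sketch only handles single-byte lengths")
    return bytes([len(body)]) + body


packed = multicodec_prefix("ip4", bytes([127, 0, 0, 1]))
header = multistream_header("/multistream/1.0.0")
```

The packed form costs one byte of overhead, while the human-readable header costs the full path plus two bytes.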

jbenet · 2016-09-26T06:29:18Z

README.md

@@ -38,67 +38,43 @@ Moreover, this self-description allows straightforward layering of protocols wit
 `multicodec` is a _self-describing multiformat_, it wraps other formats with a tiny bit of self-description:

 ```sh
-<multicodec-header><encoded-data>
-# or
-<varint-len><code>\n<encoded-data>


put back the newlines

jbenet · 2016-09-26T06:32:52Z

README.md

+| **Multiformats**                                                                                         |
+| 0x182f6d756c7469636f6465632f  | /multicodec/    |                         | 0x30        | n/a            |
+| 0x162f6d756c7469686173682f    | /multihash/     |                         | 0x31        | n/a            |
+| 0x162f6d756c7469616464722f    | /multiaddr/     |                         | 0x32        | n/a            |


add multibase

jbenet · 2016-09-26T08:54:02Z

README.md

+| **IPLD formats**                                        |
+| dag-pb          | MerkleDAG protobuf      |             |
+| dag-cbor        | MerkleDAG cbor          |             |
+| eth-rlp         | Ethereum Block RLP      |             |


add:

git

bitcoin

stellar

daviddias · 2016-10-02T08:41:11Z

@jbenet I'm making js-cid complete (as in, being able to handle bs58(multihash), raw multihash, cid, cidStr and cid construction by parts (version, codec, multihash)) and I found myself in a position of 'luck', because neither CID v0 nor CID v1 clashes with the multihash table.

For now it is ok, because multihash only starts at 0x11, which gives us 14 more CID versions (if we ever need to change it). However, a thought crossed my mind that CID code should be governed as a multicodec, so that we can then update the CID table, add another version that doesn't represent an incremental update to the previous number, so that all the parsers implemented meanwhile do not break.

whyrusleeping · 2016-10-12T16:01:20Z

README.md

+IPLD formats
+dag-pb,             MerkleDAG protobuf,       0x70
+dag-cbor,           MerkleDAG cbor,           0x71
+eth-block,          Ethereum Block (RLP),     0x90


I don't think these should be above 0x80.

why is that?

Because that means they take up two bytes when encoded. Anything that is going to be commonly used we would really love to have take only a single byte.

I noticed sctp is also above 0x80 (0x84)

Transport multicodecs are 'more okay' because they get transferred less; on the other hand, format multicodecs get transferred every time a block is transferred, so a byte actually means a lot.

@whyrusleeping is this still a concern? I believe it stopped being one as soon as you merged IPLD in go-ipfs into master, correct?

and, there are simply more than 127 things we "really care about". so 2 bytes is in the long run unavoidable.
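The size concern in this thread follows from how an unsigned varint is laid out: each byte carries 7 bits of the value, so any code at or above 0x80 needs a second byte. A minimal sketch, with a hypothetical `encode_uvarint` helper that is not taken from any of the repositories discussed here:

```python
def encode_uvarint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint (7 bits per byte,
    least-significant group first, high bit set on every byte but the last)."""
    if n < 0:
        raise ValueError("unsigned varints cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# Codes from the table under review.
sizes = {
    "dag-pb (0x70)": len(encode_uvarint(0x70)),
    "dag-cbor (0x71)": len(encode_uvarint(0x71)),
    "eth-block (0x90)": len(encode_uvarint(0x90)),
    "sctp (0x84)": len(encode_uvarint(0x84)),
}
```

Under this encoding `dag-pb` and `dag-cbor` take one byte each, while `eth-block` and `sctp` take two.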

whyrusleeping · 2016-10-12T16:03:39Z

README.md

+codec,              description,              code
+
+miscelaneous
+bin,                raw binary,               0x00


I'm not sure i like zero being 'binary'. Zero is a default value in a lot of cases and this makes it difficult to tell between 'unknown/invalid' and 'binary'

Binary could be 0x55, which is 0b01010101 and is also the value of each byte of the Ethernet frame preamble.

whyrusleeping · 2016-10-12T16:06:31Z

@diasdavid can we sit down today and finish this? It's blocking my work on ipld

daviddias · 2016-10-12T17:41:29Z

@whyrusleeping sure, why is it blocking though? The IPLD formats have codes.

whyrusleeping · 2016-10-12T20:06:29Z

I'm taking 0x22 for murmur3

wanderer · 2016-10-12T21:56:47Z

why not reserve the top level table only for the different encoding types? so something like

0x00 an encoding table
0x01 bases encodings
0x02 serialization formats
0x03 multicodec
0x04 multihash
0x05 multiaddr
0x06 multibase
0x07 multihashes
0x08 multiaddrs
0x09 archiving formats
0x10 IPLD formats
....
etc.

0x00 would be special. What would follow would be a custom encoding table that maps strings to their binary representation. This would allow apps to dynamically create new tables.

wanderer · 2016-10-13T08:33:31Z

own multicodec tables (dictionaries) of their own data structures.

@diasdavid is there a proposed format for the dictionaries?

EDIT: i opened an issue on it here #18

daviddias · 2016-11-09T13:14:27Z

sha1 and udp clash solution

I've added an extra 0 byte to the udp multicodec, so that it doesn't collide with sha1 anymore. It is more convenient to change udp than sha1, as changing sha1 would mean changing all the multihashes that use it.

~~Adding an extra 0 byte is enough to differentiate it, but keeping the value 11 on the code.~~

UPDATE:

@Kubuxu made a very valuable point that I missed entirely here (#16 (comment)), so udp moves to 0x0111 instead of 0x0011.
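A quick way to see why 0x0011 does not resolve the clash while 0x0111 does is to treat the codes as plain integers and look for duplicates. The `find_clashes` helper and the four-entry table below are hypothetical; the code values are the ones quoted in this thread.

```python
from collections import defaultdict


def find_clashes(table: dict) -> dict:
    """Return {code: [names]} for every code assigned to more than one name."""
    by_code = defaultdict(list)
    for name, code in table.items():
        by_code[code].append(name)
    return {code: sorted(names) for code, names in by_code.items() if len(names) > 1}


# As integers, 0x0011 and 0x11 are the same number (17).
before = {"sha1": 0x11, "udp": 0x0011, "ip4": 0x04, "tcp": 0x06}
# Moving udp to 0x0111 (273) gives it a distinct integer value.
after = {**before, "udp": 0x0111}

clashes_before = find_clashes(before)
clashes_after = find_clashes(after)
```

With the padded value the table still reports `sha1` and `udp` under code 17; with 0x0111 it reports no clashes.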

daviddias · 2016-11-09T13:15:52Z

Stamp remaining multicodecs

I'm feeling resistant to stamp the remaining multicodecs only by following rules of thumb. Is there any reasoning we can use to pick the remaining unspecified codes? This is the last item in order to merge this PR

Kubuxu · 2016-11-09T14:56:41Z

update: SOLVED

I think that specifying that 0x0011 and 0x11 are different things is a really bad idea. It means that we can't use normal varint libraries and will need to craft our own. Our spec doesn't specify it, and then it isn't really an integer; it is a variable-length bytestream where the highest bit says "there is data still coming".

I asked the question about this quite a time ago: multiformats/unsigned-varint#5
but it was about whether we should canonicalize such misformatted varints before storing them in, for example, CBOR notation, because if we don't, the same object has infinitely many binary CIDs.

Summing up, I think we shouldn't abuse varints in this way. We should specify that the number is prefixed with infinitely many leading zeros and that the canonical form is the one that takes the minimum number of bytes to express the number. Then, before crafting IPLD objects, we would canonicalize such misformatted CIDs, which should be a fast and easy operation.
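The canonical-form argument can be sketched as follows. The helpers are hypothetical and the decoder is deliberately lenient (it accepts padded encodings); this is an illustration of the proposal, not the behaviour of any particular unsigned-varint library.

```python
def encode_uvarint(n: int) -> bytes:
    """Minimal (canonical) unsigned varint encoding of a non-negative integer."""
    if n < 0:
        raise ValueError("unsigned varints cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_uvarint(buf: bytes):
    """Lenient decoder: returns (value, bytes_consumed), accepting padded forms."""
    value = 0
    shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")


def canonicalize(buf: bytes) -> bytes:
    """Re-encode the leading varint in its shortest form, keeping the rest as-is."""
    value, used = decode_uvarint(buf)
    return encode_uvarint(value) + buf[used:]


# b"\x11" and the padded b"\x91\x00" both decode to 17.
padded_value, padded_len = decode_uvarint(b"\x91\x00")
canonical = canonicalize(b"\x91\x00" + b"payload")
```

Because a lenient decoder maps every padded form to the same integer, two codes that differ only by leading zeros cannot be told apart, and canonicalizing before hashing keeps one object from having many binary representations.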

jbenet · 2016-11-12T20:12:41Z

@Kubuxu can you follow up in multiformats/unsigned-varint#5, or another issue in that repo, with examples? i'm not understanding what you're getting at. Either way, this spec uses https://github.com/multiformats/unsigned-varint so whatever that does, this will do too. let's resolve it there.

jbenet · 2016-11-12T20:33:06Z

@diasdavid

Is there any reasoning we can use to pick the remaining unspecified codes?

I don't think there's a nice function for assigning these numbers. unfortunately it's a nasty problem, because there's a bunch of "domain specific" functions that may make sense-- but don't make sense throughout.

A function i can think of that works uniformly well (even if not very well) is "first come first assign". meaning, a queue, assign them as they're requested. this is certainly going to make people unhappy about having to use 2 bytes for a commonly used function, but it's unavoidable anyway given there's definitely more than 127 common things users will care about...

So, no. i can't think of a nice way that is not "domain specific and non-universal" or "rules of thumb" :( --- but there may be one if people want to keep thinking :)

jbenet · 2016-11-12T20:33:38Z

@diasdavid what is the diff here? what protocols changed value in the table? (i.e. what will break, if anything? I see UDP, is that the only one)

jbenet · 2016-11-12T21:11:57Z

README.md

+bitcoin-block,      Bitcoin Block,            0xb0
+bitcoin-tx,         Bitcoin Tx,               0xb1
+stellar-block,      Stellar Block,            0xd0
+stellar-tx,         Stellar Tx,               0xd1
 ```


this table should move to a CSV, and the readme should point to it or embed it.

jbenet · 2016-11-12T21:13:12Z

curious why "binary has been migrated from 0x00 to 0x55" -- what was the reasoning? (sgtm, jw)

Kubuxu · 2016-11-12T21:14:50Z

@jbenet #16 (comment)

but there was discussion elsewhere on it too.

daviddias · 2016-11-13T05:00:09Z

@diasdavid what is the diff here? what protocols changed value in the table? (i.e. what will break, if anything? I see UDP, is that the only one)

Yes, that one and the binary which moved to 0x55 as @Kubuxu mentioned and got stamped in go-ipfs.

"first come first assign".

Sounds good to me, will add a note to that in the readme.

daviddias · 2016-11-13T05:25:20Z

All right, this looks ready to merge :)

I'll give it a day and merge it Monday (tomorrow)

daviddias · 2016-11-16T11:38:28Z

All right, I gave it 3 days, going for the merge 🎉🎉

It was changed from 17 to 273 in multiformats/multicodec#16.

BREAKING CHANGE: The UDP code was changed in the multicodec table. The UDP code is now `273` instead of `17`. For the full discussion of this change please see multiformats/multicodec#16. Fixes #17.

daviddias mentioned this pull request Sep 25, 2016

docs: add codecs for serialization formats #15

Closed

daviddias force-pushed the spec/update branch from 05dbfb1 to 39712a2 Compare September 26, 2016 02:39

daviddias changed the title ~~spec/update~~ The unified multicodecs theory Sep 26, 2016

daviddias force-pushed the spec/update branch from 9431794 to ad444ff Compare September 26, 2016 06:33

This was referenced Sep 26, 2016

rename to go-multistream-select multiformats/go-multistream#11

Closed

rename to js-multistream-select multiformats/js-multistream-select#23

Closed

jbenet reviewed Sep 26, 2016

daviddias added a commit that referenced this pull request Sep 26, 2016

update with regards to #16

48f7366

daviddias force-pushed the spec/update branch from 48f7366 to f2d877e Compare September 26, 2016 08:07

daviddias added a commit that referenced this pull request Sep 26, 2016

update with regards to #16

f2d877e

daviddias force-pushed the spec/update branch from f2d877e to b9eba22 Compare September 26, 2016 08:10

daviddias added a commit that referenced this pull request Sep 26, 2016

update with regards to #16

b9eba22

daviddias force-pushed the spec/update branch from b9eba22 to a0eab30 Compare September 26, 2016 08:18

daviddias added a commit that referenced this pull request Sep 26, 2016

update with regards to #16

a0eab30

jbenet reviewed Sep 26, 2016

wanderer mentioned this pull request Sep 29, 2016

Map the entire Ethereum State into IPFS with ipld ipld/specs#27

Closed

daviddias mentioned this pull request Sep 29, 2016

IPFS & libp2p <3 Ethereum - Roadmap of things to get everything working together nicely 👌 ipfs/notes#173

Open

10 tasks

This was referenced Oct 2, 2016

fix: getCodec, return codec multiformats/js-multicodec#3

Merged

Awesome IPLD endeavour ipld/js-ipld#60

Merged

Captain.log - IPLD v1 spec ipld/specs#13

Closed

whyrusleeping requested changes Oct 12, 2016

whyrusleeping reviewed Oct 12, 2016

resolve udp sha1 clash

ce94334

move udp from 0x0011 to 0x0111

6d7f49d

Kubuxu mentioned this pull request Nov 12, 2016

Canonical form multiformats/unsigned-varint#5

Closed

jbenet reviewed Nov 12, 2016

daviddias removed the needs review label Nov 13, 2016

daviddias force-pushed the spec/update branch 2 times, most recently from 3a6f28e to 76a23f7 Compare November 13, 2016 05:30

break multicodec table into its own file, add a note of how to add ne…

0c9d2df

…w codes

daviddias force-pushed the spec/update branch from 76a23f7 to 0c9d2df Compare November 13, 2016 05:31

daviddias merged commit d6e0ec1 into master Nov 16, 2016

ghost mentioned this pull request Nov 17, 2016

Dead link to multicodec-packed ipld/ipld#6

Closed

ghost mentioned this pull request Mar 26, 2017

Integrating IPFS (Haskell Implementation) MatrixAI/Forge-Package-Archiving#1

Closed

llopv mentioned this pull request Jul 6, 2017

The link to multicodec-packed.md is broken. multiformats/multistream#4

Closed

This was referenced Nov 28, 2018

fix udp code multiformats/multiaddr#78

Merged

UDP code is incorrect multiformats/js-multiaddr#72

Closed

Stebalien added a commit to multiformats/rust-multiaddr that referenced this pull request Nov 28, 2018

fix udp protocol code

839cb58

It was changed from 17 to 273 in multiformats/multicodec#16.

This was referenced Nov 28, 2018

fix udp protocol code multiformats/rust-multiaddr#32

Merged

fix UDP code multiformats/js-multiaddr#73

Merged
