Improve CID concept doc for #95 #104

rjharmon · 2018-08-11T19:16:45Z

No description provided.

Mr0grog

Thanks for taking the time to add a lot more info to this doc! I’ve left a bunch of feedback inline here.

Also, could you please rewrite this commit to include a license and signoff line as described in our contributor guide?

Mr0grog · 2018-08-18T00:21:45Z

content/guides/concepts/cid.md

@@ -5,18 +5,39 @@ menu:
        parent: concepts
 ---

-A *content identifier* is a value that addresses a single piece of content in IPFS. It is mainly a cryptographic hash of the content, but is encoded as a [multihash](https://github.com/multiformats/multihash) and [multicodec](https://github.com/multiformats/multicodec). (Note: older CIDs have a different design — see [version 0](#version-0) below.)
+## Summary


Would you please remove this heading? See the comment on your pinning PR for explanation: #105 (comment)

Mr0grog · 2018-08-18T00:24:47Z

content/guides/concepts/cid.md


-<!-- TODO: explain more of the details of how CID v1 is composed here. -->
+A *content identifier*, or CID, is a label used for addressing content in IPFS. CID's are used as a standard way of pointing to pieces of information.  CID's identify specific pieces of content stored in IPFS.


Hmmm, I feel kind of like each of these sentences is just telling me the same thing again. This would be fine if it was just the first sentence and you dropped the rest. (Side note: “label” is a great term here! Wish I’d thought of that 😄)

CID's

Please don’t use an apostrophe here, it’s not technically correct grammar.

Maybe this is more redundant than necessary, but the point is to ensure understanding comes across. I'll take another stab.

Mr0grog · 2018-08-18T00:29:38Z

content/guides/concepts/cid.md


-You can read up on the details in the [CID spec](https://github.com/ipld/cid). You might also want to check out the [CID inspector](http://cid-utils.ipfs.team/#zb2rhiVd5G2DSpnbYtty8NhYHeDvNkPxjSqA7YbDPuhdihj9L) for an interactive breakdown of CIDs.
+CID's are based on that content's cryptographic hash - a different piece of content will have a different hash and will produce different CID's.


Can you make “cryptographic hash” here a link to the concept doc on hashes?

I think it also might be helpful to explain a little more about why this matters, e.g. “Because a CID identifies content by what it is rather than by where it is stored, it gives us a way to retrieve the same content from many different peers on the network, rather than just one place — without CIDs, IPFS wouldn’t work at all.”

Mr0grog · 2018-08-18T00:41:15Z

content/guides/concepts/cid.md

-## Version 1
+## Format of CID's
+
+CID's can take a few different forms, each easy for humans and/or software to decode.  Any specific CID can be transformed to other equivalent CID representations (for example, using different base, CID version or codec).  


Please don’t say “easy.” People in this community have come from a variety of backgrounds and expertise, and what’s easy for some is not so for others. When you use this word, you risk discouraging someone who’s perfectly smart and capable, but had a hard time trying to do this task because they’d never done it before.

Any specific CID can be transformed to other equivalent CID representations

I don’t think we should say this. While you can up-convert a v0 CID to a v1 CID, there’s no explicit guarantee going forward that something similar will necessarily be possible if we ever invent a v2. Also, if you consider the hash as part of the CID, you cannot transform that (e.g. you couldn’t transform a SHA2-based CID to a SHA3-based CID if all you have is the CID).
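To make the up-conversion point concrete, here is a minimal sketch (not taken from any IPFS implementation) of re-wrapping a v0 CID as a base32 v1 CID. It assumes the v0 CID is a bare base58btc SHA2-256 multihash and that the v1 form uses the dag-pb codec (`0x70`); the helper names and the example input are hypothetical. Note that the digest is copied unchanged: nothing here could produce a SHA3 digest from the CID alone.

```python
import base64
import hashlib

# Bitcoin-style base58 alphabet (base58btc).
B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes with the base58btc alphabet."""
    num = int.from_bytes(data, "big")
    out = ""
    while num:
        num, rem = divmod(num, 58)
        out = B58[rem] + out
    # Each leading zero byte is written as a leading '1'.
    zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * zeros + out


def b58decode(text: str) -> bytes:
    """Decode a base58btc string back to bytes."""
    num = 0
    for ch in text:
        num = num * 58 + B58.index(ch)
    zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * zeros + num.to_bytes((num.bit_length() + 7) // 8, "big")


def upgrade_v0_to_v1(cid_v0: str) -> str:
    """Re-wrap the multihash of a v0 CID as a base32 v1 CID."""
    multihash = b58decode(cid_v0)
    # A v0 CID is a SHA2-256 multihash: 0x12 (function), 0x20 (32-byte length).
    if len(multihash) != 34 or multihash[:2] != b"\x12\x20":
        raise ValueError("not a v0 CID")
    cid_bytes = b"\x01\x70" + multihash  # version 1, dag-pb codec (assumed)
    text = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + text  # 'b' is the multibase prefix for lowercase base32


# Illustrative input only: the hash of arbitrary bytes, not a real IPFS block.
example_v0 = b58encode(b"\x12\x20" + hashlib.sha256(b"example block").digest())
example_v1 = upgrade_v0_to_v1(example_v0)
print(example_v0)
print(example_v1)
```

The conversion only changes the wrapping around the digest, which is why it works in this direction without access to the content itself.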

Mr0grog · 2018-08-18T00:48:46Z

content/guides/concepts/cid.md

+
+CID's can take a few different forms, each easy for humans and/or software to decode.  Any specific CID can be transformed to other equivalent CID representations (for example, using different base, CID version or codec).  
+
+CID v1 and later are comprised of some leading identifiers making it easy to identify which representation is used, along with the content-hash itself. In v1 and later, these include a multibase identifier, [multicodec](https://github.com/multiformats/multicodec) identifier, and CID version-identifier:


We should not say “and later” (there is no v2 and we have no idea what it might look like if and when it’s invented).

This paragraph and most of the stuff down to the next heading are really particular to the v1 format and don’t apply to v0 at all. I think you should probably just move it all under the “version 1” heading. I’m not sure if we actually need a “format” section alongside a “versions” section, since each version is actually a different format.

I think the only thing we really need to say at this point (before the version sections) is that there are two versions, and IPFS is slowly migrating to v1 by default. (Maybe we could add which commands do v0 by default and which do v1 by default).

According to the multiformat strategy, v2 (if/when) will at least have a compatible leading identifier to indicate that it is v2. But okay, I'll hold back on that aspect. Thanks.
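For readers following the discussion of the v1 leading identifiers, the layout can be sketched as below. This is an illustration under assumptions, not the reference decoder: it only handles the lowercase base32 multibase prefix (`b`), and the function names are made up for this example. The CID string is the base32 one quoted later in this thread.

```python
import base64


def read_varint(data: bytes, pos: int):
    """Read one unsigned varint; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def parse_cid_v1(cid: str) -> dict:
    """Split a base32 v1 CID into its leading identifiers and digest."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 ('b') CIDs")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding that multibase omits
    raw = base64.b32decode(body)
    version, pos = read_varint(raw, 0)
    codec, pos = read_varint(raw, pos)
    hash_fn, pos = read_varint(raw, pos)
    digest_len, pos = read_varint(raw, pos)
    return {
        "multibase": cid[0],
        "version": version,
        "codec": codec,
        "hash_fn": hash_fn,
        "digest_len": digest_len,
        "digest": raw[pos:],
    }


info = parse_cid_v1("bafkreidgubc3iuqqfrm5qqhmbf6vtwkgpyj2h42pmskokop72mwbxm27da")
print(info["version"], hex(info["codec"]), hex(info["hash_fn"]), info["digest_len"])
```

The multibase prefix lives in the text form only; the version, codec, and multihash header are the first bytes of the decoded binary CID.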

Mr0grog · 2018-08-18T00:51:28Z

content/guides/concepts/cid.md

+
+These leading identifiers provide support for different formats to be used in future versions.  Older CIDs have a different design that omits these identifiers — see [version 0](#version-0) below.
+
+Using the first few bytes of the CID, the CID can easily be interpreted; the content can be fetched from IPFS, then decoded with the correct codec. For more details, check out the [CID specification](https://github.com/ipld/cid).  It includes a [decoding algorithm](https://github.com/ipld/cid/blob/ef1b2002394b15b1e6c26c30545fd485f2c4c138/README.md#decoding-algorithm) and links to existing software implementations for decoding CID's.


Same note as above about saying things are “easy” here.

Mr0grog · 2018-08-18T00:55:16Z

content/guides/concepts/cid.md


-When IPFS was first designed, we used base 58-encoded multihashes as the content identifiers. (This is simpler, but much less flexible than newer CIDs.) It is still used by default when adding files and blocks to IPFS, so you should generally try to support them.
+When IPFS was first designed, we specified the consistent use of base 58-encoded multihashes as the content identifiers.  While this is s simpler, it is also much less flexible than newer CIDs.  CIDv0 is still used by default when adding files and blocks to IPFS, so you should generally try to support them.


I think this wording (“we specified the consistent use of”) is harder to read (that’s important since there are lots of non-native English speakers in the community — and even more in the community we hope to grow) and doesn’t add any clarity.

ok, reverting that bit.

Mr0grog · 2018-08-18T00:56:07Z

content/guides/concepts/cid.md


-The CID specification includes a [decoding algorithm](https://github.com/ipld/cid/blob/ef1b2002394b15b1e6c26c30545fd485f2c4c138/README.md#decoding-algorithm) you can use to distinguish CID v0 from newer versions.
+There is a [decoding algorithm](https://github.com/ipld/cid/blob/ef1b2002394b15b1e6c26c30545fd485f2c4c138/README.md#decoding-algorithm) that shows how you can distinguish CID v0 from newer versions.


Please stick to active voice wherever possible. Avoid things like “there is” and “it is suggested,” etc.
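As a rough illustration of what the linked decoding algorithm checks when telling v0 apart from newer versions, here is a hedged sketch. It follows the rules as I understand them from the CID spec (a 46-character string starting with `Qm`, or 34 bytes starting with `0x12 0x20`, is v0; otherwise the first varint is the version). The function names are hypothetical and multibase decoding is left out.

```python
def looks_like_v0_string(cid: str) -> bool:
    """A v0 CID in text form is 46 base58btc characters starting with 'Qm'."""
    return len(cid) == 46 and cid.startswith("Qm")


def version_of_binary_cid(raw: bytes) -> int:
    """Return 0 or 1 for a binary CID, or raise if it is neither."""
    # A v0 CID is exactly a SHA2-256 multihash: 0x12, 0x20, then 32 digest bytes.
    if len(raw) == 34 and raw[:2] == b"\x12\x20":
        return 0
    # Otherwise the first unsigned varint is the version number.
    version = 0
    shift = 0
    for byte in raw:
        version |= (byte & 0x7F) << shift
        if byte < 0x80:
            break
        shift += 7
    if version != 1:
        raise ValueError("unsupported or malformed CID")
    return 1


print(version_of_binary_cid(b"\x12\x20" + bytes(32)))
print(version_of_binary_cid(b"\x01\x55\x12\x20" + bytes(32)))
```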

License: MIT Signed-off-by: Randall Harmon <rjharmon0316@gmail.com>

rjharmon · 2018-08-18T19:50:07Z

OK, I took another stab and I think I resolved all the concerns you expressed.

rjharmon · 2018-08-18T19:51:41Z

content/guides/concepts/cid.md


-<!-- TODO: explain more of the details of how CID v1 is composed here. -->
+CIDs are based on the content's [cryptographic hash](concepts/hashes).  As a result, any difference in content will produce a different CID.  Any IPFS node having the content will be able to match the hash and be able to retrieve the original content.  


Fact-check: will that node be able to match the hash if it has the content?

Mostly. The caveat here is that the node has to have the content stored with the same hash algorithm (NOT the same CID — work is currently happening to make sure different base-encodings of the same hash will point to the same content on disk in ipfs/kubo#5231 — and the codec part of a CID is not involved in lookup, it’s only a hint for parsing the content once it’s found and received).

For example, say you have the content "Hello". If you hash it with SHA-256, you can make differently encoded CIDs of that:

Base 58 (default): zb2rhdYtXM8X3Jfsm6VrmXnmcSqtfgHZbhYRJ32ENkmARL78K

Base 32: bafkreidgubc3iuqqfrm5qqhmbf6vtwkgpyj2h42pmskokop72mwbxm27da

In current versions of IPFS, they will be treated as totally separate content. In the next version (see issue linked above), they’ll match the same content. However, if you used a different hash algorithm:

SHA-256 Base 58: zb2rhdYtXM8X3Jfsm6VrmXnmcSqtfgHZbhYRJ32ENkmARL78K (same as above)

SHA-512 Base 58: zB7NCgi6ywUVNus2k55hDjeyHPmJrtY3eMB8S9T2Unjyp9Lc3H9yvTq6b4nXieFCcg4awXzKRrSo5GULNR1TMtSbNhGKp

Those will not match the same content, because they use a different hashing algorithm.
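The two comparisons above can be reproduced with a small sketch. It assumes a v1 CID with the raw codec (`0x55`) and uses hypothetical helper names; the printed strings depend on the exact bytes hashed (a trailing newline, for instance, changes the digest), so they are not expected to match the ones quoted here unless the input is identical.

```python
import base64
import hashlib

B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

# Multihash function codes for the two algorithms compared in this sketch.
HASHES = {"sha2-256": (0x12, hashlib.sha256), "sha2-512": (0x13, hashlib.sha512)}


def b58encode(data: bytes) -> str:
    """Encode bytes with the base58btc alphabet."""
    num = int.from_bytes(data, "big")
    out = ""
    while num:
        num, rem = divmod(num, 58)
        out = B58[rem] + out
    zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * zeros + out


def cid_v1_raw(content: bytes, hash_name: str) -> bytes:
    """Binary v1 CID with the raw codec: version, codec, multihash."""
    code, hash_fn = HASHES[hash_name]
    digest = hash_fn(content).digest()
    # Both digest lengths (32 and 64) fit in a single varint byte.
    return bytes([0x01, 0x55, code, len(digest)]) + digest


def to_text(cid: bytes, base: str) -> str:
    """Render a binary CID with a multibase prefix."""
    if base == "base32":
        return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")
    if base == "base58btc":
        return "z" + b58encode(cid)
    raise ValueError("base not covered by this sketch")


content = b"Hello"
sha256_cid = cid_v1_raw(content, "sha2-256")
sha512_cid = cid_v1_raw(content, "sha2-512")

# Same bytes underneath, two different text encodings.
print(to_text(sha256_cid, "base58btc"))
print(to_text(sha256_cid, "base32"))
# A different hash function gives a different multihash altogether.
print(to_text(sha512_cid, "base58btc"))
```

The first two strings decode to the same binary CID, while the third carries a different multihash, which is the distinction the comment draws.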

Maybe instead of the last sentence, say something like:

Each IPFS node keeps a list of hashes for all the content it stores; when it receives a request for a CID, it extracts the hash from the CID and checks it against the list, then returns the associated content if it’s found.
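A toy model of that lookup, purely for illustration: the class and function names are invented, the store is an in-memory dictionary rather than anything resembling the real on-disk blockstore, and it assumes the version and codec each fit in one byte.

```python
import hashlib


def extract_multihash(cid: bytes) -> bytes:
    """Strip the version and codec from a binary CID, leaving the multihash."""
    if len(cid) == 34 and cid[:2] == b"\x12\x20":
        return cid  # a v0 CID is already a bare multihash
    # Assumes single-byte version and codec varints (true for 0x55 and 0x70).
    return cid[2:]


class ToyBlockstore:
    """Keeps content indexed by multihash, ignoring codec and encoding."""

    def __init__(self):
        self._blocks = {}

    def put(self, content: bytes) -> bytes:
        multihash = b"\x12\x20" + hashlib.sha256(content).digest()
        self._blocks[multihash] = content
        return multihash

    def get(self, cid: bytes):
        return self._blocks.get(extract_multihash(cid))


store = ToyBlockstore()
multihash = store.put(b"Hello")

# The codec is only a parsing hint, so both v1 CIDs find the same content.
print(store.get(b"\x01\x55" + multihash))
print(store.get(b"\x01\x70" + multihash))

# A CID built from a SHA2-512 digest does not match what was stored.
sha512_multihash = b"\x13\x40" + hashlib.sha512(b"Hello").digest()
print(store.get(b"\x01\x55" + sha512_multihash))
```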

Mr0grog

This looks pretty good to me! I left some feedback on your question — let me know if you plan to make changes or if I should merge as-is.

Mr0grog · 2018-08-20T20:28:38Z


rjharmon · 2018-08-20T23:11:30Z

The behavioral details you mention do make sense, though it could be argued that they can muddy up the key concepts we're trying to clarify. I'll push a patch that I think drives people to a right understanding that can be stable even as some of these technical particulars may evolve. Please let me know what you think, thanks.

Mr0grog · 2018-08-21T03:37:30Z

I think that sounds good. Do you feel like those two points are important enough to be called out more clearly with a bulleted list?

CIDs are based on the content’s cryptographic hash. That means:

- Any difference in content will produce a different CID and
- The same piece of content added to two different IPFS nodes using the same settings will produce exactly the same CID.

License: MIT Signed-off-by: Randall Harmon <rjharmon0316@gmail.com>

rjharmon · 2018-08-21T19:07:21Z

ok, I adjusted it as you suggested

Mr0grog · 2018-08-21T19:22:31Z

Oh no! I was not trying to make a suggestion. I was actually asking a question.

rjharmon · 2018-08-21T19:25:01Z

:) I almost bulleted them the first time 'round.

Mr0grog · 2018-08-21T19:25:49Z

Oh! All good then 👍

This includes a variety of small tweaks to spacing, typographical symbols, links, and minor changes to tighten up the language just a bit. Fixes ipfs-inactive#95. License: MIT Signed-off-by: Rob Brackett <rob@robbrackett.com>

Mr0grog · 2018-08-21T23:49:29Z

Thanks so much for this (and for keeping with it after my absurdly slow review), @rjharmon!

meiqimichelle · 2018-08-22T00:04:52Z

+100 on the thanks, I've been watching these concepts docs go through and done a little happy dance every time -- thank you @rjharmon and @Mr0grog !

Mr0grog suggested changes Aug 18, 2018

rjharmon force-pushed the patch-2 branch from 9b6a698 to 47bcb02 Compare August 18, 2018 19:47

Improve CID concept doc for ipfs-inactive#95

b6f48b3

License: MIT Signed-off-by: Randall Harmon <rjharmon0316@gmail.com>

rjharmon force-pushed the patch-2 branch from 47bcb02 to b6f48b3 Compare August 18, 2018 19:49

Mr0grog approved these changes Aug 20, 2018

docs: CIDs as deterministic

78e47db

License: MIT Signed-off-by: Randall Harmon <rjharmon0316@gmail.com>

rjharmon force-pushed the patch-2 branch from 7030801 to 78e47db Compare August 21, 2018 19:06

Clean up CID concept doc typography, links, etc.

4eb95db

This includes a variety of small tweaks to spacing, typographical symbols, links, and minor changes to tighten up the language just a bit. Fixes ipfs-inactive#95. License: MIT Signed-off-by: Rob Brackett <rob@robbrackett.com>

Mr0grog merged commit 2c96150 into ipfs-inactive:master Aug 21, 2018

rjharmon deleted the patch-2 branch August 22, 2018 20:46
