This repository has been archived by the owner on Jun 2, 2020. It is now read-only.

Improve CID concept doc for #95 #104

Merged: 3 commits, Aug 21, 2018

rjharmon
Contributor

No description provided.

Collaborator

@Mr0grog Mr0grog left a comment

Thanks for taking the time to add a lot more info to this doc! I’ve left a bunch of feedback inline here.

Also, could you please rewrite this commit to include a license and signoff line as described in our contributor guide?

@@ -5,18 +5,39 @@ menu:
parent: concepts
---

A *content identifier* is a value that addresses a single piece of content in IPFS. It is mainly a cryptographic hash of the content, but is encoded as a [multihash](https://github.com/multiformats/multihash) and [multicodec](https://github.com/multiformats/multicodec). (Note: older CIDs have a different design — see [version 0](#version-0) below.)
## Summary
Collaborator

Would you please remove this heading? See the comment on your pinning PR for explanation: #105 (comment)


<!-- TODO: explain more of the details of how CID v1 is composed here. -->
A *content identifier*, or CID, is a label used for addressing content in IPFS. CID's are used as a standard way of pointing to pieces of information. CID's identify specific pieces of content stored in IPFS.
Collaborator

Hmmm, I feel kind of like each of these sentences is just telling me the same thing again. This would be fine if it was just the first sentence and you dropped the rest. (Side note: “label” is a great term here! Wish I’d thought of that 😄)

CID's

Please don’t use an apostrophe here, it’s not technically correct grammar.

Contributor Author

Maybe this is more redundant than necessary, but the point is to ensure understanding comes across. I'll take another stab.


You can read up on the details in the [CID spec](https://github.com/ipld/cid). You might also want to check out the [CID inspector](http://cid-utils.ipfs.team/#zb2rhiVd5G2DSpnbYtty8NhYHeDvNkPxjSqA7YbDPuhdihj9L) for an interactive breakdown of CIDs.
CID's are based on that content's cryptographic hash - a different piece of content will have a different hash and will produce different CID's.
Collaborator

Can you make “cryptographic hash” here a link to the concept doc on hashes?

I think it also might be helpful to explain a little more about why this matters, e.g. “Because a CID identifies content by what it is rather than by where it is stored, it gives us a way to retrieve the same content from many different peers on the network, rather than just one place — without CIDs, IPFS wouldn’t work at all.”

## Version 1
## Format of CID's

CID's can take a few different forms, each easy for humans and/or software to decode. Any specific CID can be transformed to other equivalent CID representations (for example, using different base, CID version or codec).
Collaborator

Please don’t say “easy.” People in this community have come from a variety of backgrounds and expertise, and what’s easy for some is not so for others. When you use this word, you risk discouraging someone who’s perfectly smart and capable, but had a hard time trying to do this task because they’d never done it before.

Any specific CID can be transformed to other equivalent CID representations

I don’t think we should say this. While you can up-convert a v0 CID to a v1 CID, there’s no explicit guarantee going forward that something similar will necessarily be possible if we ever invent a v2. Also, if you consider the hash as part of the CID, you cannot transform that (e.g. you couldn’t transform a SHA2-based CID to a SHA3-based CID if all you have is the CID).


CID v1 and later are comprised of some leading identifiers making it easy to identify which representation is used, along with the content-hash itself. In v1 and later, these include a multibase identifier, [multicodec](https://github.com/multiformats/multicodec) identifier, and CID version-identifier:
Collaborator

We should not say “and later” (there is no v2 and we have no idea what it might look like if and when it’s invented).

This paragraph and most of the stuff down to the next heading are really particular to the v1 format and don’t apply to v0 at all. I think you should probably just move it all under the “version 1” heading. I’m not sure if we actually need a “format” section alongside a “versions” section, since each version is actually a different format.

I think the only thing we really need to say at this point (before the version sections) is that there are two versions, and IPFS is slowly migrating to v1 by default. (Maybe we could add which commands do v0 by default and which do v1 by default).

Contributor Author

According to the multiformat strategy, v2 (if/when) will at least have a compatible leading identifier to indicate that it is v2. But okay, I'll hold back on that aspect. Thanks.


These leading identifiers provide support for different formats to be used in future versions. Older CIDs have a different design that omits these identifiers — see [version 0](#version-0) below.

Using the first few bytes of the CID, the CID can easily be interpreted; the content can be fetched from IPFS, then decoded with the correct codec. For more details, check out the [CID specification](https://github.com/ipld/cid). It includes a [decoding algorithm](https://github.com/ipld/cid/blob/ef1b2002394b15b1e6c26c30545fd485f2c4c138/README.md#decoding-algorithm) and links to existing software implementations for decoding CID's.
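
To make "the first few bytes" concrete, here is a minimal Python sketch of reading the leading identifiers of a binary CIDv1 (the binary form carries no multibase prefix, which only applies to text encodings). The function names `read_varint` and `parse_binary_cid_v1` are made up for illustration and are not part of any IPFS library; the example CID is built on the spot from the bytes of "Hello" rather than taken from a real node.

```python
import hashlib
from typing import Tuple


def read_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one unsigned varint (7 bits per byte, low bits first)."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return value, offset
        shift += 7


def parse_binary_cid_v1(cid: bytes) -> dict:
    """Split a binary CIDv1 into version, codec, hash function, and digest."""
    version, pos = read_varint(cid)
    if version != 1:
        raise ValueError("this sketch only handles CIDv1")
    codec, pos = read_varint(cid, pos)
    hash_code, pos = read_varint(cid, pos)
    digest_len, pos = read_varint(cid, pos)
    digest = cid[pos:pos + digest_len]
    if len(digest) != digest_len:
        raise ValueError("digest is shorter than its declared length")
    return {
        "version": version,
        "codec": codec,
        "hash_code": hash_code,
        "digest": digest,
    }


# Hypothetical example: version 1, codec 0x55 (raw), hash 0x12 (sha2-256),
# followed by the 32-byte digest (length byte 0x20).
example_cid = bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(b"Hello").digest()
parsed = parse_binary_cid_v1(example_cid)
print(parsed["version"], hex(parsed["codec"]), hex(parsed["hash_code"]))
```

The codec value only tells a reader how to interpret the content once it has been fetched; the digest is the part that identifies the content.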
Collaborator

Same note as above about saying things are “easy” here.


When IPFS was first designed, we used base 58-encoded multihashes as the content identifiers. (This is simpler, but much less flexible than newer CIDs.) It is still used by default when adding files and blocks to IPFS, so you should generally try to support them.
When IPFS was first designed, we specified the consistent use of base 58-encoded multihashes as the content identifiers. While this is simpler, it is also much less flexible than newer CIDs. CIDv0 is still used by default when adding files and blocks to IPFS, so you should generally try to support them.
Collaborator

@Mr0grog Mr0grog Aug 18, 2018

I think this wording (“we specified the consistent use of”) is harder to read (that’s important since there are lots of non-native English speakers in the community — and even more in the community we hope to grow) and doesn’t add any clarity.

Contributor Author

ok, reverting that bit.


The CID specification includes a [decoding algorithm](https://github.com/ipld/cid/blob/ef1b2002394b15b1e6c26c30545fd485f2c4c138/README.md#decoding-algorithm) you can use to distinguish CID v0 from newer versions.
There is a [decoding algorithm](https://github.com/ipld/cid/blob/ef1b2002394b15b1e6c26c30545fd485f2c4c138/README.md#decoding-algorithm) that shows how you can distinguish CID v0 from newer versions.
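
For string CIDs, the linked decoding algorithm treats a 46-character string starting with `Qm` as CID v0 and decodes anything else as multibase. A rough Python sketch of that check follows. `b58encode` and `cid_version` are illustrative names, the sketch only handles the `b` (base32) and `f` (base16) multibase prefixes, and it assumes the version fits in one byte. The v0-shaped string is built from a bare SHA-256 digest, so it is not what `ipfs add` would print for a file containing "Hello".

```python
import base64
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Base58btc-encode bytes with no multibase prefix, as CID v0 strings do."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def cid_version(cid: str) -> int:
    """Return the CID version of a string CID (sketch: base32/base16 only)."""
    if len(cid) == 46 and cid.startswith("Qm"):
        return 0  # bare base58btc sha2-256 multihash
    prefix, payload = cid[0], cid[1:]
    if prefix == "b":  # multibase base32, lowercase, unpadded
        padded = payload.upper() + "=" * (-len(payload) % 8)
        raw = base64.b32decode(padded)
    elif prefix == "f":  # multibase base16, lowercase
        raw = bytes.fromhex(payload)
    else:
        raise ValueError(f"multibase prefix {prefix!r} is not handled in this sketch")
    return raw[0]  # assumption: the version varint is a single byte


# sha2-256 multihash: function code 0x12, digest length 0x20, then the digest.
multihash = bytes([0x12, 0x20]) + hashlib.sha256(b"Hello").digest()
cid_v0 = b58encode(multihash)
# CID v1: version 0x01, codec 0x70 (dag-pb), then the same multihash.
cid_v1_bytes = bytes([0x01, 0x70]) + multihash
cid_v1 = "b" + base64.b32encode(cid_v1_bytes).decode("ascii").lower().rstrip("=")

print(cid_version(cid_v0), cid_version(cid_v1))
```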
Collaborator

Please stick to active voice wherever possible. Avoid things like “there is” and “it is suggested,” etc.

License: MIT
Signed-off-by: Randall Harmon <rjharmon0316@gmail.com>
@rjharmon
Contributor Author

OK, I took another stab and I think I resolved all the concerns you expressed.


<!-- TODO: explain more of the details of how CID v1 is composed here. -->
CIDs are based on the content's [cryptographic hash](concepts/hashes). As a result, any difference in content will produce a different CID. Any IPFS node having the content will be able to match the hash and be able to retrieve the original content.
Contributor Author

Fact-check: will that node be able to match the hash if it has the content?

Collaborator

Mostly. The caveat here is that the node has to have the content stored with the same hash algorithm (NOT the same CID — work is currently happening to make sure different base-encodings of the same hash will point to the same content on disk in ipfs/kubo#5231 — and the codec part of a CID is not involved in lookup, it’s only a hint for parsing the content once it’s found and received).

For example, say you have the content "Hello". If you hash it with SHA-256, you can make differently encoded CIDs of that:

In current versions of IPFS, they will be treated as totally separate content. In the next version (see issue linked above), they’ll match the same content. However, if you used a different hash algorithm:

Those will not match the same content, because they use a different hashing algorithm.
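
A rough Python sketch of the distinction (an illustration, not the exact CIDs from this comment): the same SHA-256 digest of "Hello" is written in two text encodings, and a SHA3-256 digest of the same content yields a different multihash altogether. The helper names are made up, the codes used (0x55 raw, 0x12 sha2-256, 0x16 sha3-256) are assumed to fit in single-byte varints, and only the `b` (base32) and `f` (base16) multibase prefixes appear.

```python
import base64
import hashlib

RAW = 0x55       # multicodec: raw bytes
SHA2_256 = 0x12  # multihash function code
SHA3_256 = 0x16  # multihash function code


def cid_v1_bytes(codec: int, hash_code: int, digest: bytes) -> bytes:
    """Binary CIDv1: version, codec, hash code, digest length, digest."""
    return bytes([0x01, codec, hash_code, len(digest)]) + digest


def to_base32(cid: bytes) -> str:
    """Multibase 'b': lowercase base32 without padding."""
    return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")


def to_base16(cid: bytes) -> str:
    """Multibase 'f': lowercase hexadecimal."""
    return "f" + cid.hex()


content = b"Hello"
sha2_cid = cid_v1_bytes(RAW, SHA2_256, hashlib.sha256(content).digest())
sha3_cid = cid_v1_bytes(RAW, SHA3_256, hashlib.sha3_256(content).digest())

# Same bytes underneath, two different strings on the surface.
as_base32 = to_base32(sha2_cid)
as_base16 = to_base16(sha2_cid)

# Different hash function: the multihash part itself differs.
sha2_multihash = sha2_cid[2:]
sha3_multihash = sha3_cid[2:]

print(as_base32)
print(as_base16)
print(sha2_multihash != sha3_multihash)
```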

Maybe instead of the last sentence, say something like:

Each IPFS node keeps a list of hashes for all the content it stores; when it receives a request for a CID, it extracts the hash from the CID and checks it against the list, then returns the associated content if it’s found.
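
The lookup described in that suggested sentence can be sketched as a toy in Python. This is only a model of the idea, not how any IPFS implementation stores blocks: `ToyBlockstore` and `multihash_of` are hypothetical names, the store always hashes with SHA-256, and binary CIDv1 values with single-byte version and codec fields are assumed.

```python
import hashlib
from typing import Dict, Optional


def multihash_of(cid: bytes) -> bytes:
    """Drop the version and codec bytes of a binary CIDv1, keeping the multihash."""
    if cid[0] != 0x01:
        raise ValueError("this sketch only handles binary CIDv1")
    return cid[2:]  # assumption: version and codec are one byte each


class ToyBlockstore:
    """Keeps content keyed by multihash, so the codec never affects lookup."""

    def __init__(self) -> None:
        self._blocks: Dict[bytes, bytes] = {}

    def put(self, content: bytes) -> bytes:
        # 0x12 = sha2-256, 0x20 = 32-byte digest length.
        multihash = bytes([0x12, 0x20]) + hashlib.sha256(content).digest()
        self._blocks[multihash] = content
        return multihash

    def get(self, cid: bytes) -> Optional[bytes]:
        return self._blocks.get(multihash_of(cid))


store = ToyBlockstore()
stored_multihash = store.put(b"Hello")

raw_cid = bytes([0x01, 0x55]) + stored_multihash     # codec: raw
dag_pb_cid = bytes([0x01, 0x70]) + stored_multihash  # codec: dag-pb
sha3_cid = bytes([0x01, 0x55, 0x16, 0x20]) + hashlib.sha3_256(b"Hello").digest()

print(store.get(raw_cid))     # found: same multihash
print(store.get(dag_pb_cid))  # found: codec is not part of the key
print(store.get(sha3_cid))    # not found: different hash algorithm
```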

Collaborator

@Mr0grog Mr0grog left a comment

This looks pretty good to me! I left some feedback on your question — let me know if you plan to make changes or if I should merge as-is.


@rjharmon
Contributor Author

The behavioral details you mention do make sense, though it could be argued that they can muddy up the key concepts we're trying to clarify. I'll push a patch that I think drives people to a right understanding that can be stable even as some of these technical particulars may evolve. Please let me know what you think, thanks.

@Mr0grog
Collaborator

Mr0grog commented Aug 21, 2018

I think that sounds good. Do you feel like those two points are important enough to be called out more clearly with a bulleted list?

CIDs are based on the content’s cryptographic hash. That means:

  • Any difference in content will produce a different CID and
  • The same piece of content added to two different IPFS nodes using the same settings will produce exactly the same CID.

License: MIT
Signed-off-by: Randall Harmon <rjharmon0316@gmail.com>
@rjharmon
Contributor Author

ok, I adjusted it as you suggested

@Mr0grog
Collaborator

Mr0grog commented Aug 21, 2018

Oh no! I was not trying to make a suggestion. I was actually asking a question.

@rjharmon
Contributor Author

:) I almost bulleted them the first time 'round.

@Mr0grog
Collaborator

Mr0grog commented Aug 21, 2018

Oh! All good then 👍

This includes a variety of small tweaks to spacing, typographical symbols, links, and minor changes to tighten up the language just a bit. Fixes ipfs-inactive#95.

License: MIT
Signed-off-by: Rob Brackett <rob@robbrackett.com>
@Mr0grog Mr0grog merged commit 2c96150 into ipfs-inactive:master Aug 21, 2018
@Mr0grog
Collaborator

Mr0grog commented Aug 21, 2018

Thanks so much for this (and for keeping with it after my absurdly slow review), @rjharmon!

@meiqimichelle
Contributor

+100 on the thanks, I've been watching these concepts docs go through and done a little happy dance every time -- thank you @rjharmon and @Mr0grog !

@rjharmon rjharmon deleted the patch-2 branch August 22, 2018 20:46