Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add fr32-sha2-256-trunc254-padded-binary-tree multihash #331

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

aschmahmann
Copy link

This PR reserves a single multihash code. The request here comes from the Filecoin ecosystem which is already using the underlying hashes here in relation to the codes reserved in #170 and #172.

However, it turns out the approach taken in the reservation of those codes was insufficient and as a result there’s a new spec proposal for a different multihash to better represent the data being referred to.

At a high level there were three approaches for describing these trees given here. We went with option 2, but that turns out to have been a mistake so now we’re going with option 3.


Why is this a good idea when you already had a code here?

You live and you learn. It turns out that for the particular data being dealt with the approach taken originally was problematic (and is IIUC largely why the related codec entries were tagged as filecoin rather than IPLD) and this one makes life easier. The reason CID and multihash exist is to make it easier to evolve through mistakes, this is one example of that.

Should other applications of merkle-tree hashes lean on custom multihashes rather than reusing existing ones?

As with my comment around IPLD codecs from a while ago I think the answer lies in what the application gain/loses out on by not reusing existing multihashes.

Is it reasonable to have multiple codes dealing with effectively the same data?

Situationally, yes. In this case it turns out there are better and worse ways to describe the same data and people might even disagree on what those are. The table allows us to support these options.

A related example I’ve heard mentioned in the past was if someone wanted to register codes for the internal components of the blake3 tree so that data could be referred to as either a Blake3 multihash of a 1GiB file, or a more specific Blake3 internal-node multihash for the root of the tree whose leaves are the 1GiB file. On the surface this seems like a reasonable request as well.

cc @ribasushi @willscott @rvagg @vmx

(p.s. I hope Rod and Volker forgive me for making them relive #161, #170 and #172)

@vmx
Copy link
Member

vmx commented Jul 27, 2023

From the FRC abstract:

it turns out to be much more useful to be able to express the tuple of (Root hash, size of tree) than just the root hash.

This is how I think it should be done. The size doesn't really matter for the hash. The size information is additional context that is needed for some application. To me, this is not what Multihash is for.

I've discussed with a lot of people about how to transmit additional context with CIDs at the LabWeek/IPFSCamp in 2022. I think there should be a way, but I still believe it shouldn't be part of the CID, but something separate.

@aschmahmann
Copy link
Author

This is how I think it should be done. The size doesn't really matter for the hash. The size information is additional context that is needed for some application. To me, this is not what Multihash is for.

Yes/No. The situation as it currently stands makes referring to the data needed by CID not particularly useful. It was acknowledged that the v1 Piece CIDs are not a real CIDs which is why the codecs are labeled as filecoin rather than ipld (#161 (comment) and #172 (comment)). This seems to be because (as mentioned in those issues) it wasn't clear why/how people were going to use these CIDs or if they'd be used outside of Filecoin internals. However, it turns out people now want to move around this data by CID and the current multihash + codec combination doesn't do that.

Fixing mistakes is IMO precisely what multiformats is for and realizing that Filecoin Pieces might need real Content IDentifiers (CIDs) so they can be used in other contexts (e.g. IPFS p2p transfer of data by piece CID) and that a new multihash is necessary to do so is the cost of learning (and the benefit of having multiformats).

@vmx
Copy link
Member

vmx commented Aug 2, 2023

I talked with various people about it, I now have a better understanding.

I've only thought about the producing side of things. You put in some data and you get a specific Merkle tree out. Though if you think about CIDs also as a way to retrieve the data, then the height matters. You would retrieve data and would want to verify that the data is the one you expected. Without the height, you could get a higher level of the Merkle tree (random bytes and not the data you expected) and it would still verify correctly.

This means that it now makes sense to me to include the height.

@rvagg
Copy link
Member

rvagg commented Aug 4, 2023

Should other applications of merkle-tree hashes lean on custom multihashes rather than reusing existing ones?

From an IPLD perspective (i.e. if these are multihashes intended for CIDs) then I think the answer would come down to what we are addressing. The original "bmt" entry we had, which I think was even thought to be useful for bitcoin binary merkle addressing, turns out to be not so useful because you're addressing the entire base set of data, which is not really what you typically want. From the IPLD perspective you typically want to navigate through links to individual pieces of these things.

For CommP, if we want it to address the entire base data being "hashed", then viewing the merkle process as a hash function itself makes more sense.

When we start doing things like inclusion proofs then it gets a bit more iffy. Does https://github.com/filecoin-project/FIPs/blob/master/FRCs/frc-0058.md muddy these waters a bit? Does data-segment neatly fold up into this new conception of CommP? Or do we keep those concerns entirely separate from this?

Also: you're going to have to work on table formatting (see CI) .. sorry but your long name is going to make that tricky, or you might have to shorten it.

@rvagg
Copy link
Member

rvagg commented Aug 4, 2023

And broadly this is +1 from me, just with some general questions about data-segment, and also the need to update the table formatting.

@alanshaw
Copy link
Member

Can be merged in draft at least? We've been using this in web3.storage for a long time now.

@rvagg
Copy link
Member

rvagg commented Apr 29, 2024

can be merged when it's mergeable; table formatting needs fixing up at least

@@ -146,6 +146,7 @@ transport-bitswap, transport, 0x0900, draft, Bits
transport-graphsync-filecoinv1, transport, 0x0910, draft, Filecoin graphsync datatransfer
transport-ipfs-gateway-http, transport, 0x0920, draft, HTTP IPFS Gateway trustless datatransfer
multidid, multiformat, 0x0d1d, draft, Compact encoding for Decentralized Identifers
fr32-sha2-256-trunc254-padded-binary-tree, multihash, 0x1011, draft, A balanced binary tree hash used in Filecoin Piece Commitments

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fr32-sha2-256-trunc254-padded-binary-tree, multihash, 0x1011, draft, A balanced binary tree hash used in Filecoin Piece Commitments
fr32-sha2-256-trunc254-padbintree,multihash, 0x1011, draft, A balanced binary tree hash used in Filecoin Piece Commitments

Suggestion for shortened name as per https://github.com/filecoin-project/FIPs/pull/808/files#r1361196071 . The huge realignment needed is an implicit flag that the proposed name is obnoxiously long.

Accepting above suggestion fixes the alignment requested by @rvagg as a side effect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does that pass the validator script? I thought it checked for spaces - but also happy to have the script edited to make accommodations for particularly large columns if someone wants to come up with an algorithm as I don't really like the idea of shuffling the whole table for this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some additional shortening options: sha2-256 -> sha256, trunc254 -> trunc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants