Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Tsize values, fix HAMT and make recursive builder output match go-unixfs #21

Merged
merged 3 commits into from
Jan 25, 2022

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Jan 21, 2022

OK, this started as something entirely different, I'll get to that in another PR later. But while using this I noticed that Tsize isn't getting set properly on directories. Then while fixing that, I noticed that the HAMT doesn't quite work properly, so I fixed that.

There's two main tests in here, one for a smaller directory structure and one for a very large one that breaks out the sharding. I've generated the CIDs from go-ipfs by ipfs adding them. (and final Tsize values calculated by looking at link sizes in root blocks and adding the root block size to the total: ipfs dag get bafybei... | jq '[.Links[].Tsize | tonumber] | add' & ipfs block get bafybei... | wc -c).

It's a breaking change because all of the builder BuildX() functions return datamodel.Link, unit64, error now so the size is included (mainly because it needs to be for recursive internal collection, but the reported size is an accurate representation of the structure's weight so should be useful in some situations too).

While in here, I replaced the block size calculation hack (load saved block as raw and use byte length) with a new, even more exotic hack, to count the bytes as they are pushed out of the encoder by wrapping the LinkSystem, EncoderChooser and Encoder at each save point. Not trivial, but saves the round-trip out of the block store. @warpfork @mvdan I think the need for block size at the Store() call for well formed dag-pb graphs makes a pretty strong case that this should be easier to retrieve out of the API—perhaps just Store() should return 3 values?

@rvagg rvagg changed the title Fix Tsize values, fix HAMT and match recursive builder output match go-unixfs Fix Tsize values, fix HAMT and make recursive builder output match go-unixfs Jan 21, 2022
@willscott
Copy link
Collaborator

I did not find good documentation the last time i went through this on what the size of directories should be set as. Is that something we can have a spec document on and point to in comments?

@ribasushi
Copy link

I did not find good documentation the last time i went through this on what the size of directories should be set as. Is that something we can have a spec document on

Note that there is no agreement on what Tsize ought to represent, or whether it is required in the first place. A retroactive spec would be... difficult.

cc @achingbrain

@rvagg
Copy link
Member Author

rvagg commented Jan 21, 2022

Kind of lacking on the spec front - there's this vague area above the encoding format layer (specs I was responsible for) https://ipld.io/specs/codecs/dag-pb/spec/ and the unixfs layer https://github.com/ipfs/specs/blob/master/UNIXFS.md. These changes get in the middle there. When writing the dag-pb spec, I didn't consider it necessary to go into detail about the proper contents of the various fields, I probably thought that was an IPFS spec concern. But it looks like IPFS speccing was higher level still. So our specs for this are in the form of code for now.

At one stage there was talk of deprecating Tsize since it really only can be advisory, but it's become an important tool in the meantime (e.g.) and having consistency across our libraries means we get CID convergence. I can imagine complaints if/when this library replaces go-unixfs in go-ipfs and starts giving different CIDs for the same data just because we were relaxed about Tsize.

@rvagg
Copy link
Member Author

rvagg commented Jan 24, 2022

I've changed the test for the sharded directory and added another one such that we now test the boundary condition where it flips from unsharded to sharded by adding an extra file. Again I constructed the data using go-ipfs and found the boundary condition with it. The CIDs and final reported size are calculated from go-ipfs outputs too. No extra fixes required, this lib is getting the boundary right.

Copy link
Collaborator

@willscott willscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know - I certainly saw maintaining absolute byte compatibility with things like switchover point as a somewhat lower priority for new implementations. As long as either style can decode properly, I'm less worried about them generating and switching over at the same point.

I know there are some edge cases that would like it, but it's equally true that in setting that expectation of 'these are the only bytes that should be generated for this input' we make it harder to evolve the format and move to updated versions of unixfs.

@rvagg rvagg merged commit d3fccee into main Jan 25, 2022
@rvagg rvagg deleted the rvagg/tsize-all-the-things branch January 25, 2022 04:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants