Update go-ipld-git to a go-ipld-prime codec #46

willscott · 2021-02-23T08:16:35Z

The existing go-ipld-git codec allows for translation between git file data, and go-ipld-format node representations of the IPLD dag. This work attempts to instead provide a go-ipld-prime representation of that dag.

Decoder codec (git -> ipld-prime)
Encoder code (ipld-prime -> git)
Round-trip tests

fix #45

mvdan

SGTM, though take it with a grain of salt as this is the first time I look into git object formats. I also assume that the existing test suite is extensive enough to give us confidence of no regressions.

unmarshal.go

blob.go

personinfo.go

tree.go

mvdan · 2021-04-05T16:33:33Z

tree.go

-		_, ok := t.entries[p]
-		if !ok {
-			return nil
+	t := Type.Tree__Repr.NewBuilder()


I'm not sure I follow why we have a Tree__Repr builder here, and its result is built and assigned into a Tree builder, which is then again built. It seems like the other decode funcs follow the same pattern. Wouldn't a single Build per Decode be enough?

i think this came out of tests:
the default is to use a basic.any as the assembler, and if that was directly used to make a list / map you end up with a top level basic node rather than the schema'd type that we have in this package.
When that's true, the decode path can't fast-path, and has to re assign that basic node into the schema type.

I think I agree with mvdan here -- it's a little odd to have a concrete reference like this in a method that is already accepting a NodeAssembler as a parameter. When there is a NodeAssembler as a parameter, I would usually expect it to just be used directly. (And maybe documentation can state something about "it is recommended to use a Tree__ReprAssembler here for best results".)

This doesn't seem dangerously incorrect as written, but does seem a little baroque. And it is probably doing a little more memory dancing than is necessary or optimal, too.

mvdan · 2021-04-05T16:34:33Z

tree.go

+		return err
+	}
+	for {
+		node, err := DecodeTreeEntry(rd)


same as before; why not pass the la.AssembleValue() assembler to the decoder and let it assign directly into it?

tree.go

mvdan · 2021-04-05T16:40:50Z

util.go

+
+func cidToSha(c cid.Cid) []byte {
+	h := c.Hash()
+	return h[len(h)-20:]


Hm, isn't this a bit wasteful if we only end up using part of the result? Is this to strip the multihash prefix? Might be good to document it a bit.

right. this just uses the sha1 part of the multihash without the prefix to provide the git-compatible hash.

rvagg · 2021-07-21T06:23:11Z

@willscott @mvdan this codec changes all of the struct fields to camel-case, object -> Object, etc. Would you mind if we change it back for consistency with the old codec and the JS one? It's also closer to the native format's idea of naming as well. I can make the change here if you're both OK with that.

rvagg · 2021-07-21T06:41:21Z

Some additional thoughts here where the differences in forms can be seen from a pathing perspective: ipfs/kubo#8287

If you're OK with it, I'd like to try and get the forms here back to the original.

mvdan · 2021-07-21T09:45:30Z

That sounds fine to me - I'm not sure if ipld-prime codegen expects capitalized names, for the sake of exported names in the generated Go code.

willscott · 2021-07-21T14:56:11Z

I also have no problem with changing name case back for consistency

rvagg · 2021-07-22T11:55:48Z

0436bc7 now has the field names matching the original (lower-case, and some other minor adjustments). tagType is unfortunate since we can't use type in the codegen. Tree is now a map too so you can look up by directory name as per the original codec, which lets us path the same way back in the sharness test.

I think we need to do something about this strict need to match Go conventions on these names since it's not always going to match what you want your data model form to look like.

mvdan · 2021-07-22T12:05:47Z

Out of curiosity, would it be enough to add "repr" mapping of field names so that the repr layer uses lowercase names?

Aside from that, I wonder if it's a bit weird for the schema to have the field names be lowercase, but the type names be uppercase.

rvagg · 2021-07-22T12:16:34Z

Out of curiosity, would it be enough to add "repr" mapping of field names so that the repr layer uses lowercase names?

Sadly, no, because it's pathing at the data model layer that we want to care about, not encoding. Having renames is only going to help if we cared about these objects going into a standard codec (dag-cbor, dag-json), otherwise they're kind of useless here.

rvagg · 2021-07-22T12:18:43Z

I wonder if it's a bit weird for the schema to have the field names be lowercase, but the type names be uppercase.

This seems fine to me, it actually looks pretty reasonable to my eyes (eyes that have been trained on C-family-style and aren't as accustomed to the upper-case first character rule, granted). Have a look at the schemas that Hannah wrote up in https://github.com/ipfs/go-unixfsnode/blob/main/data/gen/main.go (although the schema field names don't match what's being generated, but you can see the same thing going on there).

mvdan · 2021-07-22T12:23:00Z

SGTM, thanks for confirming on both accounts. I've replied to ipld/go-ipld-prime#206 :)

Found in ipfs/go-ipld-git#46 integrating into go-ipfs which wants to load typed nodes when it encounters typed links. It errors because it's trying to load the typed node into the type of the typed link rather than the referenced type.

waiting for working link system (ipld/go-ipld-prime#143) to test sha1 (currently link system doesn't do the hashing)

…sage)

rvagg · 2021-07-29T03:21:37Z

@mvdan @willscott I've brought this up to what I think is a mergable state:

rebased on master - which brought in a bunch of stuff to get it working with the PL actions workflows
updated the README
done some minor jiggering of the types to match the README, and therefore closer to the original, (one rename of "text" to "message" and a shuffle of some of the struct fields, which mostly shouldn't have any impact on anything).
brought back the JSON tests using dag-json
updated to latest go-ipld-prime which fixes the typed link bug

and a few other minor things.

The main difference that I'm in two minds about is date handling - we're pulling out date and timezone and keeping them faithfully in their original format so round-trips work properly. The original codec presented them as a nice stringified date but with the original values kept internally.
An approach I've toyed with in the Bitcoin codec (in JS at least and I was thinking of doing the same in Go) was to do both - keep originals and present the nice usable forms. It just raises questions about what happens when/if they get out of sync. Two possible approaches would be: (1) error if they're out of sync when serializing, or (2) ignore the usable forms entirely when serializing.

For now I think it'd be fine to just leave this as is and leave it to the user to reform the stringified version if they want it, we could add back in a utility function to do that if needed, or evolve this code further in the future.

IMO this can be merged now.

willscott

all of these changes, mostly the lower-casing of fields, looks good to me.

We should get some sort of nod from stewards before merging, i think?

willscott · 2021-07-29T04:41:20Z

gen/gen.go

-	}, schema.SpawnStructRepresentationStringjoin(" ")))
+		schema.SpawnStructField("date", "String", false, false),
+		schema.SpawnStructField("timezone", "String", false, false),
+		schema.SpawnStructField("email", "String", false, false),
+		schema.SpawnStructField("name", "String", false, false),
+	}, schema.SpawnStructRepresentationMap(map[string]string{})))


why did you change this to a map rather than a stringjoin?

Because it's not being used as a stringjoin in any way? We have a custom PersonInfo#GitString() that's used to encode it in the git format and parsePersonInfo() to parse the input string for decoding. Stringjoin here is only going to mean that when it's encoded with dag-cbor or dag-json that it comes out as <date> <timezone> <email> <name>, which IMO is lossy and probably not what we're wanting out of this. Unless I'm missing something?

<date> <timezone> <email> <name> is the git format though. I would hope that wouldn't be lossy.

it's $name <$email> $maybedate $maybetimezone - it's lossy because the email has a <> around it and we can use the < to delineate the name which itself can contain any number of spaces, so simple stringjoin isn't enough

Agree with Rod's take here: this seems better to do at the codec layer and yield up to data model as a map.

If I was going to transliterate some git data to dag-json, I'd probably want this to become a map in dag-json. So that means the codec should do this parse/transform immediately.

mvdan

LGTM

mvdan · 2021-07-30T13:07:12Z

README.md

+type GpgSig string
+
+type PersonInfo struct {
+  date String


since you're not keeping dates in a human-friendly single string format, I wonder if you could make this unix timestamp an Int too.

aschmahmann

Mostly looks good to me, left a few comments/questions. I'm sure you folks have much greater context on the go-ipld-prime usage than I do.

It looks like there are some breaking changes to the data output by this codec. We should be dropping the list of breaking changes into the release tag for when we release this so I think it'd be great if we could writeup the list of breaking changes to the codec (at the spec level) for anyone who happens to be affected. For the purposes of review having an explanation of why a breaking change was made would also be appreciated.

I don't know if we have guidelines on how codec names should match to the internal names of the format we're representing with IPLD, but I'd suspect in general we'd try to match the native names.

README.md

aschmahmann · 2021-07-30T12:56:04Z

README.md

+    "date": "1503667703",
+    "timezone": "+0200",


Is the idea here that this is easier to work with an a single object? It doesn't seem to match how the Git spec handles things internally. Although perhaps it's fair for us to try and make things easier for our users here.

I was talking with @Stebalien about this and as far as we can tell there hasn't been much tooling developed around this codec so making a breaking change that helps people out and seems sane is probably fine.

(delayed reply to this) the main idea here is to keep the data lossless and not do too many tricks to make it palatable to the user.

Encoded form looks something like:

author A U Thor <author@example.com> 1465981137 +0000

But there's even some flexibility in what those final two fields can contain and apparently whether we get two, one or zero of them (I'm not sure about that detail though). So we treat them as strings and present them to the user as they are.

The current version of this codec keeps them internally but only presents a nice human-readable form as a DAG node, which isn't going to work when we're treating DAG traversal as traversal over the data model like we're doing with ipld-prime.

We did discuss (in this thread and other places) some options for making the human-readable form emerge out of the data to be available, but @willscott made the case that since this codec is a bit of a reference codec that we'd want to point others to for how to implement one that we'd better keep it simple for now.

So, lossless data, pure data model, no trickery, user gets to interpret the nodes as they want.

README.md

warpfork

Big picture, LGTM.

+1 big happy on the fix to capitizalization (field names being downcase). That makes more sense to me.

willscott · 2021-07-30T21:58:31Z

Making sure we've got all the loose ends covered here:

i looked at the at date handling code again. I think i prefer the current splitting out of timezone in the data model, and i believe it actually is backwards compatible - previous codec encodings will have the full date (with timezone) in their date field, and the prime codec will take that encoding and produce the correct git representation of that data.
We should see if we can clean up the the tagType field nomenclature per Field name restrictions tied too closely to Go language rules ipld/go-ipld-prime#206 (comment)

aschmahmann · 2021-07-30T22:07:26Z

Yes, and a review for other breaking changes. mergeTag seems like a breaking change that seems like something we can just restore. I'm not sure if there are others at the moment.

Posting what the release notes for this release would look like would also be great since as soon as this is merged we're going to cut a release anyway and that release is going to need release notes 😄.

rvagg · 2021-07-31T05:27:49Z

Have fixed up tagType to type thanks to Eric's hint, and mergeTag is now mergetag, plus a few more readability and clarity tweaks.

Original code has a MergeTag as well as a Tag with the main difference that one has a text field and the other a message field but otherwise they are the same. This one just has Tag now with message which I think is more in line with the git model. That'll be a breaking change from a DAG traversal perspective so we should note it in the release notes.

Will put together some basic release notes next week.

* mergeTag -> mergetag * tagType -> type

rvagg · 2021-08-02T03:15:13Z

Release notes:

BREAKING CHANGES

go-ipld-git is now a go-ipld-prime IPLD codec. Use Decode(na ipld.NodeAssembler, r io.Reader) error and Encode(n ipld.Node, w io.Writer) error for direct use if required.
There is now only one Tag type, MergeTag has been removed which had a text property. Use Tag's message property instead to retrieve the tag message from a commit's mergetag. i.e. <commit>/mergetag/message instead of <commit>/mergetag/text.
PersonInfo no longer exposes the human-readable RFC3339 format date field as a DAG node. The date and timezone fields are kept as their original string forms (to enable precise round-trips) as they exist in encoded Git data. e.g. <commit>/author/date now returns seconds in string form rather than an RFC3339 date string. Use this value and <commit>/author/timezone to reconstruct the original if needed.

warpfork · 2021-08-03T11:38:19Z

Might I suggest we drop that change info into a CHANGELOG.md file right in the diff? PRs disappear from memory quickly and aren't easily indexed over.

(I'm pretty happy with the loose format in go-ipld-prime/CHANGELOG.md#released-changes, if having an outline to riff from greases the wheels. :))

warpfork · 2021-08-03T12:15:22Z

(Informational note, no action required: I found myself nerdsniped by mergetag. Yes, that's a thing. Yes, keeping it alllowercase is correct. Docs here: http://github.com/git/git/blob/master/Documentation/technical/signature-format.txt .)

rvagg · 2021-08-04T02:49:25Z

Have added a changelog and fixed the readme bork.

Normally I don't like changelogs because it's too easy for them to get out of sync, but recently I've adopted the release-on-every-merge practice in JS and it includes both tag + github release + changelog updates and all I need to do is make sure my commit messages are informative and standardised, e.g. https://github.com/rvagg/cborg/blob/master/CHANGELOG.md & https://github.com/rvagg/cborg/releases & https://www.npmjs.com/package/cborg all from pressing the merge button. Can recommend, although Go folks have other things to worry about in the release process I suppose.

warpfork · 2021-08-04T10:20:21Z

Sweet. I think this is officially ready for the mergeparty (later this week? early next week?) where we land a bunch of these similar PRs in other repos at once.

aschmahmann · 2021-08-05T16:34:15Z

init.go

+func init() {
+	mc.RegisterEncoder(cid.GitRaw, Encode)
+	mc.RegisterDecoder(cid.GitRaw, Decode)
+}


Is this necessary for functioning, or just a convenience function for updating the global registry via module imports?

@warpfork I recall your aversion to global registries. Does encouraging codec creators to do this, but codec users to use custom LinkSystems when needed seem like a reasonable compromise?

i believe this was the compromise we ended up at, yes

Yep, that's exactly it.

Global registries have downsides, but using package-init-time's natural synchronization is just too valuable. And having ways to opt-out of using the global registry, if some really needs to, seems to mitigate the biggest downsides handily. So: 👍

rvagg · 2021-08-10T04:46:57Z

I have created a draft release & tag for this, v0.1.0: https://github.com/ipfs/go-ipld-git/releases

I'm working on some assumptions here about Go (and PL) release practices that may not be quite correct. So someone correct me in the next day or two before I press merge and create the release & tag. I would prefer to use semver properly and do a v1.0.0 but I don't think we're in the habit of doing that with Go so I'm assuming v0.1.0 is appropriate, a tag is sufficient and a GitHub release is a bonus.

willscott self-assigned this Feb 23, 2021

willscott marked this pull request as ready for review April 1, 2021 00:19

willscott requested a review from mvdan April 2, 2021 16:30

mvdan approved these changes Apr 5, 2021

View reviewed changes

willscott mentioned this pull request May 19, 2021

don't copy mutexes #48

Merged

rvagg mentioned this pull request Jul 22, 2021

(ipld-prime) Use mh.Set() for "hash" -> mhType & Make git plugin test work again ipfs/kubo#8287

Merged

This was referenced Jul 26, 2021

fix: typed links LinkTargetNodePrototype should return ReferencedType ipld/go-ipld-prime#211

Merged

switch dag put cmd to directly use prime ipfs/kubo#7995

Merged

rvagg force-pushed the prime branch from 59d313f to 718d11f Compare July 29, 2021 03:08

willscott added 8 commits July 29, 2021 13:09

remove GX cruft

716ba8d

Codec for decoding from byte stream to ipld prime Nodes for git data

dae2706

add encode path

9889aa2

tests updated.

fc1615b

waiting for working link system (ipld/go-ipld-prime#143) to test sha1 (currently link system doesn't do the hashing)

back to linksystem

d2d95b3

tests run, but something is unhappy

d0f7f53

fix some bugs

8a28576

fix bugs in tree and commit handling

e3c68b0

rvagg added 3 commits July 29, 2021 13:09

feat: type Parents links

a3ace0d

chore: update docs, reconcile with gen types (ordering and s/text/mes…

d811145

…sage)

fix: add back json tests

b2b169a

rvagg force-pushed the prime branch from 718d11f to b2b169a Compare July 29, 2021 03:09

willscott commented Jul 29, 2021

View reviewed changes

mvdan approved these changes Jul 30, 2021

View reviewed changes

aschmahmann reviewed Jul 30, 2021

View reviewed changes

warpfork approved these changes Jul 30, 2021

View reviewed changes

add back maintainers to readme

4e0e59b

rvagg mentioned this pull request Aug 2, 2021

Field name restrictions tied too closely to Go language rules ipld/go-ipld-prime#206

Closed

Further minimise changes to current release

5cc00cf

* mergeTag -> mergetag * tagType -> type

rvagg force-pushed the prime branch from 5200aad to 5cc00cf Compare August 2, 2021 03:04

Add CHANGELOG, minor README fix

9218898

rvagg mentioned this pull request Aug 4, 2021

Make Resolve(nil) return map of top-level objects, fixes #20 #35

Closed

aschmahmann reviewed Aug 5, 2021

View reviewed changes

bump ipld prime

6059d7a

willscott merged commit 44c3a56 into master Aug 12, 2021

aschmahmann mentioned this pull request Aug 23, 2021

Release v0.10 ipfs/kubo#8176

Closed

62 tasks

aschmahmann mentioned this pull request Aug 22, 2022

Create IPLD-Prime codec that can surplant the ipld-format functionality in this repo #45

Closed

Update go-ipld-git to a go-ipld-prime codec #46

Update go-ipld-git to a go-ipld-prime codec #46

Conversation

willscott commented Feb 23, 2021 • edited Loading

mvdan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

warpfork Jul 30, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rvagg commented Jul 21, 2021

rvagg commented Jul 21, 2021

mvdan commented Jul 21, 2021

willscott commented Jul 21, 2021

rvagg commented Jul 22, 2021

mvdan commented Jul 22, 2021

rvagg commented Jul 22, 2021

rvagg commented Jul 22, 2021

mvdan commented Jul 22, 2021

rvagg commented Jul 29, 2021

willscott left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

warpfork Jul 30, 2021 • edited Loading

Choose a reason for hiding this comment

mvdan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aschmahmann left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

warpfork left a comment

Choose a reason for hiding this comment

willscott commented Jul 30, 2021

aschmahmann commented Jul 30, 2021

rvagg commented Jul 31, 2021

rvagg commented Aug 2, 2021 • edited Loading

warpfork commented Aug 3, 2021

warpfork commented Aug 3, 2021

rvagg commented Aug 4, 2021

warpfork commented Aug 4, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rvagg commented Aug 10, 2021

willscott commented Feb 23, 2021 •

edited

Loading

warpfork Jul 30, 2021 •

edited

Loading

warpfork Jul 30, 2021 •

edited

Loading

rvagg commented Aug 2, 2021 •

edited

Loading