Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFD-139. Gogo Must Go. #28386

Closed
wants to merge 7 commits into from
Closed

RFD-139. Gogo Must Go. #28386

wants to merge 7 commits into from

Conversation

mdwn
Copy link
Contributor

@mdwn mdwn commented Jun 27, 2023

This describes an initial RFD for removing Gogo from Teleport.

Note: The initial point of this RFD is not to provide a comprehensive solution, but to provide a starting place for ideation and to try to assemble many viewpoints to try to figure out what concerns we need to take in and how best to address them. I heavily suspect this RFD will change quite a bit.

Rendered

This describes an initial RFD for removing Gogo from Teleport.
- Allows us to remove gogoproto extensions without affecting backend storage.
- Allows us to utilize fields in the backend more efficiently and store larger objects.

The primary disadvantage here is that manually examining the backend is more difficult as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "examining" isn't the only concern here - sometimes during incidents we need to perform surgery on the Teleport store in order to be able to bring something back online.

Copy link
Contributor

@tigrato tigrato Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some time ago, @strideynet and I took a look at a hidden protobuf tag called json_name that when used with jsonpb allows you to do the same as gogo proto JSON but unfortunately it does not support structure embedding which we also abuse a lot


##### `gogoproto.stdtime`

We will need to migrate to using `timestamppb` directly instead of relying on gogo's `stdtime` tag.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this migration something we can start doing now ? Would this be a wire-compatible change ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK all of these changes should be wire compatible. This is something that we can do now (I'm putting together a POC for it as we speak, actually).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not allowed to break protobuf wire compatibility anyway. 😎

##### `gogoproto.nullable`

We will need to remove the `gogoproto.nullable` tags, which overall shouldn't be a huge impact.
We should migrate to using `optional` as defined [here](https://protobuf.dev/programming-guides/proto3/#field-labels).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know support for optional hasn't been in proto3 since initiation - and I've not used it yet with proto3. If we can't use it, then we can use the google provided wrappers: https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/wrappers.proto

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nullable tags are actually client side only and they are null if they are not set in proto. This happens when the value is the default value for the type so using a wrapper isn't an option as well because they change the wire representation

If a int field has zero value, it won't be marshalled to proto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yuck, I didn't even think about the fact that optional changes the wire representation. Yeah, this won't work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, all the same, CheckAndSetDefaults will realistically handle this for most structures. I think we can probably get away with just removing nullable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, nullable has no effect so we can handle internally

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think y'all got this backwards - optional is the default for all proto3 fields. It is required that got dropped between proto2 and proto3, because as it turns out required was a terrible idea to begin with.

I don't recommend writing optional before our proto fields, and strongly not recommend messing with required (since I'm mentioning it). None of this changes the wire format - that is something that just doesn't happen with Google protobuf.

Copy link
Contributor

@strideynet strideynet Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not as simple as it is on the tin. You are correct that proto3 gets rid of required and leans towards what could be called optional fields. However, an optional keyword was introduced in proto v3.15. With the Go generator, this usually yields a pointer to the expected type w/ something like:

message MyMessage {
  optional string foo = 1;
}
type MyMessage struct {
  // various pb fields snipped
  Foo *string
}

I can't say I'm a huge fan of pointers being used to indicate the presence/non-presence of a field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd hoped we learned our lesson about custom generator options by now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional is a perfectly legitimate feature (especially on scalar fields where you wouldn't otherwise get any "presence" info), but it's not supported by the javascript/typescript generator we're using, and gogoproto makes terrible code for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My usual advice would be to not rely on field presence - ideally you shouldn't care whether it's empty, nil, set explicitly, a default value, etc. If you do care about presence, then it's best to design in a way that makes it explicitly clear. A good example is using a FieldMask to inform partial updates - https://cloud.google.com/apis/design/standard_methods#update.

- Allows us to remove gogoproto extensions without affecting backend storage.
- Allows us to utilize fields in the backend more efficiently and store larger objects.

The primary disadvantage here is that manually examining the backend is more difficult as
Copy link
Contributor

@strideynet strideynet Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option here, as explored in devicetrust recently, is to write separate Go structs for storage, rather than marshaling/unmarshaling pb messages - https://github.com/gravitational/teleport.e/blob/master/lib/devicetrust/storage/types.go

This is something we can probably undertake gradually as an option if we wanted to break away from using the same types for storage as we do for RPC. There's various arguments for/against this which might be worth investigating.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm obviously biased, but I think this is the way to go - in fact, I think it's almost always the better way to go, as having API representations dictate storage is quite limiting (point in case this RFD section).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea. I'll update the RFD to reflect this.

rfd/0139-gogo-must-go.md Show resolved Hide resolved
- Allows us to remove gogoproto extensions without affecting backend storage.
- Allows us to utilize fields in the backend more efficiently and store larger objects.

The primary disadvantage here is that manually examining the backend is more difficult as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm obviously biased, but I think this is the way to go - in fact, I think it's almost always the better way to go, as having API representations dictate storage is quite limiting (point in case this RFD section).

##### `gogoproto.nullable`

We will need to remove the `gogoproto.nullable` tags, which overall shouldn't be a huge impact.
We should migrate to using `optional` as defined [here](https://protobuf.dev/programming-guides/proto3/#field-labels).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think y'all got this backwards - optional is the default for all proto3 fields. It is required that got dropped between proto2 and proto3, because as it turns out required was a terrible idea to begin with.

I don't recommend writing optional before our proto fields, and strongly not recommend messing with required (since I'm mentioning it). None of this changes the wire format - that is something that just doesn't happen with Google protobuf.

Copy link
Contributor

@codingllama codingllama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are broadly 2 ways to go about this:

  1. We stay with gogo, or remain gogo-compatible. In this path we do the least bit of work in Teleport and either fork gogo ourselves (which we might be forced to if we do nothing for too long) or adopt an active gogo fork. (Not saying I like this, but it's an option to be considered.)

  2. We move away from Gogo, likely to plain Google protobuf. In this path we systematically rid ourselves of Gogo extensions until, eventually, we can move out of it. This is more work, but has the potential to land us in a close-to-ideal situation in the future (this is a no-brainer for a new project, for example). We'll need to map problems and solutions clearly so everyone can chime in as we do it.

@tigrato
Copy link
Contributor

tigrato commented Jun 28, 2023

I lean towards this approach

We move away from Gogo, likely to plain Google protobuf. In this path we systematically rid ourselves of Gogo extensions until, eventually, we can move out of it. This is more work, but has the potential to land us in a close-to-ideal situation in the future (this is a no-brainer for a new project, for example). We'll need to map problems and solutions clearly so everyone can chime in as we do it.

Because it fixes several problems while moving away from gogo.

  • Get rid of Gogo
  • Allow us to split our AuthService into smaller services without having a huge impact now
  • Allow us to decouple API and storage objects

In theory, it takes more work but thinking about the future it can save us 2 major refactors

rfd/0139-gogo-must-go.md Show resolved Hide resolved
rfd/0139-gogo-must-go.md Outdated Show resolved Hide resolved

##### Write custom marshalers for objects

In order to keep our user facing representations of the objects, we will need to write
custom JSON and YAML marshalers for our objects. This may be time intensive but not overall
a significant amount of work.

#### Embedding types (removing `gogoproto.embed`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes like this will cause Go-level incompatible changes in api/.

I think we should discuss whether these can be avoided and, if not (or not worth it), what kind of "migration" path we are thinking of implementing. (Ie, how we'll deal with customer breakages and communication, potential e/ breakages, etc.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm. I'm not sure we can avoid breakages without keeping gogo intact, or at least a segment of Teleport that still maintains has it. Even getting rid of nullable will cause a Go level breakage here. Maybe we could have an api-legacy that we maintain for 3 versions and then remove?

I don't think we can avoid e breakages without keeping the object interfaces intact, or without doing some deep wizardry on the backend service interfaces, which sounds hideous.

Customer breakages should definitely be well communicated, but I'm not sure of the best mechanism there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Communication and a plan seem good enough in this case. It's worth noting that this could cause pains for a few versions while everything is moved to new formats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we currently care much about code-level incompatibility in the API module, fwiw.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we caused breakages for customers a few times because of API changes. @zmb3 for feedback on how much we should worry about this.

@mdwn
Copy link
Contributor Author

mdwn commented Jun 29, 2023

Hey all, I've introduced a new object AccessList and attempted to implement it in a "correct" way. Do you all mind taking a look? I'm not expecting a full on review, but perhaps this could be used to inform this RFD.

#28385
#28479

Thoughts?


##### Write custom marshalers for objects

In order to keep our user facing representations of the objects, we will need to write
custom JSON and YAML marshalers for our objects. This may be time intensive but not overall
a significant amount of work.

#### Embedding types (removing `gogoproto.embed`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we currently care much about code-level incompatibility in the API module, fwiw.

rfd/0139-gogo-must-go.md Outdated Show resolved Hide resolved
Comment on lines +172 to +173
It's imperative that these objects are compatible with the current JSON/YAML representations in Teleport.
That is, the objects that are in Teleport today should unmarshal properly into our new internal objects.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we freeze the current resource types and require a version bump (to a "new world" version that uses protojson in the backend and a dedicated service for RPCs) for any future change, to avoid accidental mistakes slipping through, or do you think we can chug along for a while manually adding new fields to the conversion functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a good idea. I'm less sold on the protojson purely for the reasons I described in #28386 (comment).


##### `gogoproto.stdtime`

We will need to migrate to using `timestamppb` directly instead of relying on gogo's `stdtime` tag.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not allowed to break protobuf wire compatibility anyway. 😎

Comment on lines +110 to +111
Rather than use gRPC's wire format or serialized JSON format, we should consider writing
a storage object format. This has a number of benefits:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use protojson for this? It's a canonical encoding of protobuf, and I don't see us abandoning protobuf anytime soon. My guess is that if we require a hand-rolled json struct for this we'll end up writing a json struct that's almost field-by-field compatible with the protobuf counterpart and we'll just deserialize-serialize it to convert it anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a few reasons for doing this:

  • By writing our own storage object, we decouple the API representation of objects from our backend.
  • We can ensure backwards compatibility with existing backends with no migration step, though migration isn't terrible.
  • We can use these objects subsequently for CLI display and maintain the current k8s-like JSON/YAML structure that we've got.

What do you think?

rfd/0139-gogo-must-go.md Outdated Show resolved Hide resolved
@mdwn mdwn marked this pull request as ready for review June 30, 2023 17:55
@github-actions github-actions bot added rfd Request for Discussion size/md labels Jun 30, 2023
@github-actions github-actions bot requested review from kimlisa and ravicious June 30, 2023 17:55
@ravicious
Copy link
Member

Removing myself as a bot assigned reviewer as I won't have anything meaningful to add here.

@ravicious ravicious removed their request for review July 3, 2023 13:25
@mdwn
Copy link
Contributor Author

mdwn commented Jul 3, 2023

Okay, so after some offline discussion, I think a major point of contention is storage objects vs. using protojson directly. I'm in favor of storage objects because I think they give us greater flexibility and allow us to obfuscate away protobuf specific things from our internal logic. However, if we use protojson, then we can get around needing to write custom marshalers/unmarshalers entirely.

I wanted to kick off a discussion about this. Thoughts @strideynet, @codingllama, @espadolini, @tigrato?

@codingllama
Copy link
Contributor

Okay, so after some offline discussion, I think a major point of contention is storage objects vs. using protojson directly. I'm in favor of storage objects because I think they give us greater flexibility and allow us to obfuscate away protobuf specific things from our internal logic. However, if we use protojson, then we can get around needing to write custom marshalers/unmarshalers entirely.

I wanted to kick off a discussion about this. Thoughts @strideynet, @codingllama, @espadolini, @tigrato?

I'm +1 on storage objects, but I also have a suggestion. Something I've done a few times in the past is to keep a 1:1 mapping between API and storage initially, but then refactor to a distinct storage object once there's a reason to deviate. If the main concern is the cost of having separate representations, this is a way around that.

I do like the clarity we get from separate representations and don't mind the initial cost, specially for big distributed teams like Teleport, but I've also seen this argument go around enough times. Both alternatives are doable.

@codingllama
Copy link
Contributor

For reference, we had explicit storage definitions for device trust from the start: https://github.com/gravitational/teleport.e/blob/1961fb1e4ec54aab8f6ec014439acf83407ef690/lib/devicetrust/storage/types.go#L14-L25

All conversions are explicit, "regular" Go code. This is a simple example: https://github.com/gravitational/teleport.e/blob/1961fb1e4ec54aab8f6ec014439acf83407ef690/lib/devicetrust/storage/storage.go#L1426-L1443

These "unentangle" API protos from JSON tags (and Gogo), but a strong reason for them is to allow us to read, write and expire certain "parts" of the representation separately from others.

@mdwn
Copy link
Contributor Author

mdwn commented Jul 3, 2023

I'm +1 on storage objects, but I also have a suggestion. Something I've done a few times in the past is to keep a 1:1 mapping between API and storage initially, but then refactor to a distinct storage object once there's a reason to deviate. If the main concern is the cost of having separate representations, this is a way around that.

Can you give an example of what this might look like in practice?

@codingllama
Copy link
Contributor

I'm +1 on storage objects, but I also have a suggestion. Something I've done a few times in the past is to keep a 1:1 mapping between API and storage initially, but then refactor to a distinct storage object once there's a reason to deviate. If the main concern is the cost of having separate representations, this is a way around that.

Can you give an example of what this might look like in practice?

In Teleport terms we'd json.Marshal the proto directly to storage to begin with, like we do for most types now. Eventually you refactor into something like the device trust types I mentioned above - with strong parity to begin with, then greater liberty over time.

The main hurdle here is that we would have to move away from Gogo while keeping the JSON representation 100% compatible, which I imagine is going to be hard or awkward to do for some of the existing types. Selectively applying protojon and custom written types may be the way to go.

@mdwn
Copy link
Contributor Author

mdwn commented Jul 3, 2023

I'm +1 on storage objects, but I also have a suggestion. Something I've done a few times in the past is to keep a 1:1 mapping between API and storage initially, but then refactor to a distinct storage object once there's a reason to deviate. If the main concern is the cost of having separate representations, this is a way around that.

Can you give an example of what this might look like in practice?

In Teleport terms we'd json.Marshal the proto directly to storage to begin with, like we do for most types now. Eventually you refactor into something like the device trust types I mentioned above - with strong parity to begin with, then greater liberty over time.

The main hurdle here is that we would have to move away from Gogo while keeping the JSON representation 100% compatible, which I imagine is going to be hard or awkward to do for some of the existing types. Selectively applying protojon and custom written types may be the way to go.

So the general idea is: its up to the implementer to choose protojson or custom marshaling while maintaining backwards compatibility?

@espadolini
Copy link
Contributor

espadolini commented Jul 4, 2023

So the general idea is: its up to the implementer to choose protojson or custom marshaling while maintaining backwards compatibility?

More importantly, we can change that at the point when we have a reason to decouple the RPC-level representation from the storage-level representation.

@strideynet
Copy link
Contributor

Whilst I personally have a preference towards distinct go structs for storage, it's relatively easy for us to migrate to that in future on a per resource basis (assuming the behaviour of protojson is predictable enough to let us convert it to a Go struct without any mismatches). I imagine that allowing the use of the proto types for storage will be the path of least resistance here and encourage folks to follow the guidelines set out in this RFD. So with that in mind, I'm happy to support reusing the API types.

It'd definitely be nice to include within the RFD the strengths/weaknesses of both, and consider them both the "accepted implementations" going forward, as well as including guidance on when to move to pure Go structs and how to do so. Otherwise I think we risk having a bunch of different implementations across all the different packages - it'd be nice to have "legacy", "Proto Messages" and "pure Go struct" as the maximum number of variants 😓

@espadolini
Copy link
Contributor

An argument against using the same representation for the API level and the storage level storage is that at times it can be a bit of an antipattern to use the storage object in a Create/Upsert RPC whose intent is not "store this data in the backend" but "please create an entity in the cluster with this object as a loose template", as we end up only using a subset of the object and validating away or ignoring the rest.

Conversely, we can end up in a situation where the resource in the backend is implicitly a subset of the API level object, which is also confusing: UserV2 stores the .spec.local_auth field in a separate backend item, which gets marshaled/unmarshaled separately and transparently recombined if the appropriate options and permissions are set; splitting the backend representation from the API representation would make this obvious.

@mdwn
Copy link
Contributor Author

mdwn commented Sep 29, 2023

Hey all (@espadolini, @strideynet, @codingllama), I'd like to revive this and try to take it through to approval/it's natural conclusion so that we have something to reference going forward. I've made some updates since we last discussed this, do you all mind taking a look and see what you think about its current state?

Copy link
Contributor

@codingllama codingllama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my main feedback, after seeing how some of the ideas played out in access lists, is to use the protos for communication between layers and limit the custom-written objects to storage. The "duped" types under api/types add significant boilerplate, contribute to the existence of the catch-all types package and cause some other problems of their own (like not being easily cloneable).

I realize this doesn't answer all the problems, but I'm not ready to adopt the RFD in its entirety yet.

// ResourceHeader will contain the version, kind, and metadata fields.
ResourceHeader

Spec *userGroupSpec `json:"spec" yaml:"spec"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What value does a spec provide? Can we get rid of that pattern too?

@mdwn
Copy link
Contributor Author

mdwn commented Nov 29, 2023

Hey all, just taking another look at this. I'll take a swing at this and update, but I'm thinking the thing to do here is to get rid of the resource header as a recommendation and then we can just use protojson as per @espadolini's earlier recommendations. That will get rid of the conversion functions as well. Thoughts?

@mdwn
Copy link
Contributor Author

mdwn commented Mar 9, 2024

I think much of this has been obviated by https://github.com/gravitational/teleport/blob/master/rfd%2F0153-resource-guidelines.md. I think at this point it will need a rewrite any way, so I'm going to close this.

@mdwn mdwn closed this Mar 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfd Request for Discussion size/md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants