Add OEP draft: distinguishing central agent

rancher · Jan 30, 2023 · 4849c8c
1 parent 65b3b4f
Showing 2 changed files with 88 additions and 0 deletions.
diff --git a/enhancements/gateway/20220130-distinguishing-central-agent.md b/enhancements/gateway/20220130-distinguishing-central-agent.md
@@ -0,0 +1,88 @@
+# Distinguishing the central Opni cluster agent
+
+## Summary:
+This proposal describes a mechanism by which the agent running in the central Opni cluster can be distinguished from agents in other clusters.
+
+## Use case:
+This is a prerequisite for several other features. For example, identifying the central Opni cluster would be required for implementing a global metrics tenant.
+
+## Benefits:
+The ability to distinguish the central cluster from other clusters provides benefits such as:
+- It allows establishing an increased level of trust with a specific agent, allowing it to be authorized for tasks that cannot be entrusted to other agents.
+- It allows the UI, CLI, or other API consumers to distinguish the central cluster from other clusters, which is helpful for users.
+
+## Impact:
+The addition of a distinguishing label to the central cluster agent will have no impact on existing functionality.
+
+## Implementation details:
+
+Determining which agent is in the central cluster is a surprisingly non-trivial task. There are several ways to do this, and lots of edge cases to consider.
+
+The most obvious solution would be to handle this at some point during the in-cluster bootstrap phase. To self-bootstrap, the agent running in the central cluster will take advantage of the fact that it has access to the gateway's management API to create its own token, then bootstrap with it. However:
+- From the gateway's perspective, this is no different from a user creating a token and configuring an agent themselves. The gateway neither knows nor cares who is making requests to the API.
+
+Another solution might be to assign the distinguishing label if and only if the agent's requested ID matches the gateway's own ID. However, this has other problems:
+- Because we are now going to treat the distinguished agent as privileged in some contexts, we would need a way to verify the ID an agent claims to have. Until now, IDs have been arbitrary, so this could introduce new edge cases to consider. For now, we should continue to treat IDs as arbitrary and opaque.
+
+The notion that we should treat IDs as arbitrary and opaque is further supported by considering how other components of the system would determine whether an agent is the central agent or not. We don't want them to compare the agent's ID against an ID they procure themselves using outside information. Instead, we want them to check for the presence of a known label, and ignore the ID entirely.
+
+Rather than relying on a persistent label to identify a cluster, it might be better to use runtime-only information that is an attribute of an agent's current session and is never stored. This has some advantages:
+- Because distinguishing an agent is primarily intended to be used for *authorization* purposes, if the agent isn't connected to the gateway, it wouldn't need to be authorized to do anything anyway. The usefulness of knowing which agent *was* the central agent seems limited.
+- This eliminates the possibility that configuration issues could cause the central agent to be misidentified, or that the central agent could be impersonated by another agent.
+- This removes the need to keep track of the label and handle cases where it is removed or changed.
+- This removes the need to expand on either the bootstrap or token management APIs.
+
+Therefore, the best course of action might be to have the distinguishing label be a property of an agent's status. The current status message looks like this:
+
+```protobuf
+message Status {
+  google.protobuf.Timestamp timestamp = 1;
+  bool connected = 2;
+}
+```
+
+We could add a new field containing session attributes:
+
+```protobuf
+message Status {
+  google.protobuf.Timestamp timestamp = 1;
+  bool connected = 2;
+  map<string, string> sessionAttributes = 3;
+}
+```
+
+This field could contain implementation-specific and/or well-known labels that can be used for a variety of purposes, including distinguishing the central agent.
+
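A consumer of the management API might check the proposed field along these lines. This is a minimal sketch, not the Opni API: the Python `Status` class mirrors only the `connected` and `sessionAttributes` fields of the message above, and the attribute name `central` and the helper `is_central` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical well-known attribute name; the proposal does not define one.
CENTRAL_ATTRIBUTE = "central"


@dataclass
class Status:
    """Illustrative stand-in for the Status message (timestamp omitted)."""
    connected: bool = False
    session_attributes: Dict[str, str] = field(default_factory=dict)


def is_central(status: Status) -> bool:
    """Decide from session attributes only; the agent's ID is never consulted."""
    return status.connected and CENTRAL_ATTRIBUTE in status.session_attributes
```

Because the check depends only on the current session, a disconnected agent is never reported as central, and nothing prevents two connected agents from both carrying the attribute.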
+Compared with a persistent label, implementing session attributes is far simpler:
+1. The agent adds an additional header to its connection request, containing a list of session attributes it wants to assert.
+2. For each attribute, the gateway sends an additional challenge to the agent during the authentication handshake. The agent must then compute a MAC for the challenge using an implementation-specific secret, and send the result back to the gateway.
+3. The gateway verifies the MACs and adds the attributes to the agent's status if they are valid. If any of the attributes are invalid, the agent is disconnected.
+
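The three handshake steps above can be sketched as follows. The proposal only says "MAC", so the use of HMAC-SHA256, the 32-byte challenge size, the way the attribute name is bound into the MAC input, and the `"true"` attribute value are all assumptions of this sketch, as are the function names.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional

# Hypothetical attribute name; the proposal does not define one.
CENTRAL_ATTRIBUTE = "central"


def new_challenge() -> bytes:
    """Gateway side: a fresh random challenge for one asserted attribute."""
    return secrets.token_bytes(32)


def compute_mac(secret: bytes, attribute: str, challenge: bytes) -> bytes:
    """Agent side: MAC the challenge with the attribute's secret.

    Including the attribute name in the MAC input is an assumption here,
    so a response for one attribute cannot be replayed for another.
    """
    message = attribute.encode() + b"\x00" + challenge
    return hmac.new(secret, message, hashlib.sha256).digest()


def verify_attributes(
    secrets_by_attribute: Dict[str, bytes],
    challenges: Dict[str, bytes],
    responses: Dict[str, bytes],
) -> Optional[Dict[str, str]]:
    """Gateway side: return the verified attributes, or None to disconnect."""
    verified: Dict[str, str] = {}
    for attribute, challenge in challenges.items():
        secret = secrets_by_attribute.get(attribute)
        response = responses.get(attribute)
        if secret is None or response is None:
            return None  # unknown attribute or missing response
        expected = compute_mac(secret, attribute, challenge)
        # Constant-time comparison avoids leaking how many bytes matched.
        if not hmac.compare_digest(expected, response):
            return None  # any invalid attribute rejects the whole session
        verified[attribute] = "true"
    return verified
```

A `None` result corresponds to step 3's "the agent is disconnected"; otherwise the returned map is what would be copied into the agent's status for the lifetime of the session.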
+One way to provision the implementation-specific secret for the central agent could be:
+1. An octet key pair is generated by cert-manager and mounted into both the agent and gateway pods.
+2. The location of this key pair on disk is configured in both the agent and gateway config files (or checked for in a well-known location).
+3. The presence of the key pair triggers the agent to assert the relevant attribute.
+
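Step 3 of this list, where the presence of key material decides which attributes the agent asserts, might look like the sketch below. The function name and the idea of a mapping from attribute name to key path are hypothetical; the proposal leaves the actual locations to the config file or a well-known path.

```python
from pathlib import Path
from typing import Dict, List


def attributes_to_assert(key_paths: Dict[str, Path]) -> List[str]:
    """Return the attributes whose key file exists on disk.

    `key_paths` maps an attribute name to the configured (or well-known)
    location of its key; both are placeholders in this sketch.
    """
    return [attribute for attribute, path in key_paths.items() if path.is_file()]
```

An agent in a downstream cluster has no key mounted, so the list is empty and it connects without asserting anything.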
+Configuration could be done via config files, environment variables, or both.
+
+## Acceptance criteria:
+- [ ] The central agent can be distinguished from other agents using a well-known label in its session attributes list, obtained from the management API.
+- [ ] An agent with an ID that is not equal to the gateway's ID can successfully be set as the central agent.
+- [ ] Two agents can simultaneously be labeled as central agents.
+
+
+## Supporting documents:
+
+## Dependencies:
+Any dependencies that must be met before the proposal can be implemented. This could include other projects, external systems, or third-party services that must be in place before the proposal can be completed.
+
+## Risks and contingencies:
+- If we do end up needing to know whether a disconnected agent was labeled as the central agent, we can add session attributes to the "last known session details" metadata that is already tracked by the gateway for each agent connection. We will leave this feature out for now, since it could be confusing or redundant to have the attributes available in two separate places.
+
+## Level of Effort:
+1 week total:
+- Development: 2 days
+- Testing: 2 days
+- Docs: 1 day
+
+## Resources:
diff --git a/enhancements/gateway/README.md b/enhancements/gateway/README.md