
RFD 0193 - Access Control Decision/PDP API #49132

Closed

Conversation

@fspmarshall (Contributor) commented Nov 18, 2024

@fspmarshall added the no-changelog label (indicates that a PR does not require a changelog entry) on Nov 18, 2024
@fspmarshall force-pushed the fspmarshall/policy-decision-point-rfd branch 2 times, most recently from 56340d3 to fc2a491, on November 25, 2024 16:30
@fspmarshall force-pushed the fspmarshall/policy-decision-point-rfd branch 3 times, most recently from a51f264 to 015f6ee, on November 27, 2024 15:39
@fspmarshall changed the title from "pdp api rfd draft" to "RFD 0193 - Policy Decision Point API" on Nov 27, 2024
@fspmarshall changed the title from "RFD 0193 - Policy Decision Point API" to "RFD 0193 - Access Control Decision API" on Nov 27, 2024
@fspmarshall added the rfd label (Request for Discussion) on Nov 27, 2024
@fspmarshall marked this pull request as ready for review on November 27, 2024 15:41
@fspmarshall changed the title from "RFD 0193 - Access Control Decision API" to "RFD 0193 - Access Control Decision/PDP API" on Nov 27, 2024
Comment on lines +238 to +330
- The mechanism by which trusted cluster dials are performed will need to be reworked s.t. all routing decisions are made at
the leaf proxy rather than the root proxy since access-control decisions will now get made as a part of routing. This will include
sending sideband information about user identity when forwarding the dial from root proxy to leaf proxy in order to ensure that
the leaf proxy can correctly map the identity without relying on user certificate information.
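As a rough illustration of the sideband identity idea, a sketch of what the root proxy might forward (field names here are hypothetical, not an actual Teleport wire format):

```go
// ForwardedIdentity is a hypothetical shape for the identity information a
// root proxy could attach when forwarding a dial to a leaf proxy, letting
// the leaf perform identity mapping and routing without parsing the user
// certificate.
type ForwardedIdentity struct {
	Username string              // Teleport username in the root cluster
	Roles    []string            // root-cluster roles, input to leaf role mapping
	Traits   map[string][]string // traits referenced by leaf mapping rules
}
```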
Contributor:

Special care might need to be taken when dealing with agentless nodes in a leaf cluster - see #36831.

Contributor Author (@fspmarshall):

The root problem in that issue seems to be that we're dialing agentless nodes from the root cluster for some crazy reason. Moving routing decisions to the leaf cluster will indirectly end up moving all agentless dials to the leaf cluster, which should fix that issue.

Comment on lines +246 to +337
- Agentless/openssh nodes are already using a "decisions at control plane" model, but they still depend on logins being present in
certificates. We may want to consider a mechanism for eliminating logins at the same time (e.g. by lazily provisioning certs containing
only the target login at the proxy).
Contributor:

Would this provide a solution to the too many principals issue?

Contributor Author (@fspmarshall):

As I understand the issue, yes. Lazily creating single-login certs is a good fix for that issue.

@espadolini (Contributor) commented Dec 3, 2024:

The proxy is wholly unrestricted in its ability to generate SSH certificates to log in to openssh nodes (and openssh doesn't have a way to enforce that a given user certificate is only for logging in to a specific server) so we might as well just use a user certificate with no principals (with a short duration and some embedded info about the teleport username and other data that might be useful for auditing).
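A minimal sketch of what minting such a certificate could look like with golang.org/x/crypto/ssh; the KeyId contents, extension name, and durations are illustrative assumptions, not Teleport's actual format:

```go
package main

import (
	"crypto/rand"
	"time"

	"golang.org/x/crypto/ssh"
)

// mintPrincipalLessCert signs a short-lived OpenSSH user certificate with no
// principals (OpenSSH treats a cert with no principals as valid for any
// login), embedding the Teleport username for auditing. caSigner is the user
// CA; userPub is the client's public key.
func mintPrincipalLessCert(caSigner ssh.Signer, userPub ssh.PublicKey, teleportUser string) (*ssh.Certificate, error) {
	now := time.Now()
	cert := &ssh.Certificate{
		Key:             userPub,
		CertType:        ssh.UserCert,
		KeyId:           teleportUser, // shows up in sshd audit logs
		ValidPrincipals: nil,          // no principals baked in
		ValidAfter:      uint64(now.Add(-time.Minute).Unix()),
		ValidBefore:     uint64(now.Add(time.Minute).Unix()), // short duration
		Permissions: ssh.Permissions{
			Extensions: map[string]string{
				"teleport-username": teleportUser, // hypothetical extension name
			},
		},
	}
	if err := cert.SignCert(rand.Reader, caSigner); err != nil {
		return nil, err
	}
	return cert, nil
}
```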

Note that this deviates somewhat from common PDP conventions in that we don't return a decision object wrapping the permit. Any result
other than allow must be communicated as an error.

- Common `RequestMetadata` and `PermitMetadata` types will be provided and should be included in the request and permit
Contributor:

Suggested change (replace the first line with the second):

- Common `RequestMetadata` and `PermitMetadata` types will be provided and should be included in the request and permit
- Common `EvaluateRequestMetadata` and `PermitMetadata` types will be provided and should be included in the request and permit

so it's less generic

}

message SSHAccessRequest {
RequestMetadata metadata = 1;
Contributor:

Suggested change (replace the first line with the second):

RequestMetadata metadata = 1;
EvaluateRequestMetadata metadata = 1;


The relocate phase will see us actually transition to all access-control decisions being made at the control plane. The method
of relocation will depend on the kind of teleport agent. For tunnel agents, we will start performing denial at the control plane
itself, and allowed dials will have the Permit object forwarded to the agent as part of the dial. For direct dial agents, the
Contributor:

Could we use this change to also prohibit direct dial that bypasses the proxy? Or maybe deprecate direct dial altogether? 🥺

The Teleport SSHD listener is already specialized to at least opportunistically understand additional data signed by the proxy as part of client IP propagation, if we don't have to support openssh connecting directly to a server's 3022 we'd avoid the extra roundtrip.

Contributor Author (@fspmarshall):

I'm certainly open to this idea. It would both simplify the conceptual model of direct dial agents going forward and improve performance. The security model of such a change would require some serious consideration, though. I'm wary of the idea of reusing the IP propagation mechanism; I'm rusty on the details of how it works, but IIRC that mechanism doesn't have the same encryption and replay-resistance properties one might expect from a TLS/SSH connection.

Contributor:

> if we don't have to support openssh connecting directly to a server's 3022 we'd avoid the extra roundtrip.

I'd ask that we don't take this decision too lightly. On the Machine ID team, we fairly regularly get folks asking for the ability to direct connect to a reverse tunnel agent! This is typically for a few reasons:

  • Performance: why make a trip out of the datacenter to where your Teleport Proxies are, when the two machines trying to connect to one another are in the same datacenter?
  • Performance: why swamp your Proxy with large amounts of traffic (e.g. copying a large binary from an Ansible master to a fleet of machines) and potentially impact other Teleport users?
  • Reliability: the ability for M2M SSH to succeed even if the Proxy is unavailable or the network is unreliable.

For the most part, we've been able to push back on this and make perf improvements to Teleport itself, but I do think one day we may hit a limitation here. It'd be nice to keep "machines" in mind here if possible!

Contributor Author (@fspmarshall):

The benefits of a PDP system are mostly negated if we still leave a mechanism in place for it to be bypassed (would mean agents still need to replicate and understand all access-control related resources, and we'd need to still restrict ourselves to only implementing features that are representable as a combination of cert state and roles).

If we don't preserve a mechanism for the PDP to be bypassed, then even if you direct dial an agent you still have to wait on a network round-trip to the control plane to get a decision from the PDP... so the reliability and perf points are mostly negated. There would still be potential benefits to long-running and/or heavy workloads (e.g. file transfer, or anything super latency sensitive) of course. IDK how strongly we weight those, but there might be alternative ways to address those cases that let us still adopt a more consistent PDP architecture. Ex:

  • Connection Upgrades: we could support a flow for upgrading a proxied connection into a direct connection under certain circumstances. This would ensure the PDP is always inline, but can permit connections to become out of line as appropriate.
  • Intrusive Proxies: this is already kinda possible in some cases, but we could improve support for running a local proxy inside of your datacenter that allows all datacenter-local operations to stay within the datacenter without the need to support multiple separate trust flows.

@strideynet (Contributor) commented Dec 18, 2024:

> The benefits of a PDP system are mostly negated if we still leave a mechanism in place for it to be bypassed

Agreed - I am not advocating for bypassing PDP. I am advocating for it being possible for the agent to act as the PEP for direct dial connections.

> so the reliability and perf points are mostly negated.

I concede that yes, this does mean that the PDP must be available for the client to be able to directly connect to the server - which negates the reliability aspect. However, I don't think this negates the performance benefits? Common management tools like Ansible are pretty much the opposite of lightweight and are extremely noisy. At least in my experience so far, latency between the master and the server can make a significant difference in the amount of time a playbook takes to run across hundreds or thousands of hosts. Ansible playbooks also tend to be "fragile" to failure, so even if latency is not a major concern, the success rate of a large/complex playbook is impacted by the connection travelling out to the proxy and back in.

> IDK how strongly we weight those

I'm a little biased - I feel like we've been 'underweighing' these so far and that this has had a negative impact on large deals and customer experience. If we don't consider "machine scale" more often in our decision making, then we're likely to keep introducing significant regressions which are unnoticeable at "human scale" but are show-stoppers for large customers. Being a little more proactive here is going to save us from dropping everything to be reactive for the next big use-case.

This perhaps isn't an argument for a specific choice/direction here - but just a more general plea that we try to think about this more.

> alternative ways to address those cases that let us still adopt a more consistent PDP architecture

Definitely open to these ideas. I don't think I'm dying on the hill of "we must keep direct connections", just keen that we ensure that we're planning for this long-term.

Comment on lines +206 to +266
During this phase we want to start moving as many elements as possible from being part of the certificate-derived identity to
being calculated on-demand within the PDP itself. As a prerequisite to this, we will need to implement a custom access request
Contributor:

Does the client even need to present credentials in this new world if the SSH connection already comes with a stapled permit from the control plane?

Contributor Author (@fspmarshall):

We'll need a full major version where the agent can "fall back" to local authorization decisions if need be, I think, but past that point I believe the answer is yes.

@fspmarshall force-pushed the fspmarshall/policy-decision-point-rfd branch from c7ff6bf to 778aafb on December 4, 2024 16:51
@fspmarshall force-pushed the fspmarshall/policy-decision-point-rfd branch from 778aafb to 2a9144f on December 6, 2024 17:36
@fspmarshall force-pushed the fspmarshall/policy-decision-point-rfd branch from 2a9144f to b84c1de on December 6, 2024 17:38
Comment on lines +219 to +223
To solve this, permits will contain "feature assertions", which will specify which features *must* be available for the permit to be
properly enforced. Enforcing agents will then reject otherwise allowable actions if the permit contains one or more unknown features.
Feature assertions will be represented as a repeated enum for efficient encoding. Because feature assertions can be appended to permits
on a case-by-case basis, we minimize false-positives (instances where access gets denied due to an outdated enforcement point even though
the specific access did not merit the use of any new controls).
Contributor:

Can we include a concrete example in this section? I'll share one to check my understanding.

Let's say we add a new feature where a role/policy can allow SSH port forwarding of only server ports or only client ports, and older PEPs only understand a binary allow/deny of port forwarding at all.

  • we'd add a feature assertion FEATURE_ASSERTION_ENHANCED_PORT_FORWARDING
  • the PDP would set this assertion only if the decision was to allow exactly one of server or client port forwarding
  • the older PEP (ssh agent) would have to completely disregard the permit if it doesn't understand the new assertion, meaning it could not allow SSH access at all
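A rough sketch of that fail-closed check on the agent side (all names are hypothetical, not the actual Teleport decision API). In proto3, an enum value a decoder doesn't know still arrives as a plain integer, so an older agent can detect assertions outside its known set and refuse to enforce the permit:

```go
package main

import "github.com/gravitational/trace"

type FeatureAssertion int32

const (
	FeatureAssertionUnspecified            FeatureAssertion = 0
	FeatureAssertionEnhancedPortForwarding FeatureAssertion = 1 // from the example above
)

// knownAssertions is the set this agent build understands.
var knownAssertions = map[FeatureAssertion]bool{
	FeatureAssertionUnspecified:            true,
	FeatureAssertionEnhancedPortForwarding: true,
}

// checkFeatureAssertions rejects the whole permit if it carries any
// assertion this agent does not understand.
func checkFeatureAssertions(assertions []FeatureAssertion) error {
	for _, a := range assertions {
		if !knownAssertions[a] {
			// Fail closed: the permit demands a control this agent lacks.
			return trace.AccessDenied("permit requires unsupported feature assertion %d", a)
		}
	}
	return nil
}
```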

Contributor:

I understand the same point.

A possible solution is to move the feature assertions into each individual option. For example:

```protobuf
message SSHAccessPermit {
  PermitMetadata metadata = 1;
  repeated string logins = 2;
  PortForwarding port_forwarding = 5;
  // ... (additional fields would go here)
}

message PortForwarding {
  repeated PortForwardFeatures features = 1;
  bool decision = 2;
}
```

This approach allows us to restrict specific parts of the protocol rather than the entire protocol. This would make it more practical to manage, even in cloud environments where agents are upgraded only after all tenants have completed their upgrades.

Contributor:

We must also consider the decision-making process for agents running different versions while serving the same resource, such as multiple Kubernetes agents advertising the same resource but operating on different versions. This scenario often occurs during upgrade windows when agents are updated one at a time, a process that can take several minutes to complete. In some cases, it may even require new Kubernetes nodes to be created and prepared to host the updated replica.

As a result, when the Policy Decision Point (PDP) evaluates whether access is permitted, it must account for all potential agent versions. This is critical because proxies randomly select their targets when multiple replicas are available.

Contributor:

> A possible solution is to move the feature assertions into each individual option.

I don't think we need to do that to get granularity; the features are already meant to be granular. There's little difference between a "global" _ENHANCED_PORT_FORWARDING and a "localized" one, but the latter means the features are spread out through the definitions and more messages need to change.

There are also compatibility problems with adding features to a message that didn't carry them before.

Contributor:

The difference between global and localized feature protection lies in how an agent handles unsupported features. If the agent is running an older version that doesn't support a newly introduced feature, a global enum representing the feature will be unknown to it.

Consider this scenario:

  • The control plane is running version vx.y.z.
  • The agent is running version vx.y.(z-2).
  • In version vx.y.(z-1), a new feature (enhanced port forwarding) was introduced, along with its corresponding feature protection in the protocol.

Since the agent is running a version older than the one where the feature flag was added, it will treat the new enum value as just an unrecognized number, with no context about the associated feature. As @nklaassen pointed out, in such cases the agent's only safe course of action is to block the entire protocol, for example by preventing SSH access to the server. This precaution is necessary because the unknown flag could control any critical SSH feature, such as port forwarding, agent forwarding, allowed commands... Without understanding the feature, the secure approach is to deny access entirely.

If features are localized, however, the unknown feature protection would be specific to port forwarding. In this case, the agent could make a more targeted decision: it would allow SSH connections but deny all port forwarding requests, since it cannot interpret the flag value.

In certain scenarios, ensuring correctness requires blocking SSH access entirely, which makes a global flag necessary. With the decision now centralized, either the Policy Decision Point (PDP) must have full knowledge of the features supported by each agent version, or we need to design a system that enables the agent to make appropriate decisions based on feature-protection flags.

Previously, the authentication server handled these decisions by blocking protocol access when needed—e.g., downgrading roles and applying deny labels like {'*': '*'}. However, this approach was not scalable, and there have been instances in the past where features were introduced without properly restricting access when agents did not support them.

Contributor:

Yes, I think this falls here:

  • If possible, implement new features s.t. it is safe for older PEPs to ignore them

Features are a last resort - if whatever we want can be done as you suggest, then it's a much better way.

Contributor:

> the PDP could set the legacy PortForwarding boolean to false (if any port forwarding option is denied), and set the new enhanced port forwarding options correctly, and that way old PEPs will fail-safe and never allow any port forwarding.

I'm concerned that we'll end up with multiple flags controlling the same feature, making it harder for the agent to determine whether access was denied due to roles or because the agent doesn't support the feature. When access fails, people typically check the agent logs. If the old port-forwarding setting is set to false and the agent cannot interpret the new enhanced port-forwarding setting, it will assume that access was denied by roles instead of indicating that the roles required an unsupported feature.

From a UX perspective, this is a nightmare to operate, especially for Cloud customers who do not have access to proxy or auth server logs. What they will see is the agent denying port forwarding with no logs attached.

Contributor:

The PDP could send an informational message to the agent when fallbacks are in effect. Would that solve this concern?

Contributor:

Doesn't that mean the PDP would need to be aware of all possible version features, or that we would have to send that information each time? If it's the latter, wouldn't that be essentially the same as directly sending the feature as an enum/JSON payload?

Contributor Author (@fspmarshall):

Updated the RFD with more concrete examples that hopefully flesh it out a bit more.

While I see why per-control feature assertions do have nice qualities, I'm wary to recommend them as I think they would introduce a lot of extra noise. The permits are already going to end up being fairly large/complex, and adding dedicated assertion fields for every individual control just in case we decide to make it more granular in the future strikes me as iffy. I think that the idea of leaving the old field in the "off" position and introducing a new one when a single control gets split into more granular sub-controls is likely going to be much cleaner in the long run, even if slightly less expressive. After all, we don't do that kind of thing very often, and we can always deprecate the old field after a few major versions to keep the permit from growing indefinitely.

provided for basic permit/denial control-flow. For example, a `decision.Unwrap` function will be provided that converts decision responses
into the form `(Permit, error)` where `error` is a `trace.AccessDeniedError` built from the contents of the denial.

- A standard `Resource` reference type with fields like `kind` and `name` will be provided and should be preferred in
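As an aside, the `decision.Unwrap` convention described above might look roughly like this minimal sketch (type names are illustrative assumptions, not the actual generated API):

```go
package decision

import "github.com/gravitational/trace"

// Illustrative stand-ins for the generated response types.
type SSHAccessPermit struct{ Logins []string }
type SSHAccessDenial struct{ Reason string }

type EvaluateSSHAccessResponse struct {
	Permit *SSHAccessPermit
	Denial *SSHAccessDenial
}

// Unwrap converts a decision response into (Permit, error) form so callers
// can treat any non-permit outcome as a regular access-denied error.
func Unwrap(rsp *EvaluateSSHAccessResponse) (*SSHAccessPermit, error) {
	if rsp.Permit != nil {
		return rsp.Permit, nil
	}
	reason := "access denied"
	if rsp.Denial != nil && rsp.Denial.Reason != "" {
		reason = rsp.Denial.Reason
	}
	return nil, trace.AccessDenied("%s", reason)
}
```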
Contributor:

There are scenarios where a sub-resource condition is required for further analysis, such as filtering Kubernetes groups based on kubernetes_resources.

For example:
A user may have system:masters access to a specific Kubernetes namespace but only read permissions for other namespaces.

This is achievable only if:

  • At evaluation time, the relevant sub-resource is provided to enable filtering. However, this isn't always feasible—for instance, when filtering resources, the target is not always known.
  • The response includes a list of repeated oneof decision { SSHAccessPermit permit = 1; SSHAccessDenial denial = 2; } objects, containing all possible combinations derived from roles. These objects would then be forwarded to the agents for further processing. Each entry in the array defines the conditions under which a particular set of kubernetes_groups applies for a particular kubernetes_resources match...

Contributor:

What about passing down the kubernetes resources as part of the permit? I would rather not have to deal with multiple permits or do the combinatorial explosion at the control plane.

Contributor:

There are cases where the kubernetes_groups are deeply tied together with kubernetes_resources.

For instance, when you run kubectl delete pods --all --namespace <namespace>, a request is made to DELETE /api/v1/namespaces/<namespace>/pods.

Given that kubernetes_resources might restrict the resources you can access, Teleport transforms the request into:

  • GET /api/v1/namespaces/<namespace>/pods
  • It filters out the pods the user doesn't have access to.
  • For each pod, it computes the kubernetes_groups the user is allowed to use. This requires intersecting kubernetes_labels with kubernetes_resources for each pod name, namespace and verb; for this, the PEP must know the full mapping from kubernetes_resources to kubernetes_groups.
  • Finally, for each pod it executes the request DELETE /api/v1/namespaces/<namespace>/pods/<pod_name>

Because of this, it requires a more complex representation of permits.

Contributor:

I'm convinced that k8s complexity should warrant the change from single to multi-permit per request. What alternatives do you see that would allow us to keep a single permit?

Contributor:

We can update the K8SPermit structure as follows:

```protobuf
message K8SPermit {
  repeated KubeResourcesGroupMappings mappings = 1;

  // Note: no kubernetes_users or kubernetes_groups directly under this message
}

message KubeResourcesGroupMappings {
  repeated string kubernetes_groups = 1;
  repeated string kubernetes_users = 2;
  repeated KubernetesResources kubernetes_resources = 3;
}

message KubernetesResources {
  string kind = 1;
  string api = 2;
  string name = 3;
  optional string namespace = 4;
  repeated string verbs = 5;
}
```

This will become more complex to manage if we introduce other restrictions like k8s_resources (not planned), because it requires changing the messages and possibly introducing more fields, while with multiple permits it would be "easier".

Contributor:

After thinking a bit more about this problem, I think we will need to have multiple permits (allow/deny) or duplicate the mappings. This is necessary because users might have denied kubernetes_groups for specific kubernetes_resources.

Like:

```yaml
allow:
  kubernetes_groups: ["write", "read"]
  kubernetes_resources:
    - kind: "*"
      name: "*"
      namespace: "*"
      verbs: ["*"]
deny:
  kubernetes_groups: ["write"]
  kubernetes_resources:
    - kind: namespace
      name: kube-system
      verbs: ["*"]
```

This would require us to change the messages to allow multiple permits, or to duplicate the mappings as below:

```protobuf
message K8SPermit {
  repeated KubeResourcesGroupMappings allow_mappings = 1;
  repeated KubeResourcesGroupMappings deny_mappings = 2;

  // Note: no kubernetes_users or kubernetes_groups directly under this message
}
```
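To make the allow/deny duplication concrete, here is a rough Go sketch of how a PEP could compute the effective groups for one resource access under such a permit (all type names are hypothetical, mirroring the message shape sketched above):

```go
// ResourceAccess identifies one concrete Kubernetes access being enforced.
type ResourceAccess struct {
	Kind, Namespace, Name, Verb string
}

// GroupMapping pairs a set of groups with a resource-matching predicate;
// the wildcard/namespace matching logic is elided for brevity.
type GroupMapping struct {
	Groups  []string
	Matches func(ResourceAccess) bool
}

// effectiveGroups applies deny-overrides-allow semantics per concrete
// resource access: collect groups from matching allow mappings, then strip
// any group denied for this same access.
func effectiveGroups(allow, deny []GroupMapping, access ResourceAccess) []string {
	denied := make(map[string]bool)
	for _, m := range deny {
		if m.Matches(access) {
			for _, g := range m.Groups {
				denied[g] = true
			}
		}
	}
	seen := make(map[string]bool)
	var out []string
	for _, m := range allow {
		if m.Matches(access) {
			for _, g := range m.Groups {
				if !denied[g] && !seen[g] {
					seen[g] = true
					out = append(out, g)
				}
			}
		}
	}
	return out
}
```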


decision making (namely roles). In some larger clusters, role replication alone can account for a significant portion
of all load incurred after an interruption of auth connectivity.

- Compatibility: Because teleport access-control includes the ability to specify unscoped deny rules, there is no
Contributor:

Since the PDP makes decisions based on agent heartbeats, this creates a scenario where the same resource can be served by different agents. If those agents report differing labels in their heartbeats, the decision may vary depending on which resource instance is analyzed.

Previously, each agent enforced its rules based on the labels defined in its own configuration or memory. This ensured that even if agent A would allow the connection, a request forwarded to agent B, whose labels differ, would be denied by agent B.

Using heartbeats also creates another dangerous scenario: if an attacker compromised the credentials required to send heartbeats, or could modify resources in the database, the PDP could allow the operation based on compromised labels even though the target agents themselves weren't compromised. Access could effectively be modified, and live agents would accept it.

To maintain integrity, we need to share a fingerprint of the resource labels with the PEP. That way, the PEP can deny requests whose fingerprint does not match its actual labels, ensuring reliable and secure enforcement.
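A sketch of what such a fingerprint could look like: a deterministic hash over canonically ordered labels, which the PDP could embed in the permit and the agent could recompute over its live labels to detect a mismatch (the scheme is illustrative, not a proposed wire format):

```go
import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// labelFingerprint computes a stable digest over resource labels.
func labelFingerprint(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // canonical ordering so the digest is stable
	h := sha256.New()
	for _, k := range keys {
		// Length-prefix key and value to avoid ambiguity between pairs.
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(labels[k]), labels[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}
```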

Contributor:

In a proxy peering scenario, which proxy is the one that calls the PDP?

Contributor:

The least impactful way would be that the policy document is obtained by the client-side proxy and passed to the agent-side proxy (as part of the proxy peering dial request) to be sent over the reverse tunnel.
