
Scalable direct pod addressability #36596

Open
howardjohn opened this issue on Dec 21, 2021 · 3 comments
Labels: area/networking, lifecycle/staleproof

Comments

@howardjohn (Member)

Currently, Istio identifies traffic as follows:

  • TCP/TLS: Must have a VIP

  • HTTP: Must have the Service name in the Host header

  • Auto: Must have a VIP (even if the traffic is HTTP, the hostname is not used)

  • Headless TCP/TLS: Match any Pod IP via a dedicated listener

  • Headless HTTP: Match the Service name in the Host header, or *.<service>

  • Headless auto: Match any Pod IP via a dedicated listener (even if the traffic is HTTP, the hostname is not used)

There are two main problems here:

  • Headless is not scalable, nor as well supported as the other cases. We create a listener per pod; in large headless services (typically DaemonSets), this leads to massive XDS payloads. On 300-node clusters with a few headless services we saw 15 MB LDS configs. Headless services also typically do not set a Host header, so they only work when using the auto protocol, whereas we generally prefer that users name the protocol explicitly. Headless also has weak auto-mTLS support - it requires a homogeneous cluster (all mTLS or none).
  • Direct pod connections are not identified, and thus do not get mTLS (among other issues). There are some workarounds, like creating an orig_dst DestinationRule (see the sketch below), but it only works with HTTP and has the same mTLS issues as headless. See Make pod addressability work even in meshes and drop fallbacks (knative/serving#10751).

This issue tracks what we can do to improve this.
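For reference, the orig_dst DestinationRule workaround mentioned above is presumably something like the following sketch (the rule name and host are placeholders; PASSTHROUGH forwards each connection to the caller's original destination IP rather than load balancing):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: direct-pod-passthrough                       # hypothetical name
spec:
  host: my-headless-svc.default.svc.cluster.local    # placeholder host
  trafficPolicy:
    loadBalancer:
      # Use the original destination (pod) IP requested by the caller,
      # instead of load balancing across Service endpoints
      simple: PASSTHROUGH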

@howardjohn (Member, Author) commented on Dec 21, 2021:

For pod connections, one solution is a smarter PassthroughCluster. Today, it is just an original_dst cluster, so we blindly pass everything through. Ideally, what we would have is:

if destinationIsAPod() { 
  forwardWithMTLS()
} else {
  passthrough()
}

As far as I know, there is no scalable way to do this today. My assumption is that, to scale, we need to provide Envoy with the set of pod IPs in EDS (or equivalent), ideally duplicating each IP no more than once.

One thing that is close is using Envoy endpoint subsets. For example:

  clusters:
  - name: Passthrough
    type: STATIC
    lb_subset_config:
      subset_selectors:
      - keys:
        - address

    load_assignment:
      cluster_name: Passthrough
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 1.2.3.4
                port_value: 80
          metadata:
            filter_metadata:
              envoy.lb:
                address: "1.2.3.4:80"
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 2.3.4.5
                port_value: 80
          metadata:
            filter_metadata:
              envoy.lb:
                address: "2.3.4.5:80"

Then add a network filter that sets the envoy.lb address metadata equal to the original destination (I didn't find a built-in way to do this in Envoy, but it's a trivial filter to write).

What this does is allow us to pass requests through, like orig_dst, but still attach metadata to them. Importantly, this metadata can be our tlsMode metadata and be used for a transport_socket_match, allowing passthrough requests to be upgraded to mTLS. For multi-network setups with IP conflicts, we will favor the local network, since requests to direct pod IPs do not traverse the network gateway and so must target local IPs.
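As a rough sketch of that last point: assuming the passthrough endpoints carry the same tlsMode metadata Istio already attaches to Service endpoints (under the envoy.transport_socket_match namespace), the cluster's transport_socket_match could look along these lines (mTLS certificate/SDS details elided):

clusters:
- name: Passthrough
  # ... subset_selectors / load_assignment as in the example above ...
  transport_socket_matches:
  # endpoints whose metadata contains tlsMode: istio are upgraded to mTLS
  - name: tlsMode-istio
    match:
      tlsMode: istio
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        # ... Istio mTLS certificate / SDS config elided ...
  # anything without a match stays plaintext passthrough
  - name: tlsMode-disabled
    match: {}
    transport_socket:
      name: envoy.transport_sockets.raw_buffer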

What is missing here is the fallback if there is no match. Envoy does have a fallback_policy, but it only allows picking any endpoint, falling back to a predefined default subset, or failing; what we want is "fall back to orig_dst".
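For context, a rough sketch of the existing fallback options on the subset config; none of them gives us "fall back to the original destination":

lb_subset_config:
  subset_selectors:
  - keys:
    - address
  # NO_FALLBACK (default): requests that match no subset fail
  # ANY_ENDPOINT: requests that match no subset may go to any endpoint in the cluster
  # DEFAULT_SUBSET: requests that match no subset go to a pre-declared default_subset
  fallback_policy: ANY_ENDPOINT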

Just doing this alone could get us mTLS, but we would still have degraded telemetry and likely other degraded features. To take it a step further, we could consider adding explicit clusters for some groups of pods. What the groups are is open to discussion - it could be Services (and pick one if there are multiple...? we do this for headless), canonical service, or something else. From there, we could add a filter chain match (FCM) extension that could match based on EDS data. For example, to implement headless services we could do:

listeners:
- filter_chain_match:
    from_eds: outbound|80|headless
  filters:
  - set_dest_addr_metadata: {}
  - tcp_proxy: outbound|80|headless
endpoints:
- cluster_name: outbound|80|headless
  subset_config: { ... address subset config ... }
  endpoints: { ... same as above, with address metadata ... }

What this would do is look at all pod IPs/ports in the outbound|80|headless EDS response. If a request matches one of them, the filter chain matches. From there, we send the traffic to this cluster and select the original destination endpoint. Because of the associated metadata, we can selectively enable mTLS, as we do with Services.

related: envoyproxy/envoy#15750

@istio-policy-bot added the lifecycle/stale label on Mar 22, 2022
@howardjohn (Member, Author) commented:

Not stale

@istio-policy-bot removed the lifecycle/stale label on Mar 22, 2022
@istio-policy-bot added the lifecycle/stale label on Jun 21, 2022
@istio-policy-bot added the lifecycle/automatically-closed label on Jul 6, 2022
@howardjohn added the lifecycle/staleproof label and removed the lifecycle/stale and lifecycle/automatically-closed labels on Jul 15, 2022