Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Envoy does not adhere to HTTP/2 RFC 7540 #6767

Open
trevorlinton opened this issue May 1, 2019 · 52 comments
Open

Envoy does not adhere to HTTP/2 RFC 7540 #6767

trevorlinton opened this issue May 1, 2019 · 52 comments
Assignees
Labels
area/http area/tls design proposal Needs design doc/proposal before implementation help wanted Needs help!

Comments

@trevorlinton
Copy link

Title: Envoy does not adhere to HTTP/2 RFC 7540

Description:

RFC 7540 Section 9.1.1 and 9.1.2 specifies when a request coming in through a re-used HTTP/2 connection is accidentally sent to a non-origin but authoritative server that a 421 response should be returned. This can happen if two servers one with a wildcard certificate (e.g., a.example.com) and another server (b.example.com) with a non-wildcard on the same IP address using SNI responds to requests those meant for server b.example.com will accidentally be forwarded down the re-used HTTP/2 connection for a.example.com. In this situation a.example.com should send back a 421 to indicate the request was destined for b.example.com. This forces browsers to re-establish a new connection, re-negotiate the SNI, and thus the backing server and subsequently route to the correct origin.

[optional Relevant Links:]

@mattklein123 mattklein123 added design proposal Needs design doc/proposal before implementation help wanted Needs help! labels May 2, 2019
@mattklein123
Copy link
Member

@alyssawilk @PiotrSikora thoughts on this? I haven't read the relevant RFCs in detail to fully understand what is needed.

@alyssawilk
Copy link
Contributor

Intersection of H2 specs and TLS handshake? I eagerly anticipate Piotr sorting this out :-P

@PiotrSikora
Copy link
Contributor

The gist is that browsers coalesce HTTP/2 connections pretty aggressively: when a browser opens connection to www.example.com and is presented with a certificate for *.example.com during TLS handshake, then it will re-use this connection for all requests to *.example.com, as long as the hostname resolves to the same IP (and some browsers don't even care about that).

As long as all *.example.com hostnames are served by the same listener/filter chain, this shouldn't be an issue in Envoy, since routing is happening on a per-request, and not per-connection basis (please correct me if I'm wrong).

However, if www.example.com (with *.example.com certificate) is served by one listener/filter chain, and app.example.com is served by another listener/filter chain, then we have an issue, since connections are latched to a single listener/filter chain for the lifetime of the connection (again, please correct me if I'm wrong), and if the connection to www.example.com is established first, then requests to app.example.com will be coalesced on the same connection, using configuration for www.example.com, and forwarded to the wrong backend.

One solution would be to send 421 Misdirected Request response to requests for hostnames that are not configured on a given listener/filter chain (but this wouldn't work if *.example.com is configured), or send 421 Misdirected Request response to requests for hostnames that are configured on other listeners/filter chains (but this requires a global list of all configured hostnames).

Another solution would be using HTTP/2 ORIGIN frame (RFC8336) to advertise allowed hostnames on a given listener/filter chain (but this requires a global list as well, and this extension is supported only by a few clients).

@jcrowthe
Copy link

Is is possible to reprioritize this issue? We have a use case where we have thousands of services behind hundreds of FQDN's that are served by a set of identical envoys, (all using a wildcard TLS cert). We exhibit this exact issue when HTTP/2 is enabled but not when enforcing usage of HTTP/1.1.

@haf
Copy link

haf commented Apr 15, 2020

Here's the CVE for this vulnerability https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-11767

@htuch
Copy link
Member

htuch commented Apr 15, 2020

CC @envoyproxy/security-team

@jpeach
Copy link
Contributor

jpeach commented Apr 28, 2020

One solution would be to send 421 Misdirected Request response to requests for hostnames that are not configured on a given listener/filter chain (but this wouldn't work if *.example.com is configured), or send 421 Misdirected Request response to requests for hostnames that are configured on other listeners/filter chains (but this requires a global list of all configured hostnames).

I can think of 3 alternatives:

  1. What if you could specify the HTTP response code for an RBAC filter DENY? Then the management server that configured the HCM can add an RBAC policy for the server names that it allows on that HCM and generate 421 on DENY.

  2. The management server could program the SNI server name check in the Lua filter, generating a 421 when it didn't match.

  3. Add a dedicated filter that can be configured with the acceptable SNI server names.

@HarinadhD
Copy link

What is the plan for fixing https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-11767 in envoy?

@mattklein123
Copy link
Member

AFAIK there is no plan currently. Someone needs to own this issue and drive a resolution if they are passionate about fixing it.

@jpeach
Copy link
Contributor

jpeach commented Jul 17, 2020

I hacked up a 421 response when the virtual host lookup fails and this works for a simple case that I tried. If this is a reasonable approach I'd need a bit of help to polish it up and figure out how to write tests for it.

https://gist.github.com/jpeach/e01f5f752eed5ffd09ea1f18634d1fc5

@lunighty
Copy link

lunighty commented Sep 7, 2020

I think I managed to find a workaround:

On Envoy instance I added an "envoy.lua" HTTP filter, that checks if the response code is a 404 (the same code, that is being generated for non-existent route) AND checks if the "x-envoy-upstream-service-time" header is NOT present.

The Lua code:

function envoy_on_response(response_handle)
    if response_handle:headers():get(":status") == "404" and response_handle:headers():get("x-envoy-upstream-service-time") == nil then
        response_handle:headers():replace(":status", "421")
    end
end

Example configuration on Envoy (fetched by LDS):

"http_filters": [
    {
        "name": "envoy.lua",
        "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua",
            "inline_code": `
function envoy_on_response(response_handle)
    if response_handle:headers():get(":status") == "404" and response_handle:headers():get("x-envoy-upstream-service-time") == nil then
        response_handle:headers():replace(":status", "421")
    end
end`
        }
    },
    {
        "name": "envoy.filters.http.router"
    }
]

@jpeach
Copy link
Contributor

jpeach commented Sep 7, 2020

I think I managed to find a workaround:

On Envoy instance I added an "envoy.lua" HTTP filter, that checks if the response code is a 404 (the same code, that is being generated for non-existent route) AND checks if the "x-envoy-upstream-service-time" header is NOT present.

Nice! That's a bit cleaner than my equivalent 👍

@GoelDeepak
Copy link

@jpeach @lunighty Is there a plan to have a fix in Envoy code or do you think the current approach with EnvoyFilter is sufficient?

@howardjohn
Copy link
Contributor

howardjohn commented Feb 10, 2021

May be a dumb question - if we have a scenario like

However, if www.example.com (with *.example.com certificate) is served by one listener/filter chain, and app.example.com is served by another listener/filter chain, then we have an issue, since connections are latched to a single listener/filter chain for the lifetime of the connection (again, please correct me if I'm wrong), and if the connection to www.example.com is established first, then requests to app.example.com will be coalesced on the same connection, using configuration for www.example.com, and forwarded to the wrong backend.

(from Piotr above)

How can we distinguish from a browser coalescing the request from someone legitimately sending a request with SNI=ww.example.com HOST=app.example.com? It seems like from Envoy's perspective, these are identical.

Also, RFC 7540 is about HTTP/2. The above example can be done with HTTP 1 - do we expect a 421 still?

@lambdai lambdai self-assigned this Feb 10, 2021
@lambdai
Copy link
Contributor

lambdai commented Feb 11, 2021

I am exploring the relationship among SNI, SAN in cert and Host in Http.
I will share a document after

@maennchen
Copy link

maennchen commented May 4, 2021

We now use this Lua Snippet:

function envoy_on_request(request_handle)
  local streamInfo = request_handle:streamInfo()
  if streamInfo:requestedServerName() ~= "" then
    if (string.sub(streamInfo:requestedServerName(), 0, 2) = "*." and not string.find(request_handle:headers():get(":authority"), string.sub(streamInfo:requestedServerName(), 1))) then
      request_handle:respond({[":status"] == "421"}, "Misdirected Request")
    end
    if (string.sub(streamInfo:requestedServerName(), 0, 2) ~= "*." and streamInfo:requestedServerName() ~= request_handle:headers():get(":authority")) then
      request_handle:respond({[":status"] = "421"}, "Misdirected Request")
    end
  end
end

EDIT: Fixed for HTTP requests where requestedServerName is empty
EDIT 2: Fixed for Wildcard SNI described by @lambdai

@kyessenov
Copy link
Contributor

(random passer by) @htuch This is an example of a very valuable and simple use case that needs Lua.

@htuch
Copy link
Member

htuch commented Jan 19, 2024

@kyessenov ack. Presumably Wasm could also work here at the expense of a heavier weight everything.

@PiotrSikora
Copy link
Contributor

(random passer by) @htuch This is an example of a very valuable and simple use case that needs Lua.

This is such an integral part of HTTP/2, that it should be fixed in Envoy core, instead of requiring users to implement workarounds in Lua/Wasm.

@htuch
Copy link
Member

htuch commented Jan 23, 2024

I think the meta-point here is that Lua/Wasm are useful temporary patches until a release with a fix can be deployed.

@kyessenov
Copy link
Contributor

@PiotrSikora yeah I agree. @lambdai do you still want to work on it, or should we leave it open for grabs?

@ldemailly
Copy link

temporary patch

opened this issue on May 1, 2019

(yes yes, inb4 why don't you make an MR ;-) )

@a-b-v
Copy link

a-b-v commented May 15, 2024

It's not an envoy problem, but a wrong envoy config. The problem can be solved using standard means

- "@type": type.googleapis.com/envoy.config.listener.v3.Listener
  name: https
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 443
      protocol: TCP
  listener_filters:
  - name: "envoy.filters.listener.tls_inspector"
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
  filter_chains:
  - name: https-domain-with-sni-and-wilcard-certificate
    filter_chain_match:
      server_names:
      - "a.domain.com"
    filters:
    - name: https-domain-with-sni-and-wilcard-certificate
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: https-domain-with-sni
        route_config:
          name: https-domain-with-sni-and-wilcard-certificate
          virtual_hosts:
          - name: 'a.domain.com'
            domains:
            - "a.domain.com"
            routes:
            - match:
                prefix: "/"
              route:
                cluster: upstream
             - name: 'other_domains'
            domains:
            - "*"
            routes:
            - match:
                prefix: "/"
              direct_response:
                status: 421
        http_filters:
        - name: https-domain-with-sni-and-wilcard-certificate
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
    transport_socket:
      name: https-domain-with-sni-and-wilcard-certificate
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
        common_tls_context:
          tls_certificates:
          - certificate_chain:
              filename: wildcard.crt
            private_key:
              filename: wildcard.key

@howardjohn
Copy link
Contributor

@a-b-v you (probably) don't want to 421 unless its a valid domain on a different SNI. Otherwise you should 404.

@a-b-v
Copy link

a-b-v commented May 15, 2024

imho if such a request is received, it's exactly exists valid domain suitable for the wildcard certificate on a different SNI. Or it's an attack and I don't care that to answer)

@a-b-v
Copy link

a-b-v commented May 16, 2024

anyway if a client receives 421, it must (MAY according to RFC) retry the request over a different connection. and this connection will have the correct SNI and a server will return 404 if no domain exists

@keithmattix
Copy link
Contributor

We haven't had a ton of movement on this; are we still looking for an owner?

@ggreenway
Copy link
Contributor

Yes, needs an owner and a viable proposal for how to fix (without breaking something else).

@keithmattix
Copy link
Contributor

I think @lambdai's first option makes the most sense:

HCM doesn't verify requested server name(SNI) with :authority. Theoretically, the first http stream in within the connection can raise the issue.
Should we provide an option in HCM to return 421 on the above conflict?

I'll probably need some help as I'm still onboarding to envoy' but I'd be happy to drive this forward next release, especially since it looks like a CVE has been filed against this behavior

@keithmattix
Copy link
Contributor

/assign @keithmattix

@ggreenway
Copy link
Contributor

I think the easiest solution is an option on the HCM to enforce matching SNI and :authority and return 421 if they don't match. That's not technically a full solution, because it may be valid to reuse a connection for some subdomains and not others, and this would preclude some valid connection reuse, but it would force everything into a working state, be straightforward to implement, and easy to understand and reason about.

@keithmattix
Copy link
Contributor

Yep that's exactly what I'm planning to implement. Optional flag on HCM for those who want stricter :authority -> SNI checking.

@howardjohn
Copy link
Contributor

I think this will both break traffic that would otherwise work entirely, even outside of HTTP/2 re-use, result in a decrease in connection pooling, and incorrectly lead to clients retrying when they shouldn't since it was a legitimate 404 not a misdirect (hopefully the browsers are smart enough not to infinite loop!). A flag to do so... sure. But I wouldn't turn it on in Istio, for example.

@keithmattix
Copy link
Contributor

@howardjohn what's the alternative? IIUC, each HCM would have to know about all of the other listeners/FCMs to distinguish between a true 404 vs. a retriable connection re-use error

@ggreenway
Copy link
Contributor

With a list of all hostnames that are currently supported, it would be easy with the current capabilities to only return a 421 when appropriate: just add all the "wrong here but supported" to another vhost that has a direct_response 421, and then have a wildcard vhost that returns 404.

Or am I missing something in the details? There are a lot of comments between this issue and the Istio one; I may be missing something important.

@keithmattix
Copy link
Contributor

@ggreenway not sure I completely understand your point. So each filter chain (match( data structure would have a list of all of the hostnames within its listener so it can know if the request is valid?

@ggreenway
Copy link
Contributor

I think what @howardjohn is asking for is if a request for a known hostname but on the wrong filter chain arrives, respond with 421, and if a request for an unknown hostname arrives, respond with 404.

Assuming that is correct, this can be configured by adding routes (via VirtualHost) in each HCM (across all filter chains) for all known hostnames that are not correct for the current HCM, and responding with a 421 from those routes.

@keithmattix
Copy link
Contributor

Ahh I understand now; yes I think that's plausable

@ryanobjc
Copy link

Hey everyone, glad to see some movement and excitement on this issue - that's great!

As someone who is affected by this bug, this is my workaround: certs are single domains only.

So connection pooling is already limited since clients cannot pool across all the hostnames that are served by the same envoy/istio instance. The problem I ran into is I had some certs with N domains, and some with 1 and it was all intermixed on a single load balancer envoy. You might be saying "well why don't you use a wildcard then?" and I would say "I am using lets encrypt which discourages/makes wildcard certs harder to use."

So I have N certs for N subdomains all under .foo.bar.com and that is how I like it (maybe, but with certmanager tooling it certainly isn't oppressively hard to manage).

@ldemailly
Copy link

ldemailly commented Jul 18, 2024

letsencrypt rightfully has quotas on certs and if you do generate a lot of urls, say for a sandbox environment ($username-$servicename.sandbox.our.domain), using wildcard is actually what is recomended/needed to not block your 'real' certs... and that's where this breaks

@johnlanni
Copy link
Contributor

@keithmattix @lambdai @zhaohuabing @arkodg
I previously considered placing domains with the same certificate in the same filter chain and putting the VirtualHost configuration for each domain in the HCM's RDS. However, I later found that this was not a good approach, mainly because:

  1. Increased the management complexity of the control plane, for example, it is difficult to maintain a simple mapping relationship between Gateway resources (Gateway API standard) and filter chain.
  2. When the domain's certificate changes, it causes LDS to rebuild, resulting in downstream connections being disconnected.

I propose an idea that is currently implemented in the Higress fork of Envoy. Let's see if it's OK, I can submit this implementation as a PR:

  1. Based on the SRDS mechanism, it supports dividing the scope according to the :authority request header and extends this to support wildcard domain lookup (through iterative lookup).
  2. This way, all HCMs in the filter chain share the same SRDS configuration, decoupling the SNI matching process from the VirtualHost lookup process.
  3. Special handling is required for mTLS scenarios, which can be achieved by adding an allow_server_names configuration to the HCM to prevent access to mTLS-protected VirtualHosts using unexpected SNI.

This architecture is actually similar to the implementation in Nginx, where in Nginx, SNI is used first to find the server block to complete the TLS handshake, and then the Host request header is used to find the server block to apply the HTTP routing. And during this process, special handling is done for mTLS scenarios:
nginx/nginx@b720f65

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/http area/tls design proposal Needs design doc/proposal before implementation help wanted Needs help!
Projects
None yet
Development

Successfully merging a pull request may close this issue.