
Connect with Envoy: gRPC over HTTPS in a new cluster only works when CONSUL_HTTP_SSL is true #7473

Closed
akhayyat opened this issue Mar 20, 2020 · 5 comments · Fixed by #7608
Labels
type/bug Feature does not function as expected

Comments

akhayyat commented Mar 20, 2020

Overview of the Issue

When using Connect with Envoy, Envoy connects to the Consul agent over HTTPS on its first attempt only when CONSUL_HTTP_SSL is true, even if CONSUL_HTTP_ADDR is set to an https:// address. After it has connected once (with CONSUL_HTTP_SSL enabled), it is able to connect again even without CONSUL_HTTP_SSL set.

Reproduction Steps

  1. Create a cluster with HTTP (8500), HTTPS (8501), and GRPC (8502) ports enabled, and with Connect enabled

  2. Define a service with a sidecar proxy configured to use CONSUL_HTTP_ADDR=https://127.0.0.1:8501 (without setting CONSUL_HTTP_SSL)

  3. Envoy fails to connect to the Consul agent:

    Agent log:

     agent: grpc: Server.Serve failed to complete security handshake from "127.0.0.1:50030": tls: first record does not look like a TLS handshake
    

    Envoy log:

     router decoding headers:
     ':method', 'POST'
     ':path', '/envoy.service.discovery.v2.AggregatedDiscoveryService/StreamAggregatedResources'
     ':authority', 'local_agent'
     ':scheme', 'http'
     'te', 'trailers'
     'content-type', 'application/grpc'
     'x-consul-token', ''
     'x-envoy-internal', 'true'
     'x-forwarded-for', '192.168.121.186'
     ...
     Unknown error code 104 details Connection reset by peer
    

When CONSUL_HTTP_SSL is set to true, the ':scheme', 'http' line becomes ':scheme', 'https'.

Here is a full Ansible playbook that reproduces this problem:
https://gist.github.com/akhayyat/4a6a5718425ac4addfef3fa9bb932c65
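
As a side note, here is a minimal Go sketch (my own illustration, not part of the playbook) that prints what the Consul API client derives from these environment variables, using only api.DefaultConfig from github.com/hashicorp/consul/api:

package main

import (
    "fmt"

    "github.com/hashicorp/consul/api"
)

func main() {
    // api.DefaultConfig reads CONSUL_HTTP_ADDR, CONSUL_HTTP_SSL, and the other
    // CONSUL_HTTP_* variables to build the client configuration.
    cfg := api.DefaultConfig()

    // With only CONSUL_HTTP_ADDR=https://127.0.0.1:8501 exported, the docs
    // suggest TLS should be implied; printing Address and Scheme shows what
    // the client actually resolved from the environment.
    fmt.Printf("address=%q scheme=%q\n", cfg.Address, cfg.Scheme)
}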

Consul info for both Client and Server

Client info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 2
        services = 2
build:  
        prerelease =
        revision = 2cf0a3c8
        version = 1.7.1
consul: 
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 192.168.121.170:8300
        server = true
raft:   
        applied_index = 61
        commit_index = 61
        fsm_pending = 0
        last_contact = 21.84253ms
        last_log_index = 61
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:61915f30-6707-76d2-8ab2-d8f5ab82e1be Address:192.168.121.186:8300} {Suffrage:Voter ID:3919f158-7d07-ddfd-530e-8b048e5aff4f Address:192.168.121.170:8300} {Suffrage:Voter ID:41f37e2a-bda7-f15a-b714-2cfbd8193f72 Address:192.168.121.134:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 2
runtime:
        arch = amd64
        cpu_count = 1
        goroutines = 104
        max_procs = 1
        os = linux  
        version = go1.13.7
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 2
        failed = 0  
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 6
        members = 3 
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0  
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 5
        members = 3 
        query_queue = 0
        query_time = 1
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 2  
        services = 2
build:
        prerelease =
        revision = 2cf0a3c8
        version = 1.7.1
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 192.168.121.170:8300
        server = true
raft:
        applied_index = 73
        commit_index = 73
        fsm_pending = 0
        last_contact = 38.422636ms
        last_log_index = 74
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:61915f30-6707-76d2-8ab2-d8f5ab82e1be Address:192.168.121.186:8300} {Suffrage:Voter ID:3919f158-7d07-ddfd-530e-8b048e5aff4f Address:192.168.121.170:8300} {Suffrage:Voter ID:41f37e2a-bda7-f15a-b714-2cfbd8193f72 Address:192.168.121.134:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 2
runtime:
        arch = amd64
        cpu_count = 1
        goroutines = 104
        max_procs = 1
        os = linux
        version = go1.13.7
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 6
        members = 3 
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0  
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 5
        members = 3 
        query_queue = 0
        query_time = 1

Operating system and Environment details

Debian 10, amd64.

jsosulska (Contributor) commented

Hi @akhayyat,
This looks to be an actual bug; the behavior is not working as expected. Tagging for the team to look through :)

Thank you for finding this!

@jsosulska jsosulska added the type/bug Feature does not function as expected label Apr 7, 2020
dnephin (Contributor) commented Apr 7, 2020

I had a quick look at the code, and here is what I found:

// Decide on TLS if the scheme is provided and indicates it, if the HTTP env
// suggests TLS is supported explicitly (CONSUL_HTTP_SSL) or implicitly
// (CONSUL_HTTP_ADDR) is https://
useTLS := false
if strings.HasPrefix(strings.ToLower(c.grpcAddr), "https://") {
    useTLS = true
} else if useSSLEnv := os.Getenv(api.HTTPSSLEnvName); useSSLEnv != "" {
    if enabled, err := strconv.ParseBool(useSSLEnv); err == nil {
        useTLS = enabled
    }
} else if strings.HasPrefix(strings.ToLower(httpCfg.Address), "https://") {
    useTLS = true
}

c.grpcAddr will default to the value of this environment variable:

consul/api/api.go, lines 69 to 73 at 32daa2b:

// GRPCAddrEnvName defines an environment variable name which sets the gRPC
// address for consul connect envoy. Note this isn't actually used by the api
// client in this package but is defined here for consistency with all the
// other ENV names we use.
GRPCAddrEnvName = "CONSUL_GRPC_ADDR"

So the scheme is being set to https if CONSUL_GRPC_ADDR has an https address, but not CONSUL_HTTP_ADDR.

This might be the correct behaviour. I believe CONSUL_GRPC_ADDR is the address used by envoy to connect. When CONSUL_GRPC_ADDR is not set, it looks like we default c.grpcAddr to localhost:PORT:

if c.grpcAddr == "" {
port, err := c.lookupGRPCPort()
if err != nil {
c.UI.Error(fmt.Sprintf("Error connecting to Consul agent: %s", err))
}
if port <= 0 {
// This is the dev mode default and recommended production setting if
// enabled.
port = 8502
c.UI.Info(fmt.Sprintf("Defaulting to grpc port = %d", port))
}
c.grpcAddr = fmt.Sprintf("localhost:%v", port)
}
.

Maybe in this case we need to be defaulting it to an address with a scheme, so that the subsequent checks detect https correctly. I'm not yet sure how we would find the scheme at this place, but I will have another look.
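
For what it's worth, here is a hypothetical variation of the check quoted above (a sketch only, not necessarily what #7608 ends up doing): if httpCfg is the api.Config the command builds, it already carries a Scheme field, so the fallback could consult that rather than rely on an https:// prefix still being present in httpCfg.Address:

// Sketch only: prefer the scheme the API client config resolved. The
// assumption here is that httpCfg.Address may no longer carry its https://
// prefix by the time this check runs.
useTLS := false
switch {
case strings.HasPrefix(strings.ToLower(c.grpcAddr), "https://"):
    useTLS = true
case os.Getenv(api.HTTPSSLEnvName) != "":
    if enabled, err := strconv.ParseBool(os.Getenv(api.HTTPSSLEnvName)); err == nil {
        useTLS = enabled
    }
case httpCfg.Scheme == "https" ||
    strings.HasPrefix(strings.ToLower(httpCfg.Address), "https://"):
    useTLS = true
}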

akhayyat (Author) commented Apr 7, 2020

Thanks for looking into this issue.

What I considered buggy is that this behavior is inconsistent with the documentation of the CONSUL_HTTP_ADDR environment variable, which states:

If the https:// scheme is used, CONSUL_HTTP_SSL is implied to be true.

In my case, with CONSUL_HTTP_ADDR set to use the https:// scheme, the behavior differed depending on whether CONSUL_HTTP_SSL was also set, so HTTPS was not exactly implied.
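
To spell out the expectation from the docs as code, here is a hypothetical helper (not anything in the Consul source) that encodes that precedence, with an explicit CONSUL_HTTP_SSL taking priority in the same order as the snippet dnephin quoted:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// wantTLS illustrates the documented expectation: an explicit CONSUL_HTTP_SSL
// value wins, otherwise an https:// address in CONSUL_HTTP_ADDR implies TLS
// on its own, with no extra variable required.
func wantTLS(httpAddr, httpSSL string) bool {
    if enabled, err := strconv.ParseBool(httpSSL); err == nil {
        return enabled
    }
    return strings.HasPrefix(strings.ToLower(httpAddr), "https://")
}

func main() {
    fmt.Println(wantTLS("https://127.0.0.1:8501", ""))    // true: implied by the scheme alone
    fmt.Println(wantTLS("http://127.0.0.1:8500", "true")) // true: explicitly enabled
}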

dnephin (Contributor) commented Apr 7, 2020

You are correct, there is definitely a bug here.

spuder (Contributor) commented Jun 8, 2020

I enabled TLS in my Nomad/Consul clusters, and all of the Nomad agents started having this issue.

failed to complete security handshake from "127.0.0.1

The only way I was able to resolve the issue was by rebooting the VM.

I also had to ensure the following is defined in the Nomad config:


  tls:
    http: false
    rpc: true
    rpc_upgrade_mode: true
