Getting: upstream connect error or disconnect/reset before headers. reset reason: connection termination #196

Closed
awsiv opened this issue May 10, 2020 · 15 comments
Assignees
Labels
Bug Something isn't working Docs

Comments

@awsiv

awsiv commented May 10, 2020

Summary
Getting the following error on a call to the backend service:

upstream connect error or disconnect/reset before headers. reset reason: connection termination

Error in the app1 log:

 Get http://app2.test/pong:  dial tcp: lookup app2.test on xx.xx.xx.xx: no such host

envoy /clusters

cds_egress_stage_app2_http_9999::default_priority::max_connections::1024
cds_egress_stage_app2_http_9999::default_priority::max_pending_requests::1024
cds_egress_stage_app2_http_9999::default_priority::max_requests::1024
cds_egress_stage_app2_http_9999::default_priority::max_retries::3
cds_egress_stage_app2_http_9999::high_priority::max_connections::1024
cds_egress_stage_app2_http_9999::high_priority::max_pending_requests::1024
cds_egress_stage_app2_http_9999::high_priority::max_requests::1024
cds_egress_stage_app2_http_9999::high_priority::max_retries::3
cds_egress_stage_app2_http_9999::added_via_api::true
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::cx_active::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::cx_connect_fail::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::cx_total::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::rq_active::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::rq_error::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::rq_success::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::rq_timeout::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::rq_total::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::hostname::
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::health_flags::healthy
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::weight::1
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::region::
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::zone::
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::sub_zone::
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::canary::false
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::priority::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::success_rate::-1
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::local_origin_success_rate::-1
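
For anyone reading similar output, here is a minimal sketch of pulling the per-host counters out of /clusters text. It assumes the `cluster::host:port::stat::value` layout shown above and IPv4 endpoints. `parseClusters` and `hostStats` are hypothetical names for illustration, not part of Envoy or App Mesh.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// hostStats maps "cluster/host:port" to that host's stat name/value pairs.
type hostStats map[string]map[string]string

// parseClusters keeps only per-host lines of the form
// cluster::host:port::stat::value and skips cluster-level lines such as
// default_priority::max_connections or added_via_api.
func parseClusters(text string) hostStats {
	out := hostStats{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		parts := strings.Split(strings.TrimSpace(sc.Text()), "::")
		if len(parts) != 4 {
			continue
		}
		// A host field must look like host:port; "default_priority" does not.
		if _, _, err := net.SplitHostPort(parts[1]); err != nil {
			continue
		}
		key := parts[0] + "/" + parts[1]
		if out[key] == nil {
			out[key] = map[string]string{}
		}
		out[key][parts[2]] = parts[3]
	}
	return out
}

func main() {
	// A few lines copied from the output above.
	sample := `cds_egress_stage_app2_http_9999::default_priority::max_connections::1024
cds_egress_stage_app2_http_9999::added_via_api::true
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::cx_total::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::rq_total::0
cds_egress_stage_app2_http_9999::XX.XX.XX.XX:9999::health_flags::healthy`

	for key, s := range parseClusters(sample) {
		fmt.Printf("%s health=%s cx_total=%s rq_total=%s\n",
			key, s["health_flags"], s["cx_total"], s["rq_total"])
	}
}
```

A host that is healthy but has cx_total and rq_total at 0 suggests Envoy knows the endpoint and has not opened a connection to it, which would be consistent with the request failing in the application before it reaches the proxy.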

Steps to Reproduce

Setup:

  • Apps are running on ECS/fargate
  • Cloud map is configured to be API only
  • app1 has app2 as a backend node in appmesh
  • both apps have single route and single node with listener (http:8080)
  • /clusters shows that the backend node is discovered correctly
  • app1 ENV: APP2_ENDPOINT | app2.test
  • fargate version (1.3.0 - LATEST)
  • envoy version: "image": "840364872350.dkr.ecr.us-east-1.amazonaws.com/aws-appmesh-envoy:v1.12.2.1-prod",

Configure 2 apps on appmesh as mentioned above

app1.go

package main

import (
	"errors"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"strings"
)

func health(w http.ResponseWriter, req *http.Request) {
	log.Printf("Endpoint=/health")
	fmt.Fprintf(w, "OK")
}

func getEndpoint(envVarName string) (string, error) {
	app2Endpoint := os.Getenv(envVarName)
	if app2Endpoint == "" {
		errStr := fmt.Sprintf("%s is not set", envVarName)
		log.Fatalf(errStr)
		return "", errors.New(errStr)
	}
	return app2Endpoint, nil
}

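func pingRemote(w http.ResponseWriter, req *http.Request) {
	log.Printf("Endpoint=/ping")

	app2, err := getEndpoint("APP2_ENDPOINT")
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	log.Printf("Remote Endpoint=%s", app2)

	resp, err := http.Get(fmt.Sprintf("http://%s/pong", app2))
	if err != nil {
		// resp is nil on error, so log and return instead of dereferencing it.
		log.Printf("%v", err)
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	ret := strings.TrimSpace(string(body))
	fmt.Fprintf(w, `{"ping":"%s"}`, ret)
}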

func main() {
	http.HandleFunc("/health", health)
	http.HandleFunc("/ping", pingRemote)

	log.Printf("Listening on port 8080")
	// ListenAndServe only returns on failure, so surface the error.
	log.Fatal(http.ListenAndServe(":8080", nil))
}

app2.go

package main

import (
	"fmt"
	"log"
	"net/http"
)

func pong(w http.ResponseWriter, req *http.Request) {
	log.Printf("Endpoint=/pong")
	log.Printf("RequestURI=%s", req.RequestURI)
	fmt.Fprintf(w, "pong")
}

func health(w http.ResponseWriter, req *http.Request) {
	log.Printf("Endpoint=/health")
	fmt.Fprintf(w, "OK")
}

func main() {
	log.Printf("app2: start")

	http.HandleFunc("/health", health)
	http.HandleFunc("/pong", pong)

	log.Printf("Listening on port 8080")
	// ListenAndServe only returns on failure, so surface the error.
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Are you currently working around this issue?
No workaround yet

Additional info

I think Fargate 1.3.0 still uses the Docker daemon. Could it be that it's not configured with the proper proxy settings (https://docs.docker.com/config/daemon/systemd/#httphttps-proxy)?

EDIT: same issue with fargate 1.4.0 as well

@awsiv awsiv added the Bug Something isn't working label May 10, 2020
@bigdefect
Contributor

Based on the available information, I think at least part of the problem is that app1 is making a DNS call for a name that won't exist since you're in API-only mode. Envoy currently does not intercept DNS traffic (I believe we're working on the DNS filter).

A workaround is to have a dummy entry in /etc/hosts for the endpoint, so the request is at least made to envoy (which has the cloudmap endpoints and can then route it). Can you try a namespace with private dns as well and see if the behavior changes?

@awsiv
Author

awsiv commented May 12, 2020

@efe-selcuk My first test was indeed with API + private DNS, and it worked fine there (sorry, I should have mentioned it in my original post). I will test it again just to confirm.

BTW: editing /etc/hosts seems to be broken in fargate 1.4.0 see: aws/containers-roadmap#886

EDIT: @efe-selcuk Confirmed to work with API+private DNS namespace and services with

  dns_config {
    namespace_id   = "ns-xxx"
    routing_policy = "MULTIVALUE"
    dns_records {
      ttl  = 10
      type = "A"
    }
  }

@bigdefect
Contributor

bigdefect commented May 12, 2020

Thanks for linking the fargate issue, I'll pass this onto my team as this affects troubleshooting.

Glad that private dns solved it. Given that it works with private DNS, do you still have an open issue? Outside of the general issue of requiring DNS even when Cloud Map is enabled, that is.

Either way I'll leave this open to see if our dataplane team needs any further info.

@lavignes

I'll also link the Envoy DNS filter tracking issue, envoyproxy/envoy#6748, if you are interested in tracking the work going on there. We are indeed actively working on changes to the upstream Envoy project that will make these DNS entries unnecessary by having the proxy intercept and resolve these DNS queries without the need for hacking /etc/hosts, or setting up public or private DNS namespaces.

@awsiv
Author

awsiv commented May 13, 2020

@efe-selcuk @lavignes Testing out API-only was the decision we made due to the following limitations with PrivateDNS based discovery with AppMesh and Cloudmap. I'm open to workarounds/explanations/corrections for these:

  1. API calls (DiscoverInstances) are generally faster than DNS lookups, so any updates to the services will be available faster (https://aws.amazon.com/blogs/architecture/new-application-integration-with-aws-cloud-map-for-service-discovery/)
  2. Cloud Map returns at most 8 entries in DNS mode (https://docs.amazonaws.cn/en_us/AmazonECS/latest/developerguide/service-discovery.html)
  3. App Mesh uses spillover load balancing instead of round-robin until it has established connections to all endpoints? (Load balancing ecs service multiple tasks with app mesh #70 (comment))
  4. Additional attributes are not supported (https://aws.amazon.com/blogs/architecture/new-application-integration-with-aws-cloud-map-for-service-discovery/)
  5. Java apps cache DNS resolutions; however, since they are fronted by Envoy, I suppose this is not an issue? (https://aws.amazon.com/blogs/architecture/new-application-integration-with-aws-cloud-map-for-service-discovery/)

Currently it appears we cannot create services without DnsConfig with private dns although it says "API and private VPC" in the Cloudmap console.

Error: InvalidInput: When you create a service using a namespace that has a type of DNS_PRIVATE, you must include a DnsConfig element.

image

@awsiv
Author

awsiv commented May 13, 2020

FYI for people using Go: it defaults to DNS resolution if /etc/nsswitch.conf is missing, completely ignoring /etc/hosts

golang/go#35305

@bigdefect
Contributor

bigdefect commented May 13, 2020

To be clear, if you configure Cloud Map service discovery on your virtual node, App Mesh will use the API to discover endpoints, not relying on DNS. The suggestion to also use the private dns mode is purely to make sure that ECS+Cloud Map is populating DNS as a workaround for the application DNS queries. So, I believe that addresses 2, 3, and 4?

You could also set up your infrastructure such that you set your own dummy entries in Route 53 (what you would have done in /etc/hosts). That also has the benefit of being agnostic to compute platform. We know this is painful regardless of which way you work around the DNS issue.

Re: 1, you do of course incur the overhead of those DNS queries from the application, which the hosts file hack helps address, and similarly for 5, dns caching will help reduce the impact.

@awsiv
Author

awsiv commented May 13, 2020

Thanks, that clarifies some of my concerns. A private DNS namespace should work fine for now (we kinda have to live with the DNS lookups and the overhead that comes with them). However, at some point we would like to move to an API-only namespace - all those Route53 entries trigger my OCD ;)

I suppose to migrate apps between cloudmap namespaces, we just have to do the following?

  1. Have apps register to both API only and private DNS namespace for migration
  2. Update apps to use the new namespace AND switch appmesh virtual service to the new API only namespace
  3. Remove private dns registration

Sounds like a bit of work if this is the case. Would be nice if we could register services in a private DNS namespace without the DNS config (since it supports both API and DNS calls)?

@bigdefect
Contributor

bigdefect commented May 15, 2020

Correct. There are multiple ways to accomplish the migration, but as you've written it, you'd want to break up no. 2 into distinct steps to avoid inconsistency.

Though, you don't necessarily have to modify your downstream applications, depending on how your mesh and DNS/namespaces are structured. It's possible to keep that stable virtual service name, since the nodes behind it independently specify their own service discovery; i.e., just point your virtual service to different virtual nodes. What's actually in those route53 entries effectively doesn't matter, in this case, so it's ok if they go stale (considering TTLs of course).

@dastbe
Contributor

dastbe commented Jun 10, 2020

@dastbe dastbe closed this as completed Jun 10, 2020
@aruandre

> Envoy currently does not intercept DNS traffic (I believe we're working on the DNS filter).

@efe-selcuk has this changed? We are facing the same issue with similar setup

@bigdefect
Contributor

@aruandre This is still the behavior. I believe we are ramping up on investigations/design and such, but I can't speak to progress or timelines. I'm sure a feature request issue will pop up, or honestly feel free to create one in the queue asking for it. If an issue exists that I don't know about, an engineer will redirect.

@aruandre

@efe-selcuk that's disappointing. Took us quite some time troubleshooting and wondering why we can't make it work with AppMesh, finally discovering this issue.
I'm wondering what is the use case for HttpNamespace then since it is not usable with AppMesh right now?

@bigdefect
Contributor

@aruandre Yeah, it's a long standing pain point we'd like to resolve. I believe API-only namespace is usable, but you would need to work around the application DNS issue yourself, either by creating the dummy records in route53 or on your platform (e.g. hosts file, assuming whatever you're using supports it).

@rajal-amzn
Contributor

@aruandre So, IIUC, you are looking at AppMesh to manage the DNS requests without having to create Route53 records. This is being tracked in this issue. We are currently working on this and expect to see some updates.
