Load balancing ecs service multiple tasks with app mesh #70

Closed
shashanktomar opened this issue May 17, 2019 · 9 comments

@shashanktomar

We are trying out app-mesh. Our solution looks like the following:

Mesh Ingress (ALB) -> Service A -> Service B -> Mesh Egress -> External Service

In this setup, Service A and Service B are running over fargate with desired-count of 2 for each. These ECS services have cloud-map configuration enabled. We have verified that the traffic flows through envoy sidecar for both Service A and Service B.

When we hit the ALB for ingress, the traffic is load-balanced across Service A by the ALB, but the traffic from Service A always hits the same single task in Service B. We are unable to load-balance traffic from the Service A envoy proxy to Service B. Following is our understanding of the problem so far:

  • As each service has multiple tasks running, all of the tasks get registered to Cloud Map service registry under the same DNS name
  • This creates a couple of DNS A Records in Route 53 per service. It's worth mentioning that the ECS Service Discovery Options always creates a multi-value DNS record in Route53 if you don't select an existing service discovery name.
  • We can see in the envoy admin config dump for Service A that the cluster config for Service B has LOGICAL_DNS as its type
  • Now when the Service A envoy proxy receives an egress request for Service B, it resolves the DNS name from Route53 and uses the first IP in the list (as documented for the LOGICAL_DNS config in the envoy documents). This single IP of Task-1 is always targeted for outgoing requests, so Task-2 of Service B never receives a request.
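The two points about multi-value records and the "first IP" rule can be checked from inside a task with a small script. This is a minimal sketch, not Envoy's resolver: `resolve_ipv4` and `logical_dns_target` are hypothetical helper names, and the local resolver may cache or reorder answers, so the order seen here is not guaranteed to match what Envoy sees.

```python
import socket


def resolve_ipv4(hostname, port=80, resolver=socket.getaddrinfo):
    """Return the IPv4 addresses for hostname, in the order the resolver gave them."""
    infos = resolver(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:  # drop duplicates, keep resolver order
            addresses.append(ip)
    return addresses


def logical_dns_target(addresses):
    """Mimic the documented LOGICAL_DNS rule: a new connection uses the first address."""
    return addresses[0] if addresses else None
```

Calling `resolve_ipv4("service-b.example.local")` a few times (the hostname is a placeholder for your Cloud Map DNS name) and passing each result to `logical_dns_target` shows whether the first answer changes between lookups.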

It would be helpful to understand why it is not working for us, and also to get some idea of the recommended practices around this pattern.

@shashanktomar shashanktomar changed the title [Question] Load balancing ecs service multiple tasks with app mesh Load balancing ecs service multiple tasks with app mesh May 23, 2019
@shashanktomar
Author

Can someone please confirm whether this relates to #47?

@lavignes

@shashanktomar Your observation is correct here. It seems that, because App Mesh configures the cluster to use the LOGICAL_DNS discovery type, Envoy will only route traffic to the first IP returned by Route53's DNS.

#47 is not intended to address this behavior directly but I believe when App Mesh integrates with CloudMap's Service Discovery (via Envoy's EDS Service Discovery type) rather than LOGICAL_DNS, Envoy would be able to load-balance each IP independently.

I'll discuss this with the team tomorrow, because I'm unsure if this is intended behavior.

@bcelenza
Contributor

Route 53 DNS does round-robin its results. Is it possible what you're seeing is the result of low load from Service A to Service B? One possible answer might be that the first connection added to the pool in Service A's Envoy is serving all requests to Service B, always resulting in requests hitting the same task for Service B.

With some concurrent load, Service A's Envoy would need to create a new connection, which would begin load balancing to both tasks for Service B.
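The effect described above can be reproduced with a toy model. This is only a sketch under stated assumptions, not Envoy's connection pool: `PoolSketch`, `run_sequential`, and `run_concurrent` are hypothetical names, and each new connection simply takes the next endpoint in turn to stand in for rotated DNS answers.

```python
from collections import Counter
from itertools import cycle


class PoolSketch:
    """Toy model: reuse an idle connection, open a new one only when all are busy."""

    def __init__(self, endpoints):
        # Assumption: every NEW connection lands on the next endpoint in turn.
        self._next_endpoint = cycle(endpoints)
        self.idle = []          # endpoints of connections that are open and free
        self.hits = Counter()   # requests served per endpoint

    def acquire(self):
        endpoint = self.idle.pop() if self.idle else next(self._next_endpoint)
        self.hits[endpoint] += 1
        return endpoint

    def release(self, endpoint):
        self.idle.append(endpoint)


def run_sequential(pool, n):
    """Each request finishes before the next one starts."""
    for _ in range(n):
        conn = pool.acquire()
        pool.release(conn)


def run_concurrent(pool, n, in_flight):
    """Send n requests in batches of `in_flight` overlapping requests."""
    sent = 0
    while sent < n:
        batch = [pool.acquire() for _ in range(min(in_flight, n - sent))]
        sent += len(batch)
        for conn in batch:
            pool.release(conn)
```

With two endpoints, `run_sequential` keeps reusing the single open connection, so every request reaches the same task, while `run_concurrent` with two requests in flight opens a second connection and both tasks receive traffic.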

@shashanktomar
Author

Thanks for the clarification @bcelenza. I will verify this behavior and get back to you.

@shashanktomar
Author

shashanktomar commented May 24, 2019

@bcelenza, that was precisely the reason why it happened. Increasing the load and sending concurrent requests caused connections to be opened to the second task. Maybe it's worth mentioning in the docs that the load-balancing policy is not round-robin but spillover, even though it's an envoy internal concept.

@shreedharn

shreedharn commented Aug 21, 2019

We are evaluating a similar setup, but with an http-namespace. There are no DNS A records or Route53 health checks; we rely on container-level health checks managed by ECS and HealthCheckCustomConfig. Since Route53 is not involved in this scenario, we assume that envoy routes traffic to the healthy nodes discovered with the CloudMap discover-instances API call. If multiple healthy nodes are registered to a CloudMap service, the order of instances in the discover-instances API response appears random. Is any load balancing done by envoy, or is the choice of upstream node random? With AppMesh configured and the sidecar deployed, preliminary test results show that backend nodes are picked randomly for upstream requests (Service A --> Service B with multiple Service B instances). It would be good to have the expected behavior documented.
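For anyone who wants to check the ordering of the discover-instances response themselves, here is a minimal sketch. It assumes a boto3 `servicediscovery` client is passed in and that instances carry the `AWS_INSTANCE_IPV4` attribute, as ECS-registered instances do; the helper names and the namespace and service names in the usage note are hypothetical.

```python
from collections import Counter


def healthy_ipv4_addresses(client, namespace, service):
    """Return the IPv4 addresses of healthy instances, in response order."""
    response = client.discover_instances(
        NamespaceName=namespace,
        ServiceName=service,
        HealthStatus="HEALTHY",
    )
    return [
        inst["Attributes"]["AWS_INSTANCE_IPV4"]
        for inst in response.get("Instances", [])
        if "AWS_INSTANCE_IPV4" in inst.get("Attributes", {})
    ]


def first_instance_counts(client, namespace, service, calls=20):
    """Count how often each address appears first across repeated calls."""
    counts = Counter()
    for _ in range(calls):
        addresses = healthy_ipv4_addresses(client, namespace, service)
        if addresses:
            counts[addresses[0]] += 1
    return counts
```

A typical call would be `first_instance_counts(boto3.client("servicediscovery"), "my-namespace", "service-b")`. Roughly even counts would suggest the API rotates or shuffles the order, but this only describes the API response and says nothing about which instance Envoy picks.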

@lavignes

@shreedharn sorry for the late reply. App Mesh also uses the discover-instances API and returns the instances as-is to Envoy. All load balancing is currently handled by Envoy itself, so this randomness appears to be due to that. Besides documenting this behavior, we'd like to make this behave in a way that makes the most sense for our customers.

I'm worried we might also be re-randomizing the nodes every time nodes are added or removed. I'll take a look at that as well. I'm not entirely sure Envoy handles this correctly.

@shreedharn

@lavignes Thanks for the info. Can you also document the details of envoy's default load-balancing algorithm? When we checked the envoy config dump (envoy_eni_ip:9901/config_dump), there was an entry "lb_policy": "ORIGINAL_DST_LB" under dynamic_active_clusters. The envoy documentation has a page describing the original destination LB, but we are not sure how it works in the context of ECS, CloudMap and AppMesh. Is there any way we can configure the load-balancing algorithm in envoy? For instance, can we change it to round robin or weighted least request? The Envoy section of the AppMesh documentation is very limited. It would be very helpful if it could be expanded with these details.
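The check described above can be scripted. This is a minimal sketch that assumes the Envoy admin endpoint is reachable on port 9901 as in the comment above and that each entry under dynamic_active_clusters wraps a "cluster" object; the layout of the dump varies between Envoy versions, so the walker searches the whole document rather than relying on a fixed path. `fetch_config_dump` and `cluster_policies` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def fetch_config_dump(admin_url, timeout=5):
    """Download and parse the admin config dump, e.g. admin_url='http://<envoy_eni_ip>:9901'."""
    with urlopen(f"{admin_url}/config_dump", timeout=timeout) as resp:
        return json.load(resp)


def cluster_policies(node, found=None):
    """Walk the dump and collect {cluster name: (type, lb_policy)}."""
    found = {} if found is None else found
    if isinstance(node, dict):
        for entry in node.get("dynamic_active_clusters", []):
            cluster = entry.get("cluster", {})
            if "name" in cluster:
                # A missing field is reported as None; it may simply mean
                # the field was left at its default and omitted from the dump.
                found[cluster["name"]] = (cluster.get("type"), cluster.get("lb_policy"))
        for value in node.values():
            cluster_policies(value, found)
    elif isinstance(node, list):
        for item in node:
            cluster_policies(item, found)
    return found
```

Running `cluster_policies(fetch_config_dump("http://<envoy_eni_ip>:9901"))` from a host that can reach the sidecar lists the discovery type and load-balancing policy per cluster, which makes it easier to compare tasks or to spot changes after an App Mesh update.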

@shubharao

We are configuring round-robin load balancing, and this is not yet configurable in the App Mesh API (feel free to open a feature request for it if you need it). The way it works is that Envoy does not open a connection to a different endpoint for each request in order to balance across all endpoints round-robin. Instead, it maintains as few TCP connections as possible, only creating new ones when an existing connection is busy. Once the connections are open, requests are distributed round-robin. We will update the App Mesh docs with this. Good point on highlighting the doc gap here, thank you!


5 participants