Load balancing multiple ECS service tasks with App Mesh #70
Comments
Can someone please confirm whether this relates to #47?
@shashanktomar Your observation is correct. It seems that because App Mesh configures the cluster with the LOGICAL_DNS discovery type, Envoy routes traffic only to the first IP returned by Route 53 DNS. #47 is not intended to address this behavior directly, but I believe that once App Mesh integrates with CloudMap service discovery (via Envoy's EDS discovery type) instead of LOGICAL_DNS, Envoy will be able to load-balance across each IP independently. I'll discuss this with the team tomorrow, because I'm unsure whether this is intended behavior.
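A toy model of the difference described above, assuming the DNS answer order does not change between lookups: with LOGICAL_DNS, Envoy treats the DNS name as one logical host and uses only the first address in the answer when it opens a connection, whereas with an EDS-style endpoint list every IP is a separate host that a round-robin balancer can pick per request. The IPs and function names are hypothetical, and this is a simplification rather than Envoy's implementation.

```python
from itertools import cycle

# Hypothetical Service B task IPs, in the order a DNS lookup returned them.
DNS_ANSWER = ["10.0.1.10", "10.0.2.20"]


def logical_dns_targets(answer: list[str], n_requests: int) -> list[str]:
    """LOGICAL_DNS-style: one logical host, reached via the first IP in the answer."""
    # Simplification: the answer order is assumed not to change between requests.
    return [answer[0] for _ in range(n_requests)]


def per_endpoint_round_robin(endpoints: list[str], n_requests: int) -> list[str]:
    """EDS-style: each IP is its own host, and each request goes to the next host in turn."""
    rr = cycle(endpoints)
    return [next(rr) for _ in range(n_requests)]


print(logical_dns_targets(DNS_ANSWER, 4))       # ['10.0.1.10', '10.0.1.10', '10.0.1.10', '10.0.1.10']
print(per_endpoint_round_robin(DNS_ANSWER, 4))  # ['10.0.1.10', '10.0.2.20', '10.0.1.10', '10.0.2.20']
```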
Route 53 DNS does round-robin its results. Is it possible that what you're seeing is the result of low load from Service A to Service B? One possible explanation is that the first connection added to the pool in Service A's Envoy is serving all requests to Service B, so requests always hit the same Service B task. Under some concurrent load, Service A's Envoy would need to create a new connection, which would begin load balancing across both Service B tasks.
Thanks for the clarification, @bcelenza. I will verify this behavior and get back to you.
@bcelenza, that was precisely the reason. Increasing the load so that requests ran concurrently caused connections to be opened to the second task. It may be worth mentioning in the docs that the load-balancing policy is not round-robin but spillover, even though that is an Envoy-internal concept.
We are evaluating a similar setup, but with an HTTP namespace. There are no DNS A records or Route 53 health checks; we rely on container-level health checks managed by ECS and HealthCheckCustomConfig. Since Route 53 is not involved in this scenario, we assume that Envoy will route traffic to the healthy nodes discovered with the CloudMap discover-instances API call. If multiple healthy nodes are registered to a CloudMap service, the order of instances in the discover-instances API response appears random. Is any load balancing done in Envoy, or is each call to the upstream nodes random? With App Mesh configured and the sidecar deployed, preliminary test results show that backend nodes are picked randomly for upstream requests (Service A --> Service B, with multiple Service B instances). It would be good to have the expected behavior documented.
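For reference, the discover-instances call mentioned above can be issued from Python with boto3's `servicediscovery` client, as in this sketch. The namespace and service names you pass are placeholders, and reading the task IP from the `AWS_INSTANCE_IPV4` attribute assumes the instances were registered by ECS service discovery. The helper keeps instances in the order the API returned them, so calling it a few times and comparing results is a quick way to observe the ordering described here.

```python
from typing import Any


def endpoints_from_response(response: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (InstanceId, IPv4) pairs from a DiscoverInstances response, in response order."""
    endpoints: list[tuple[str, str]] = []
    for inst in response.get("Instances", []):
        # Assumption: ECS service discovery stored the task IP as AWS_INSTANCE_IPV4.
        ip = inst.get("Attributes", {}).get("AWS_INSTANCE_IPV4")
        if ip is not None:
            endpoints.append((inst["InstanceId"], ip))
    return endpoints


def discover(namespace: str, service: str) -> list[tuple[str, str]]:
    """Ask CloudMap for the healthy instances of a service (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the parsing helper above works without boto3

    client = boto3.client("servicediscovery")
    response = client.discover_instances(
        NamespaceName=namespace,
        ServiceName=service,
        HealthStatus="HEALTHY",  # only instances whose health status is HEALTHY
    )
    return endpoints_from_response(response)
```

For example, `discover("example.local", "service-b")` (both names hypothetical) returns the task IPs in whatever order CloudMap sent them.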
@shreedharn, sorry for the late reply. App Mesh also uses the discover-instances API and returns the instances to Envoy as-is. All load balancing is currently handled by Envoy itself, so the randomness appears to come from that. Besides documenting this behavior, we'd like to make it behave in the way that makes the most sense for our customers. I'm worried we might also be re-randomizing the nodes every time nodes are added or removed, so I'll take a look at that as well. I'm not entirely sure Envoy handles this correctly.
@lavignes Thanks for the info. Could you also document the details of Envoy's default load-balancing algorithm? When we checked Envoy's config dump (envoy_eni_ip:9901/config_dump), there is an entry "lb_policy": "ORIGINAL_DST_LB" under dynamic_active_clusters. The Envoy documentation describes the original destination load balancer, but we are not sure how it works in the context of ECS, CloudMap and App Mesh. Is there any way to configure the load-balancing algorithm in Envoy? For instance, can we change it to round robin or weighted least request? The Envoy documentation in App Mesh is very limited, and it would be very helpful if it were expanded with these details.
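The `lb_policy` values can also be pulled out of the admin endpoint's `/config_dump` output programmatically. This sketch uses only the standard library; the admin address is a placeholder, and treating a missing `lb_policy` as `ROUND_ROBIN` assumes the dump omits fields left at their default (`ROUND_ROBIN` is the default for Envoy's `Cluster.lb_policy`).

```python
import json
import urllib.request
from typing import Any


def cluster_lb_policies(config_dump: dict[str, Any]) -> dict[str, str]:
    """Map each dynamic active cluster's name to its lb_policy in a /config_dump body."""
    policies: dict[str, str] = {}
    for section in config_dump.get("configs", []):
        # Only the ClustersConfigDump section has a dynamic_active_clusters list.
        for entry in section.get("dynamic_active_clusters", []):
            cluster = entry.get("cluster", {})
            # Assumption: a missing lb_policy means the default, ROUND_ROBIN.
            policies[cluster.get("name", "<unnamed>")] = cluster.get("lb_policy", "ROUND_ROBIN")
    return policies


def fetch_config_dump(admin_url: str) -> dict[str, Any]:
    """GET <admin_url>/config_dump, e.g. admin_url = "http://10.0.1.10:9901" (placeholder)."""
    with urllib.request.urlopen(f"{admin_url}/config_dump", timeout=5) as resp:
        return json.load(resp)
```

Run from inside the task's network, `cluster_lb_policies(fetch_config_dump("http://<envoy-ip>:9901"))` lists the policy Envoy is actually using for each cluster.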
We are configuring round-robin load balancing, and this is not yet configurable in the App Mesh API (feel free to open a feature request if you need it). Note that Envoy does not connect to a different endpoint for each request and round-robin across all endpoints. Instead, it maintains as few TCP connections as possible, only creating new ones when an existing connection is busy. Once the connections are open, requests are distributed round-robin. We will update the App Mesh docs with this; good point on highlighting the doc gap here, thank you!
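The behavior described in this reply can be illustrated with a small simulation. It is a model of the comment, not Envoy's connection-pool code, and it makes two assumptions: a new connection is opened only when every open connection is busy, and each new connection lands on the next Service B task in turn (as rotating DNS answers would produce). Task names are hypothetical.

```python
from collections import Counter
from itertools import cycle


class LazyPool:
    """Toy pool: reuse idle connections round-robin, open new ones only when all are busy."""

    def __init__(self, endpoints: list[str]) -> None:
        # Assumption: each newly opened connection lands on the next endpoint in turn.
        self._next_endpoint = cycle(endpoints)
        self.connections: list[str] = []  # the endpoint behind each open connection
        self._cursor = 0  # round-robin position over open connections

    def send_batch(self, concurrency: int) -> list[str]:
        """Send `concurrency` simultaneous requests and return the endpoint each one used."""
        busy: set[int] = set()
        hits: list[str] = []
        for _ in range(concurrency):
            conn = None
            n = len(self.connections)
            # Round-robin over open connections, skipping ones busy in this batch.
            for step in range(n):
                i = (self._cursor + step) % n
                if i not in busy:
                    conn = i
                    break
            if conn is None:
                # Every open connection is busy: spill over onto a new connection.
                self.connections.append(next(self._next_endpoint))
                conn = len(self.connections) - 1
            self._cursor = (conn + 1) % len(self.connections)
            busy.add(conn)
            hits.append(self.connections[conn])
        return hits  # all connections count as idle again once the batch completes


def simulate(batches: list[int], endpoints: list[str]) -> Counter:
    """Run a sequence of batches through a fresh pool and count hits per endpoint."""
    pool = LazyPool(endpoints)
    counts: Counter = Counter()
    for concurrency in batches:
        counts.update(pool.send_batch(concurrency))
    return counts


TASKS = ["task-1", "task-2"]  # hypothetical Service B tasks
print(simulate([1] * 6, TASKS))        # Counter({'task-1': 6})
print(simulate([2] + [1] * 4, TASKS))  # Counter({'task-1': 3, 'task-2': 3})
```

Under strictly sequential load a single connection is enough, so every request reaches `task-1`; one concurrent burst opens a second connection, after which requests alternate between the two tasks.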
We are trying out App Mesh. Our solution looks like the following:
Mesh Ingress (ALB) -> Service A -> Service B -> Mesh Egress -> External Service
In this setup, Service A and Service B run on Fargate with a desired count of 2 each. These ECS services have CloudMap service discovery configured. We have verified that traffic flows through the Envoy sidecar for both Service A and Service B.
When we hit the ALB for ingress, the ALB load-balances traffic across the Service A tasks, but traffic from Service A always hits the same single task in Service B. We are unable to load balance traffic from Service A's Envoy proxy to Service B. Following is our understanding of the problem so far:
It would be helpful to understand why this is not working for us, and also to get some idea of recommended practices around this pattern.