bind introspection to localhost #2588

Merged
merged 2 commits into from
Aug 26, 2020

yhlee-aws
Contributor

@yhlee-aws yhlee-aws commented Aug 24, 2020

Summary

Bind the introspection API to localhost.

Implementation details

Filter by localhost IP.

Testing

New tests cover the changes:

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@yhlee-aws yhlee-aws marked this pull request as ready for review August 25, 2020 23:18
@yhlee-aws yhlee-aws requested a review from a team August 25, 2020 23:18
@yhlee-aws yhlee-aws added this to the 1.44.2 milestone Aug 26, 2020
@yhlee-aws yhlee-aws merged commit 899e233 into aws:dev Aug 26, 2020
@leshik

leshik commented Sep 2, 2020

@yunhee-l this change breaks all my clusters, as I rely on ecs-agent port 51678 being accessible to monitoring tools from inside the VPC. Please revert this, or at least make it configurable!

@mamaremere

mamaremere commented Sep 2, 2020

@yunhee-l I second @leshik's request. I rely on the introspection API being available from inside the container to check the health of adjacent containers. I definitely need either a revert or a more configurable setup.

@bacoboy

bacoboy commented Sep 2, 2020

Same here. I use this port for load balancer health checks to know when a box is ready for ECS deploys. Access to this port is already restricted by security group rules. I second the request to revert this.
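
A readiness probe of the kind described here can be approximated with a plain TCP connect. This is a minimal sketch, not a load balancer's actual health-check logic: the function name `port_is_open` and the default timeout are illustrative assumptions, and only port 51678 comes from the thread.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

If the agent listens only on loopback, a call such as `port_is_open(instance_ip, 51678)` made against the instance's VPC address would be expected to return False, which is how a health check like this starts failing.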

@noma4i

noma4i commented Sep 2, 2020

@yunhee-l what bug does this PR fix?

@stedelahunty

I think this change has broken all the ECS task metadata endpoints from v1 to v4.

@yhlee-aws
Contributor Author

I'm sorry this caused disruption. We are working on more flexible handling of this; in the meantime, a workaround is to use an older version of the agent (v1.44.1 or older).

@yhlee-aws
Contributor Author

> I think this change has broken all the ECS task metadata endpoints from v1 to v4.

@stedelahunty our testing shows that the metadata endpoints were not impacted, but if they were for you, could you please provide more information?

@mmmeff

mmmeff commented Sep 2, 2020

This change is preventing cwagent from starting in my team's tasks and we're blocked from deploying any changes via CDK Pipelines for > 24 hours now.

2020-09-01T19:22:17Z W! retry [0/3], unable to get http response from http://10.0.190.23:51678/v1/tasks , error: unable to get response from http://10.0.190.23:51678/v1/tasks , error: Get "http://10.0.190.23:51678/v1/tasks ": dial tcp 10.0.190.23:51678: connect: connection refused

2020-09-01T19:22:17Z W! retry [1/3], unable to get http response from http://10.0.190.23:51678/v1/tasks , error: unable to get response from http://10.0.190.23:51678/v1/tasks , error: Get "http://10.0.190.23:51678/v1/tasks ": dial tcp 10.0.190.23:51678: connect: connection refused

2020-09-01T19:22:17Z W! retry [2/3], unable to get http response from http://10.0.190.23:51678/v1/tasks , error: unable to get response from http://10.0.190.23:51678/v1/tasks , error: Get "http://10.0.190.23:51678/v1/tasks ": dial tcp 10.0.190.23:51678: connect: connection refused

2020-09-01T19:22:17Z W! failing to call ecsagent taskinfo endpoint, error: unable to get response from http://10.0.190.23:51678/v1/tasks , error: Get "http://10.0.190.23:51678/v1/tasks ": dial tcp 10.0.190.23:51678: connect: connection refused
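
For readers reproducing this, the following is a rough sketch of a client that polls the `/v1/tasks` introspection path with a bounded number of retries, loosely modeled on the retry pattern visible in the log above. It is not cwagent's code; `fetch_tasks`, its parameters, and the message format are assumptions made for illustration.

```python
import json
import time
import urllib.request


def fetch_tasks(base_url: str, retries: int = 3, delay: float = 1.0,
                timeout: float = 2.0):
    """GET <base_url>/v1/tasks, retrying on connection errors.

    Returns the decoded JSON body, or None if every attempt fails.
    """
    url = base_url.rstrip("/") + "/v1/tasks"
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return json.load(resp)
        except OSError as exc:
            # urllib.error.URLError subclasses OSError, so refused
            # connections and timeouts both land here.
            print(f"retry [{attempt}/{retries}], no response from {url}: {exc}")
            if attempt < retries - 1:
                time.sleep(delay)
    return None
```

Calling `fetch_tasks("http://10.0.190.23:51678")` against an agent that only accepts loopback connections would exhaust its retries and return None, which corresponds to the final "failing to call ecsagent taskinfo endpoint" line.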

@yhlee-aws
Contributor Author

We are prioritizing impact mitigation: #2605

@stedelahunty

stedelahunty commented Sep 2, 2020

@yunhee-l
When we run 1.44.1 (and earlier versions going back a long way), hitting http://172.17.0.2:51678/v1/tasks from inside a container returns task metadata.

When we run 1.44.2, hitting the same endpoint returns java.net.ConnectException: Connection refused (Connection refused)

Our ecs-agent is started like this:
docker run -d --name=ecs-agent --restart=on-failure:10 --env-file=/etc/ecs/ecs.config --volume=/var/run/docker.sock:/var/run/docker.sock --volume=/var/log/ecs/:/log --volume=/var/lib/ecs/data:/data --volume=/sys/fs/cgroup:/sys/fs/cgroup:ro --volume=/var/run/docker/execdriver/native:/var/lib/docker/execdriver/native:ro --env=ECS_LOGFILE=/log/ecs-agent.log --env=ECS_DATADIR=/data --env=ECS_ENABLE_TASK_IAM_ROLE=true --env=ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true --env=ECS_NUM_IMAGES_DELETE_PER_CYCLE=5 --publish=0.0.0.0:51678:51678 --publish=0.0.0.0:51679:51679 amazon/amazon-ecs-agent:latest

ifconfig docker:

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:b7ff:fe24:9bd7  prefixlen 64  scopeid 0x20<link>
        ether 02:42:b7:24:9b:d7  txqueuelen 0  (Ethernet)
        RX packets 44309  bytes 9057693 (9.0 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 53253  bytes 43657353 (43.6 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

We have had the same configuration for a few years now with no issues; this change is the first time we've seen this behaviour.

In an attempt to work around the v1 failure, we also tried the other metadata endpoints listed here, and the same exception was thrown.

@petderek
Contributor

petderek commented Sep 2, 2020

@stedelahunty (and anyone else who sees this same failure) thank you for the explanation; the situation is a little clearer now. Running the agent in bridge mode would result in different behavior than what we tested. Reverting this commit will fix that issue for you as well. We'll report back here when it's done.

(In our optimized AMIs, we've been running with --net=host since around the time we introduced task networking, which required the agent to have a view of the host network. AFAIK, the v2-v4 APIs have always required --net=host.)
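
The PR's diff is not shown in this thread, so the following is only a simplified model of why a loopback bind would produce "connection refused" in the reports above. It assumes the change made the agent listen on 127.0.0.1 and that 172.17.0.2 is the agent container's bridge address; `listener_accepts` is a hypothetical helper, not agent code.

```python
import ipaddress


def listener_accepts(bind_host: str, dest_ip: str) -> bool:
    """Simplified model: would a TCP listener bound to bind_host accept a
    connection addressed to dest_ip within the same network namespace?"""
    bind = ipaddress.ip_address(bind_host)
    dest = ipaddress.ip_address(dest_ip)
    if bind.version != dest.version:
        return False
    # A wildcard bind (0.0.0.0 or ::) accepts traffic for any local address.
    if bind.is_unspecified:
        return True
    # Otherwise only traffic addressed to the bound address is accepted.
    return bind == dest


# Bridge mode: the published port is forwarded to the container's bridge
# address, so a loopback-only listener never sees the connection.
print(listener_accepts("0.0.0.0", "172.17.0.2"))     # True
print(listener_accepts("127.0.0.1", "172.17.0.2"))   # False
# Host networking: the VPC address is refused, loopback still works.
print(listener_accepts("127.0.0.1", "10.0.190.23"))  # False
print(listener_accepts("127.0.0.1", "127.0.0.1"))    # True
```

Under this model, both the bridge-mode failure and the VPC-address failure reported earlier follow from the same cause: nothing is listening on the address the client dials.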

@stedelahunty

@petderek Thank you for that information; we will look to migrate to using --net=host. Out of interest, is it on the roadmap to deprecate v1, or will it be kept around for a while yet?

@petderek
Contributor

petderek commented Sep 3, 2020

There are no plans to deprecate those endpoints. We've released an updated agent (AMIs and Docker Hub) with this change reverted, so you should be good to go. Please let us know if you have any difficulties.

@yimuniao

yimuniao commented Sep 21, 2020

> This change is preventing cwagent from starting in my team's tasks and we're blocked from deploying any changes via CDK Pipelines for > 24 hours now.
>
> 2020-09-01T19:22:17Z W! retry [0/3], unable to get http response from http://10.0.190.23:51678/v1/tasks , error: unable to get response from http://10.0.190.23:51678/v1/tasks , error: Get "http://10.0.190.23:51678/v1/tasks ": dial tcp 10.0.190.23:51678: connect: connection refused
>
> 2020-09-01T19:22:17Z W! retry [1/3], unable to get http response from http://10.0.190.23:51678/v1/tasks , error: unable to get response from http://10.0.190.23:51678/v1/tasks , error: Get "http://10.0.190.23:51678/v1/tasks ": dial tcp 10.0.190.23:51678: connect: connection refused
>
> 2020-09-01T19:22:17Z W! retry [2/3], unable to get http response from http://10.0.190.23:51678/v1/tasks , error: unable to get response from http://10.0.190.23:51678/v1/tasks , error: Get "http://10.0.190.23:51678/v1/tasks ": dial tcp 10.0.190.23:51678: connect: connection refused
>
> 2020-09-01T19:22:17Z W! failing to call ecsagent taskinfo endpoint, error: unable to get response from http://10.0.190.23:51678/v1/tasks , error: Get "http://10.0.190.23:51678/v1/tasks ": dial tcp 10.0.190.23:51678: connect: connection refused

Could you share your cwagent configuration and the task definition?
