Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aws: Add support for ECS (Container Service) #1803

Merged
merged 4 commits into from
Jun 5, 2015

Conversation

radeksimko
Copy link
Member

This is first working concept, includes a basic set of acceptance tests + all docs.

Example

https://github.com/TimeIncOSS/tf_aws_ecs

Caveats

  • Tests will fail because aws: Refactor credentials according to latest aws-sdk-go refactoring #1802 , I tested everything w/ aws/aws-sdk-go@876a0d5
  • ECS Cluster destruction sometimes (race condition I guess) ends up with the following errors
    • ClientException: The Cluster cannot be deleted while Container Instances are active.
    • ClientException: The Cluster cannot be deleted while Services are active.
    • I'm not sure how/whether to define any relationship between the container instances (ASG) and services
  • ECS Task Definition
    • There's currently no way to remove any task definitions that have been created, therefore running acceptance tests will leave a few definitions in your AWS account. There's no way around this.
    • I'm using id as a place to save the Task Definition ARN ( = arn:aws:ecs:us-west-2:01234567890:task-definition/mongodb:3 where 3 is revision bumped each time you update it), which makes updates two-phased => need to run apply twice, first to update the TD, second to update the reference in the Service. I'm not sure what's the best way to approach this problem. - using family instead + exporting arn to keep referencing easy
  • ECS Service
    • API allows task_definition to be in two compatible formats - family:revision or full ARN, but full ARN is always returned as response from the API, which makes it confusing for updates. Therefore family:revision format doesn't really work well (it causes update-all-the-time). - solved w/ ARN builder similar to the RDS one

Test plan

$ make testacc TEST=./builtin/providers/aws TESTARGS='-run=Ecs' 2>/dev/null
go generate ./...
TF_ACC=1 go test ./builtin/providers/aws -v -run=Ecs -timeout 45m
=== RUN TestAccAWSEcsCluster
--- PASS: TestAccAWSEcsCluster (4.01s)
=== RUN TestAccAWSEcsServiceWithARN
--- PASS: TestAccAWSEcsServiceWithARN (105.53s)
=== RUN TestAccAWSEcsServiceWithFamilyAndRevision
--- PASS: TestAccAWSEcsServiceWithFamilyAndRevision (23.29s)
=== RUN TestAccAWSEcsTaskDefinition
--- PASS: TestAccAWSEcsTaskDefinition (1.84s)
PASS
ok      github.com/hashicorp/terraform-1/builtin/providers/aws  134.685s

@catsby
Copy link
Contributor

catsby commented May 5, 2015

Excellent work!

Do you have an idea on what items remain before you feel this is something we should merge in, or do you think it's read as is. It's unfortunate to read the shortcomings of the API at present, but exciting to get ECS in. To be honest, I was supposed to start work on this today 😄

I'm going to familiarize myself with the service and API and I'll be ready to help out where needed, if you have items you'd like someone else to chip away at.

Thanks again!

@radeksimko
Copy link
Member Author

Do you have an idea on what items remain before you feel this is something we should merge in, or do you think it's read as is.

Except the caveats mentioned above, it should work well, give it a try and deploy a sample cluster+service+TD using the simple demo module I provided above.

if you have items you'd like someone else to chip away at.

it's really mainly about the caveats, as mentioned. I left some // TODO comments in the code where I believe these issues need to be handled, so feel free to pick any of those.

@gjohnson
Copy link

gjohnson commented May 5, 2015

I'm using id as a place to save the Task Definition ARN ( = arn:aws:ecs:us-west-2:01234567890:task-definition/mongodb:3 where 3 is revision bumped each time you update it), which makes updates two-phased => need to run apply twice, first to update the TD, second to update the reference in the Service. I'm not sure what's the best way to approach this problem.

Maybe just use the family?

@radeksimko
Copy link
Member Author

@gjohnson I was thinking about that as well, I just wasn't sure how that will work in the future, when we may want to deploy TDs cross-region... but in that case we'd just have region resource parameter w/ ForceNew: true, so it's probably ok.

The other issue with this is that referencing will become a bit more "verbose", e.g.

resource "aws_ecs_service" "main" {
  name = "tf-ecs-service"
  task_definition = "${aws_ecs_task_definition.main.family}.${aws_ecs_task_definition.main.revision}"
  desired_count = "${var.service_desired_count}"
}

@radeksimko
Copy link
Member Author

I fixed most major issues (the ClientExceptions are not happening anymore, most likely because of explicit dependencies), but I discovered another one, which happens rarely and is actually a known ECS bug.

To quote an AWS support representative:

If the service has an ELB and the ELB is deleted before the service, then the service won't be able to go to INACTIVE state.

This may cause some services to be dead-locked (remain in "DRAINING" state and not reacting to delete requests), which means you cannot create another ECS service with the same name.
The work around is to create the ELB again, remove the service and then delete the ELB.

It's also worth mentioning, that none of the two AWS' issues (inability of removing task definitions & dead-locked services) are costing the customer money. All ECS usage is billed based on consumption of EC2 resources, so as long as EC2 instances are terminated, it's just mess for free.

@catsby Will you give it a full review?

@radeksimko
Copy link
Member Author

I may actually be able to work around that AWS bug simply by running ecs.DeregisterContainerInstance with a list of instances coming from ecs.ListContainerInstances.

I will have a look.

@catsby
Copy link
Contributor

catsby commented May 18, 2015

I can review this today if you think it's ready. Any new caveats / limitations ?

@catsby catsby added the waiting-response An issue/pull request is waiting for a response from the community label May 18, 2015
@radeksimko
Copy link
Member Author

There's one I discovered while testing this during the weekend - I was not able to reference ecs_cluster.name for unknown reason...
https://circleci.com/gh/TimeIncOSS/tf_aws_ecs/15

screen shot 2015-05-18 at 07 58 17

I don't remember having this issue before... but I admit I was rebasing all commits from master regularly every couple days since this PR exists...
It must be something silly... If you help me debugging that one, I think it's good to go.


I also just realised that the AWS known bug w/ ELBs/services cannot be fixed via ecs.ListContainerInstances nor ecs.DeregisterContainerInstance as these API methods are for "general purpose" instances which may or may not be related to that service, so it's not wise to deregister these.

In other words, we would be deregistering the whole instance from a cluster, not a service from ELB.

@radeksimko radeksimko removed the waiting-response An issue/pull request is waiting for a response from the community label May 18, 2015
@catsby
Copy link
Contributor

catsby commented May 19, 2015

There's a lot here, but overall looks very good. I did not run into the ecs_cluster.name issue above, but I did run into * ClusterContainsServicesException: The Cluster cannot be deleted while Services are active. when trying to destroy 😦 I couldn't see any active services in the console after I ran this delete though, what's the recourse here?

@catsby
Copy link
Contributor

catsby commented May 19, 2015

Trying to recreate the cluster from your example module (which is great, btw), I'm getting this error:

* InvalidParameterException: Creation of service was not idempotent.

Have you ran into that?

@radeksimko
Copy link
Member Author

ClusterContainsServicesException

Interesting... did you use the module provided or you just built something custom?
Either way this one should be fairly easy to fix... just keep trying a few minutes, like we do in other resources. Not perfect solution, but it will do the job.

  • InvalidParameterException: Creation of service was not idempotent.

I have not ran into this one... but it may be just because I was blocked by the ecs_cluster.name...

@catsby
Copy link
Contributor

catsby commented May 19, 2015

Interesting... did you use the module provided or you just built something custom?

I used the module

just keep trying a few minutes

I did, and it seemed stuck. I went to lunch and now it destroyed fine.

@catsby
Copy link
Contributor

catsby commented May 27, 2015

Hey @radeksimko – sorry that this has been hanging out here so long. I did review and the code checks out. There were some bumps as I mentioned, but it's not clear what we can do about those from our side, would you agree? Do you think this is ready for merging?

@radeksimko
Copy link
Member Author

@catsby I'd like to see it working again in my environment before it goes in 😃
I can try and do some further testing this evening (BST).

Last time I tried I ended up with the "ecs_cluster.name missing" issue and I also remember from the past having some troubles building the IAM instance profile + role via IAM TF resources -> I ended up creating IAM via the AWS Console "ECS - getting started" and taking ARNs from there.

Otherwise if you feel it's safe to merge it right now, I won't prevent you from doing so. 😄

Either way, I'm ok with eventually ignoring the race conditions (ClientException) and the ELB deregistration bug + task definition removal inability - that's really something we just can't easily fix on our side. Bugfixes for these things will hopefully come at some point in the future.

@radeksimko
Copy link
Member Author

So I quickly tested and I'm still getting the same error:

Error running plan: 1 error(s) occurred:

* 1 error(s) occurred:

* Resource 'aws_ecs_cluster.main' does not have attribute 'name' for variable 'aws_ecs_cluster.main.name'

and I don't understand why you're not seeing that same error.

@catsby
Copy link
Contributor

catsby commented May 29, 2015

Have you tried make updatedeps recently?

The service acceptance tests fail for me:

* ClusterNotFoundException: Cluster not found.
            status code: 400, request id: []

But cluster tests are fine

@radeksimko
Copy link
Member Author

@catsby I plan to have a look at ECS during the upcoming weekend and ideally make it ready for final review & merge. It has been waiting for almost a month now, so I think it's time 😃

@catsby
Copy link
Contributor

catsby commented May 29, 2015

@radeksimko I apologize for the delays here, I really appreciate the hard work you have and continue to put in here

@radeksimko
Copy link
Member Author

I can confirm I'm now getting the following error as well:

* ClusterNotFoundException: The referenced cluster was inactive.
            status code: 400, request id: []

looking into it now.

@radeksimko
Copy link
Member Author

Bug found + fixed - acceptance tests for ecs_service were just using the default ecs_cluster which was expected to be in the AWS accounts (and it usually is, unless you delete it, like me and quite likely you did).

Now it's creating its own cluster, which is much better for isolation of those tests anyway.

@radeksimko
Copy link
Member Author

@catsby The demo module now contains IAM resources as well. The whole functionality is tested and works for creating & updating ECS resources, but deleting is kinda "broken" due to the mentioned annoying ELB bug.

I feel we need to figure out how to delete the ecs_service gracefully so it doesn't stuck in DRAINING state, then we're good to go.

I may or may not have time this week to work on it. I will see. I spent most of my time this weekend on #2157 so I had less time for this.

@radeksimko
Copy link
Member Author

I feel we need to figure out how to delete the ecs_service gracefully so it doesn't stuck in DRAINING state, then we're good to go.

I did get a response from AWS, that this bug is now fixed, so I went ahead and tested everything all over again. I can confirm that it does not stuck in dead-lock anymore. 🎉

I've been also told that the task-definition de-registration is on the way. 😃

In other words, I think this is finally ready for a full review & merge.

@catsby
Copy link
Contributor

catsby commented Jun 5, 2015

This is fantastic work @radeksimko!
Please take the honor of clicking the Big Green Button and merge this

I'll update the CHANGELOG afterwards.

Again, thanks! 🎆

radeksimko added a commit that referenced this pull request Jun 5, 2015
aws: Add support for ECS (Container Service)
@radeksimko radeksimko merged commit 1770713 into hashicorp:master Jun 5, 2015
@radeksimko radeksimko deleted the ecs branch June 5, 2015 20:25
@radeksimko
Copy link
Member Author

Done 😄 🎉

@catsby
Copy link
Contributor

catsby commented Jun 5, 2015

awesome

@jbrook
Copy link

jbrook commented Jun 16, 2015

I am playing with the ECS functionality and I ran into an issue with specifying volumes in a task definition. It should be possible to pass an empty host in which case the volume is not made persistent - the docker daemon manages its location and garbage collection:

http://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_data_volumes.html

@ghost
Copy link

ghost commented May 2, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators May 2, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants