Atlantis fails to complete plan for a large environment. #452

Closed
gvwirth opened this issue Feb 1, 2019 · 17 comments · Fixed by #464
Labels
bug Something isn't working

Comments

@gvwirth

gvwirth commented Feb 1, 2019

With the latest Atlantis container, we've run into an issue where our largest TF environment never completes its Atlantis plan. This environment has a couple hundred resources tracked in it. The logs indicate that it reaches the "terraform workspace show" section of the auto-plan, but the actual plan never completes:

2019/02/01 20:45:49+0000 [INFO] project_command_builder.go:156 team/eng-terraform#110: 2 projects are to be planned based on their when_modified config

2019/02/01 20:45:49+0000 [INFO] project_locker.go:74 team/eng-terraform#110: Acquired lock with id "team/eng-terraform/envs/prod-global/iam-permissions/default"

2019/02/01 20:45:49+0000 [DBUG] project_command_runner.go:131 team/eng-terraform#110: Acquired lock for project

2019/02/01 20:45:49+0000 [DBUG] working_dir.go:78 team/eng-terraform#110: Clone directory "/home/atlantis/.atlantis/repos/team/eng-terraform/110/default" already exists, checking if it's at the right commit

2019/02/01 20:45:49+0000 [DBUG] working_dir.go:100 team/eng-terraform#110: Repo is at correct commit "" so will not re-clone

2019/02/01 20:45:52+0000 [INFO] terraform_client.go:171 team/eng-terraform#110: Successfully ran "terraform init -input=false -no-color" in "/home/atlantis/.atlantis/repos/team/eng-terraform/110/default/envs/prod-global/iam-permissions"

2019/02/01 20:45:52+0000 [INFO] terraform_client.go:171 team/eng-terraform#110: Successfully ran "terraform workspace show" in "/home/atlantis/.atlantis/repos/team/eng-terraform/110/default/envs/prod-global/iam-permissions"

Beyond that, we get no further log entries for this environment.

If I manually delete the lock on the environment and comment "atlantis plan" again on the PR, I get this error:

the default workspace is currently locked by another command that is running for this pull request–wait until the previous command is complete and try again

Since there are no more locks available to delete from the UI, the only solution is to manually apply the Terraform changes outside of Atlantis and then merge the PR, despite the CI check failure.

We do not have this behavior in our other, smaller environments (approx a dozen resources) -- just this very large one.

This does not occur in version 0.4.13 -- the plan successfully completes for the same large environment.

@mignaulo

mignaulo commented Feb 1, 2019

Here is some of the debugging we did; hopefully it helps.
From what I saw in the process list, Atlantis would run
terraform plan -input=false -refresh -no-color -out /home/atlantis/.atlantis/repos/xxxx/xxxx/110/default/envs/prod-global/iam-permissions/default.tfplan ...
which would start
/usr/local/bin/tf/versions/0.11.11/terraform plan -input=false -refresh -no-color -out /home/atlantis/.atlantis/repos/xxxx/xxxx/110/default/envs/prod-global/iam-permissions/default.tfplan ...
At some point, /usr/local/bin/tf/versions/0.11.11/terraform would finish running, but the initial terraform plan would keep running, seemingly forever. I could kill that terraform plan process, which would allow Atlantis to post a reply to GitHub.

Could it be related to this update? #421

@lkysow
Member

lkysow commented Feb 4, 2019

@mignaulo it might be related to that. Is there a crash.log in the directory?

@lkysow lkysow added the bug Something isn't working label Feb 4, 2019
@lkysow
Member

lkysow commented Feb 4, 2019

Also, what happens when you run the command manually?

@mignaulo

mignaulo commented Feb 4, 2019

I was able to run the command manually successfully. I did not look for a crash.log entry unfortunately :/

@lkysow
Member

lkysow commented Feb 4, 2019

Can you reproduce it? Does it happen every time?

@lkysow
Member

lkysow commented Feb 4, 2019

@gvwirth

This does not occur in version 0.4.13 -- the plan successfully completes for the same large environment.

Are you running runatlantis/atlantis:latest then?

@mignaulo

mignaulo commented Feb 4, 2019

It happened every time while running runatlantis/atlantis:latest.
We've switched to runatlantis/atlantis:v0.4.13 to fix it.

@lkysow
Member

lkysow commented Feb 4, 2019

I was able to run the command manually successfully.

Locally or in the same dir and on the same server as Atlantis?

@mignaulo

mignaulo commented Feb 4, 2019

I ran the same command Atlantis ran in the same directory on the same server as Atlantis.

@lkysow
Member

lkysow commented Feb 7, 2019

Is there any way for me to reproduce this locally? Does it happen for all of your Terraform projects or is it only for that large one?

@mignaulo

mignaulo commented Feb 7, 2019

Hi - It only happens for our largest project. To give you an idea of size, the state file is 1.6MB. Plans take 3 or 4 minutes on a fairly small Fargate instance. (When we ran into issues, we switched to EC2-backed ECS so we could SSH into the container and get more information. We have since switched back.)
If desired, we could switch back to latest and run some tests for you, if there is anything specific you want us to try.

@lkysow
Member

lkysow commented Feb 7, 2019

Can I build a custom Atlantis image for you with extra logging? Do you need anything special in the image or do you just use the default image?

@mignaulo

mignaulo commented Feb 7, 2019

We just use the default image 👍 We'll be happy to run some tests on your custom image.

@lkysow
Member

lkysow commented Feb 7, 2019

Okay so I've created a Docker image lkysow/atlantis-debug:v1 off of this branch: https://github.com/runatlantis/atlantis/tree/custom-debug-image.

I'd like you to

  • Run that image with --log-level=debug
  • If Atlantis hangs, go to the new /debug/pprof/ endpoint and click on "full goroutine stack dump" and copy the data.
  • Also collect all the log entries and send them to me. You can do this via our Slack channel for security.
  • Finally, can you set the environment variable ATLANTIS_OLD_EXEC=true, then restart Atlantis and re-run the plan? This will cause Atlantis to use the old method of execution. Let me know if that one works.

Thanks so much for helping me debug this.

@lkysow
Member

lkysow commented Feb 11, 2019

For anyone reading along, Olivier helped me debug this and even came up with the solution!

Basically, I wasn't reading from the OS pipe concurrently. If there was enough output, terraform would stall waiting for the pipe to drain, while Atlantis would wait for terraform to finish executing before reading from the pipe ==> deadlock.
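
For readers who want to see the shape of this problem, here is a minimal Go sketch of the concurrent-read pattern that avoids it. This is not the Atlantis code: `runDraining` is a hypothetical helper, and `seq` is only a stand-in for a terraform run that prints more than one pipe buffer of output.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os/exec"
	"sync"
)

// runDraining starts a command and drains its stdout and stderr in
// background goroutines before calling Wait. The deadlock-prone ordering
// is the reverse: calling Wait first and reading afterwards, because the
// child blocks on a full pipe and therefore never exits.
func runDraining(name string, args ...string) (string, error) {
	cmd := exec.Command(name, args...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return "", err
	}
	stderr, err := cmd.StderrPipe()
	if err != nil {
		return "", err
	}
	if err := cmd.Start(); err != nil {
		return "", err
	}

	var outBuf, errBuf bytes.Buffer
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); _, _ = io.Copy(&outBuf, stdout) }()
	go func() { defer wg.Done(); _, _ = io.Copy(&errBuf, stderr) }()

	// All reads must finish before Wait, since Wait closes the pipes.
	wg.Wait()
	err = cmd.Wait()
	return outBuf.String() + errBuf.String(), err
}

func main() {
	// Assumption: `seq` is on PATH; it prints far more than a typical
	// 64 KiB pipe buffer.
	out, err := runDraining("seq", "1", "50000")
	if err != nil {
		fmt.Println("command failed:", err)
		return
	}
	fmt.Printf("captured %d bytes without stalling\n", len(out))
}
```

Small projects never fill the pipe buffer, which would be consistent with the hang showing up only on the largest environment.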

The fix was to read off the pipe concurrently, but while testing the fix, I realized the terraform panic that my original code changes were meant to solve was actually being correctly caught using CombinedOutput(). My earlier Terraform panics came from the 0.12 alpha build, but when I built my own terraform binary with an embedded panic, CombinedOutput() had no issues. As a result, I've switched back to using CombinedOutput in the latest release (v0.4.15) and I would recommend no one use v0.4.14.

@lkysow
Member

lkysow commented Feb 11, 2019

@mignaulo

Thanks @lkysow!
