Atlantis fails to complete plan for a large environment. #452

gvwirth · 2019-02-01T21:33:50Z

With the latest Atlantis container, we've run into an issue where our largest TF environment never completes its Atlantis plan. This environment has a couple hundred resources tracked in it. The logs indicate that it reaches the "terraform workspace show" section of the auto-plan, but the actual plan never completes:

2019/02/01 20:45:49+0000 [INFO] project_command_builder.go:156 team/eng-terraform#110: 2 projects are to be planned based on their when_modified config

2019/02/01 20:45:49+0000 [INFO] project_locker.go:74 team/eng-terraform#110: Acquired lock with id "team/eng-terraform/envs/prod-global/iam-permissions/default"

2019/02/01 20:45:49+0000 [DBUG] project_command_runner.go:131 team/eng-terraform#110: Acquired lock for project

2019/02/01 20:45:49+0000 [DBUG] working_dir.go:78 team/eng-terraform#110: Clone directory "/home/atlantis/.atlantis/repos/team/eng-terraform/110/default" already exists, checking if it's at the right commit

2019/02/01 20:45:49+0000 [DBUG] working_dir.go:100 team/eng-terraform#110: Repo is at correct commit "" so will not re-clone

2019/02/01 20:45:52+0000 [INFO] terraform_client.go:171 team/eng-terraform#110: Successfully ran "terraform init -input=false -no-color" in "/home/atlantis/.atlantis/repos/team/eng-terraform/110/default/envs/prod-global/iam-permissions"

2019/02/01 20:45:52+0000 [INFO] terraform_client.go:171 team/eng-terraform#110: Successfully ran "terraform workspace show" in "/home/atlantis/.atlantis/repos/team/eng-terraform/110/default/envs/prod-global/iam-permissions"

Beyond that, we get no further log entries for this environment.

If I manually delete the lock on the environment and comment "atlantis plan" again on the PR, I get this error:

the default workspace is currently locked by another command that is running for this pull request–wait until the previous command is complete and try again

Since there are no more locks available to delete from the UI, the only solution is to manually apply the Terraform changes outside of Atlantis and then merge the PR, despite the CI check failure.

We do not have this behavior in our other, smaller environments (approx a dozen resources) -- just this very large one.

This does not occur in version 0.4.13 -- the plan successfully completes for the same large environment.

mignaulo · 2019-02-01T21:46:38Z

Some of the debugging we did, hopefully this helps:
From what I saw from the processlist, Atlantis would run
terraform plan -input=false -refresh -no-color -out /home/atlantis/.atlantis/repos/xxxx/xxxx/110/default/envs/prod-global/iam-permissions/default.tfplan ...
which would start
/usr/local/bin/tf/versions/0.11.11/terraform plan -input=false -refresh -no-color -out /home/atlantis/.atlantis/repos/xxxx/xxxx/110/default/envs/prod-global/iam-permissions/default.tfplan ...
At some point, /usr/local/bin/tf/versions/0.11.11/terraform would finish running, but the initial terraform plan would keep running seemingly forever. I could kill that terraform plan process, which would allow Atlantis to post a reply to Github.

Could it be related to this update? #421

lkysow · 2019-02-04T20:40:33Z

@mignaulo it might be related to that. Is there a crash.log in the directory?

lkysow · 2019-02-04T20:41:03Z

Also, what happens when you run the command manually?

mignaulo · 2019-02-04T21:02:26Z

I was able to run the command manually successfully. I did not look for a crash.log entry unfortunately :/

lkysow · 2019-02-04T21:17:41Z

Can you reproduce it? Does it happen every time?

lkysow · 2019-02-04T21:18:41Z

@gvwirth

> This does not occur in version 0.4.13 -- the plan successfully completes for the same large environment.

Are you running runatlantis/atlantis:latest then?

mignaulo · 2019-02-04T21:20:50Z

It happened every time while running runatlantis/atlantis:latest.
We've switched to runatlantis/atlantis:v0.4.13 to fix it.

lkysow · 2019-02-04T21:23:48Z

> I was able to run the command manually successfully.

Locally or in the same dir and on the same server as Atlantis?

mignaulo · 2019-02-04T21:25:18Z

I ran the same command Atlantis ran in the same directory on the same server as Atlantis.

lkysow · 2019-02-07T16:09:09Z

Is there any way for me to reproduce this locally? Does it happen for all of your Terraform projects or is it only for that large one?

mignaulo · 2019-02-07T18:51:36Z

Hi - It only happens for our largest project. To give you an idea of size, the state file is 1.6MB. Plans take 3 or 4 minutes on a fairly small Fargate instance. (When we ran into issues, we switched to EC2-backed ECS so we could SSH into the container and get more information. We have since switched back.)
If desired, we could try to switch back to latest and run some tests for you - if you want us to try anything specific, etc.

lkysow · 2019-02-07T18:55:50Z

Can I build a custom Atlantis image for you with extra logging? Do you need anything special in the image or do you just use the default image?

mignaulo · 2019-02-07T18:57:45Z

We just use the default image 👍 We'll be happy to run some tests on your custom image.

lkysow · 2019-02-07T21:14:12Z

Okay so I've created a Docker image lkysow/atlantis-debug:v1 off of this branch: https://github.com/runatlantis/atlantis/tree/custom-debug-image.

I'd like you to:

1. Run that image with --log-level=debug.
2. If Atlantis hangs, go to the new /debug/pprof/ endpoint, click on "full goroutine stack dump" and copy the data.
3. Also collect all the log entries and send them to me. You can do this via our Slack channel for security.
4. Finally, set the environment variable ATLANTIS_OLD_EXEC=true, then restart Atlantis and re-run the plan. This will cause Atlantis to use the old method of execution. Let me know if that one works.
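
For readers unfamiliar with the /debug/pprof/ endpoint mentioned above: it is the index page of Go's standard net/http/pprof package, and its "full goroutine stack dump" link points at goroutine?debug=2. The following is a minimal sketch of how a Go server could expose it and what that link returns. The helper names newDebugMux and goroutineDump are hypothetical, and this is not necessarily how the debug image wires the endpoint up.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux (hypothetical helper) returns a mux that serves Go's standard
// pprof index, including the named profiles, under /debug/pprof/.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	return mux
}

// goroutineDump (hypothetical helper) requests the same URL that the
// "full goroutine stack dump" link uses and returns the response body.
// It calls the handler in-process, so no port needs to be opened.
func goroutineDump(h http.Handler) string {
	req := httptest.NewRequest("GET", "/debug/pprof/goroutine?debug=2", nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	dump := goroutineDump(newDebugMux())
	fmt.Printf("collected %d bytes of goroutine stacks\n", len(dump))
}
```

In a hung process, the dump shows where each goroutine is blocked, which is why it is useful for diagnosing a stall like this one.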

Thanks so much for helping me debug this.

lkysow · 2019-02-11T21:12:57Z

For anyone reading along, Olivier helped me debug this and even came up with the solution!

Basically, I wasn't reading off the OS pipe concurrently. If there was enough output, terraform would stall waiting for the pipe to empty out, while Atlantis would wait for terraform to finish executing before reading off the pipe: a deadlock.
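
To illustrate the concurrent-read pattern described here, this is a minimal sketch using Go's os/exec, not the actual Atlantis code. The function name runConcurrent and the sh command in main are assumptions made for the example. The pipes are drained in goroutines while the child runs, and cmd.Wait is only called after all reads have completed, so the child never blocks on a full pipe.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os/exec"
	"sync"
)

// runConcurrent (hypothetical helper) runs a command and drains stdout and
// stderr in separate goroutines while the command is still running.
func runConcurrent(name string, args ...string) (string, string, error) {
	cmd := exec.Command(name, args...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return "", "", err
	}
	stderr, err := cmd.StderrPipe()
	if err != nil {
		return "", "", err
	}
	if err := cmd.Start(); err != nil {
		return "", "", err
	}

	var outBuf, errBuf bytes.Buffer
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); io.Copy(&outBuf, stdout) }()
	go func() { defer wg.Done(); io.Copy(&errBuf, stderr) }()

	// All reads from the pipes must finish before Wait is called,
	// because Wait closes the pipes.
	wg.Wait()
	err = cmd.Wait()
	return outBuf.String(), errBuf.String(), err
}

func main() {
	// Write 1 MiB to stdout, well beyond a typical OS pipe buffer.
	out, _, err := runConcurrent("sh", "-c", "head -c 1048576 /dev/zero | tr '\\0' 'x'")
	fmt.Println(len(out), err)
}
```

If the reads were instead deferred until after the process exited, the child would block once the pipe buffer filled and neither side could make progress.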

The fix was to read off the pipe concurrently, but while testing the fix I realized that the terraform panic my original code changes were meant to solve was actually being caught correctly by CombinedOutput(). My previous Terraform panics came from the 0.12 alpha build, but when I built my own terraform binary with an embedded panic, CombinedOutput() had no issues. As a result, I've switched back to using CombinedOutput in the latest release (v0.4.15) and I would recommend no one use v0.4.14.
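
For reference, here is a minimal sketch of the CombinedOutput approach using Go's os/exec, not the actual Atlantis code. The helper name runCombined and the sh command are assumptions made for the example. CombinedOutput runs the command, collects stdout and stderr into one buffer as the process writes them, and returns a non-nil error on a non-zero exit status, so output written before a failure is still captured.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runCombined (hypothetical helper) runs a command and returns its stdout
// and stderr interleaved in a single string, plus any exit error.
func runCombined(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), err
}

func main() {
	// The command writes to both streams and then fails, standing in for
	// a process that prints diagnostics before exiting abnormally.
	out, err := runCombined("sh", "-c", "echo to-stdout; echo to-stderr 1>&2; exit 3")
	fmt.Printf("output=%q err=%v\n", out, err)
}
```

Because the standard library handles the copying internally, the caller does not need to manage pipes or goroutines.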

lkysow · 2019-02-11T21:13:53Z

Release: https://github.com/runatlantis/atlantis/releases/tag/v0.4.15

mignaulo · 2019-02-11T21:52:09Z

Thanks @lkysow!

lkysow added the bug label Feb 4, 2019

lkysow mentioned this issue Feb 11, 2019

Fix issues with terraform execution. #464

Merged

lkysow closed this as completed in #464 Feb 11, 2019

DenislavTsonev mentioned this issue Sep 24, 2021

Atlantis fails to complete plan #1830

Closed

ldunkum mentioned this issue Aug 31, 2023

Terraform signal:killed with long plan output #3721

Closed
