Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(checkout-strategy): On merge strategy, reclone when necessary on base update #3187

Merged
merged 8 commits into from
Mar 25, 2023

Conversation

finnag
Copy link
Contributor

@finnag finnag commented Mar 3, 2023

what

When using the merge checkout strategy, check if the base branch has
been updated before each plan operation and merge again if so.

why

The current "merge" checkout strategy is unsafe. It merges the PR and
the base branch without holding the directory lock(s), so there is a
potentially very long window where another PR can be applied and be
unexpectedly reverted later.

Instead of printing a warning when this happens, we now merge again
if necessary while holding the lock. Plans are then guaranteed to only
be made when merged with the base branch for each directory being planned,
and applying later should be safe because of the locks, even if the base
branch sees further updates.

tests

  • I have tested my changes by make test && make test-all

references

finnag added 2 commits March 3, 2023 15:52
The current "merge" checkout strategy is unsafe. It merges the PR and the base
branch without holding the directory lock(s), so there is a potentially
very long window where another PR can be applied and be unexpectedly reverted
later.

This happens occasionally if a PR causes plans in multiple directories, but
is almost _guaranteed_ to happen if the initial plan has to wait until a
lock is freed up and a manual "atlantis plan" command is given.

Instead of printing a warning when this happens, we now merge again
if necessary while holding the lock. Plans are then guaranteed to only
be made when merged with the base branch for each directory being planned,
and applying later should be safe even if the base branch sees further
updates.

This fixes/affects runatlantis#804, runatlantis#867, runatlantis#979
@finnag finnag requested a review from a team as a code owner March 3, 2023 15:09
@github-actions github-actions bot added the go Pull requests that update Go code label Mar 3, 2023
@nitrocode
Copy link
Member

@finnag please fill out the pr template

@finnag
Copy link
Contributor Author

finnag commented Mar 4, 2023

Updated desc to match template

@nitrocode
Copy link
Member

Thanks @finnag

Please fix the broken test.

Please also test this in your environment to ensure that it's working as expected. Test with both branch and merge strategy please.

@nitrocode nitrocode changed the title Re-clone when necessary on merge base update fix(checkout-strategy): On merge strategy, reclone when necessary on base update Mar 6, 2023
@finnag
Copy link
Contributor Author

finnag commented Mar 6, 2023

The test that times out looks like a flaky test on main.
To reproduce:

while :; do go clean -testcache; go test -timeout 20s github.com/runatlantis/atlantis/server/jobs; done

Seems to fail with a timeout now and then, and sometimes you get this:

--- FAIL: TestProjectCommandOutputHandler (0.02s)
    --- FAIL: TestProjectCommandOutputHandler/clean_up_all_jobs_when_PR_is_closed (0.00s)
        project_command_output_handler_test.go:177: 
            	Error Trace:	/Users/finnag/atlantis/server/jobs/project_command_output_handler_test.go:177
            	Error:      	Should be empty, but was map[1234:true]
            	Test:       	TestProjectCommandOutputHandler/clean_up_all_jobs_when_PR_is_closed

@finnag
Copy link
Contributor Author

finnag commented Mar 6, 2023

Even better:

go test -timeout 20s -count 30 github.com/runatlantis/atlantis/server/jobs

usually fails on my machine. This is with main at 5be47fb

@finnag
Copy link
Contributor Author

finnag commented Mar 6, 2023

The following patch will cause the test to fail reliably (maybe this should be a separate bug report for the github.com/runatlantis/atlantis/server/jobs test):

--- a/server/jobs/project_command_output_handler_test.go
+++ b/server/jobs/project_command_output_handler_test.go
@@ -78,6 +78,7 @@ func TestProjectCommandOutputHandler(t *testing.T) {
 
                // read from channel
                go func() {
+                       time.Sleep(40 * time.Millisecond) // This sleep makes the test time out ALWAYS.
                        for msg := range ch {
                                expectedMsg = msg
                                wg.Done()

It causes func (p *AsyncProjectCommandOutputHandler) writeLogLine(jobID string, line string) { in server/jobs/project_command_output_handler.go to hit the

                default:
                        // Delete buffered channel if it's blocking.
                        delete(p.receiverBuffers[jobID], ch)

case, and then the message is lost.

Limited time to track through the code today, so not sure exactly what is going on.

Test failure:

$ go test -v -timeout 1s -count 1 github.com/runatlantis/atlantis/server/jobs
=== RUN   TestJobURLSetter
=== RUN   TestJobURLSetter/update_project_status_with_project_jobs_url
=== RUN   TestJobURLSetter/update_project_status_with_project_jobs_url_error
--- PASS: TestJobURLSetter (0.00s)
    --- PASS: TestJobURLSetter/update_project_status_with_project_jobs_url (0.00s)
    --- PASS: TestJobURLSetter/update_project_status_with_project_jobs_url_error (0.00s)
=== RUN   TestProjectCommandOutputHandler
=== RUN   TestProjectCommandOutputHandler/receive_message_from_main_channel
panic: test timed out after 1s
...

@nitrocode
Copy link
Member

Hi @finnag any update on the test failure? Is this something you can solve or is it something that can be solved in a separate ticket and pr?

@runatlantis runatlantis deleted a comment from jamengual Mar 17, 2023
@runatlantis runatlantis deleted a comment from frtr1p Mar 17, 2023
@nitrocode nitrocode added the waiting-on-response Waiting for a response from the user label Mar 20, 2023
@finnag
Copy link
Contributor Author

finnag commented Mar 21, 2023

The test failure was fixed in a separate PR already.

I've run this PR in various test setups, and it performs as intended - the usual workflows where you want to merge now work.

@nitrocode nitrocode added waiting-on-review Waiting for a review from a maintainer and removed waiting-on-response Waiting for a response from the user labels Mar 21, 2023
@jamengual
Copy link
Contributor

@finnag have you tested this in monorepo or polirepo with workspaces and multiples dir?

I'm concerned that when locks are per dir and or workspace we could potentially have a case where one PR against the same repo could pull HEAD but all other PRs on the same repo be behind.

will this work if there are plans already existing on PRs and are ready to apply by running atlantis apply? ( will it pull before trying to apply?)

@finnag
Copy link
Contributor Author

finnag commented Mar 23, 2023

We're running it live in several monorepos now that have a lot of directories, and it seems to work well, there is no longer any need to manually merge/rebase PRs before planning or applying. It relies on the directory lock to keep the directory safe from modification by other PRs in the period between plan and apply.

The flow per directory is: lock -> [internal merge with HEAD] -> plan -> apply -> merge PR -> unlock.

The idea is that as long as the lock is held, no other PR can merge anything to main that needs the same lock, so there is no need to resync with main before the apply step.

@nitrocode nitrocode removed the waiting-on-review Waiting for a review from a maintainer label Mar 23, 2023
@nitrocode nitrocode added this to the v0.23.4 milestone Mar 23, 2023
Copy link
Member

@GenPage GenPage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocking, just had a single comment to be aware of. Maybe leave a code comment for future coders as a TODO that the boolean to warn users can be refactored out in a later PR if its no longer needed.

finnag added 2 commits March 24, 2023 15:33
We now verify that the first Clone with CheckoutMerge=true with
a diverged base branch atually clones and merges again.
@nitrocode nitrocode merged commit 1e00e2f into runatlantis:main Mar 25, 2023
@nitrocode
Copy link
Member

Thank you @finnag !

nitrocode pushed a commit that referenced this pull request May 5, 2023
…n base update (#3187)

* merge again if base branch has been updated

The current "merge" checkout strategy is unsafe. It merges the PR and the base
branch without holding the directory lock(s), so there is a potentially
very long window where another PR can be applied and be unexpectedly reverted
later.

This happens occasionally if a PR causes plans in multiple directories, but
is almost _guaranteed_ to happen if the initial plan has to wait until a
lock is freed up and a manual "atlantis plan" command is given.

Instead of printing a warning when this happens, we now merge again
if necessary while holding the lock. Plans are then guaranteed to only
be made when merged with the base branch for each directory being planned,
and applying later should be safe even if the base branch sees further
updates.

This fixes/affects #804, #867, #979

* Remove diverged test that no longer applies

* Reinstate TestClone_MasterHasDiverged test and make it pass

* Extend TestClone_MasterHasDiverged to test new merging functionality

We now verify that the first Clone with CheckoutMerge=true with
a diverged base branch atually clones and merges again.

---------

Co-authored-by: PePe Amengual <jose.amengual@gmail.com>
ijames-gc pushed a commit to gocardless/atlantis that referenced this pull request Feb 13, 2024
…n base update (runatlantis#3187)

* merge again if base branch has been updated

The current "merge" checkout strategy is unsafe. It merges the PR and the base
branch without holding the directory lock(s), so there is a potentially
very long window where another PR can be applied and be unexpectedly reverted
later.

This happens occasionally if a PR causes plans in multiple directories, but
is almost _guaranteed_ to happen if the initial plan has to wait until a
lock is freed up and a manual "atlantis plan" command is given.

Instead of printing a warning when this happens, we now merge again
if necessary while holding the lock. Plans are then guaranteed to only
be made when merged with the base branch for each directory being planned,
and applying later should be safe even if the base branch sees further
updates.

This fixes/affects runatlantis#804, runatlantis#867, runatlantis#979

* Remove diverged test that no longer applies

* Reinstate TestClone_MasterHasDiverged test and make it pass

* Extend TestClone_MasterHasDiverged to test new merging functionality

We now verify that the first Clone with CheckoutMerge=true with
a diverged base branch atually clones and merges again.

---------

Co-authored-by: PePe Amengual <jose.amengual@gmail.com>
ijames-gc pushed a commit to gocardless/atlantis that referenced this pull request Feb 13, 2024
…n base update (runatlantis#3187)

* merge again if base branch has been updated

The current "merge" checkout strategy is unsafe. It merges the PR and the base
branch without holding the directory lock(s), so there is a potentially
very long window where another PR can be applied and be unexpectedly reverted
later.

This happens occasionally if a PR causes plans in multiple directories, but
is almost _guaranteed_ to happen if the initial plan has to wait until a
lock is freed up and a manual "atlantis plan" command is given.

Instead of printing a warning when this happens, we now merge again
if necessary while holding the lock. Plans are then guaranteed to only
be made when merged with the base branch for each directory being planned,
and applying later should be safe even if the base branch sees further
updates.

This fixes/affects runatlantis#804, runatlantis#867, runatlantis#979

* Remove diverged test that no longer applies

* Reinstate TestClone_MasterHasDiverged test and make it pass

* Extend TestClone_MasterHasDiverged to test new merging functionality

We now verify that the first Clone with CheckoutMerge=true with
a diverged base branch atually clones and merges again.

---------

Co-authored-by: PePe Amengual <jose.amengual@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go Pull requests that update Go code
Projects
None yet
5 participants