Fix an issue with resuming a failed container cluster creation #7121

Merged: 2 commits, Jan 26, 2023

Conversation

Contributor

@trodge trodge commented Jan 12, 2023

fixes b/228111747

fixes hashicorp/terraform-provider-google#11431

If this PR is for Terraform, I acknowledge that I have:

  • Searched through the issue tracker for an open issue that this either resolves or contributes to, commented on it to claim it, and written "fixes {url}" or "part of {url}" in this PR description. If there were no relevant open issues, I opened one and commented that I would like to work on it (not necessary for very small changes).
  • Generated Terraform, and ran make test and make lint to ensure it passes unit and linter tests.
  • Ensured that all new fields I added that can be set by a user appear in at least one example (for generated resources) or third_party test (for handwritten resources or update tests).
  • Ran relevant acceptance tests (If the acceptance tests do not yet pass or you are unable to run them, please let your reviewer know).
  • Read the Release Notes Guide before writing my release note below.

Release Note Template for Downstream PRs (will be copied)

container: fixed an issue with resuming failed cluster creation

@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

Terraform GA: Diff ( 3 files changed, 226 insertions(+), 3 deletions(-))
Terraform Beta: Diff ( 3 files changed, 226 insertions(+), 3 deletions(-))
TF Validator: Diff ( 2 files changed, 3 insertions(+), 3 deletions(-))

@modular-magician
Collaborator

Tests analytics

Total tests: 2408
Passed tests: 2155
Skipped tests: 251
Failed tests: 2

Action taken

Triggering VCR tests in RECORDING mode for the following tests that failed during VCR:
TestAccFirebaserulesRelease_BasicRelease|TestAccContainerCluster_failedCreation

@modular-magician
Collaborator

Tests passed during RECORDING mode:
TestAccFirebaserulesRelease_BasicRelease [Debug log]

Tests failed during RECORDING mode:
TestAccContainerCluster_failedCreation[Error message] [Debug log]

Please fix these to complete your PR
View the build log or the debug log for each test

@trodge
Contributor Author

trodge commented Jan 13, 2023

/gcbrun

@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

Terraform GA: Diff ( 3 files changed, 226 insertions(+), 3 deletions(-))
Terraform Beta: Diff ( 3 files changed, 226 insertions(+), 3 deletions(-))
TF Validator: Diff ( 2 files changed, 3 insertions(+), 3 deletions(-))

@modular-magician
Collaborator

Tests analytics

Total tests: 2409
Passed tests: 2155
Skipped tests: 251
Failed tests: 3

Action taken

Triggering VCR tests in RECORDING mode for the following tests that failed during VCR:
TestAccFirebaserulesRelease_BasicRelease|TestAccContainerCluster_failedCreation|TestAccRegionInstanceGroupManager_stateful

@modular-magician
Collaborator

Tests passed during RECORDING mode:
TestAccFirebaserulesRelease_BasicRelease [Debug log]
TestAccContainerCluster_failedCreation [Debug log]

Tests failed during RECORDING mode:
TestAccRegionInstanceGroupManager_stateful[Error message] [Debug log]

Please fix these to complete your PR
View the build log or the debug log for each test

@trodge trodge requested a review from melinath January 13, 2023 01:46
@melinath melinath requested review from a team and roaks3 and removed request for melinath and a team January 20, 2023 22:02
@melinath
Member

reassigning to a random team member.

Contributor

@roaks3 roaks3 left a comment

Would this also fix hashicorp/terraform-provider-google#11431?

I had a few questions but overall solution looks good.

@@ -2247,11 +2249,43 @@ func resourceContainerClusterRead(d *schema.ResourceData, meta interface{}) erro
	}
	waitErr := containerOperationWait(config, op, project, location, "resuming GKE cluster", userAgent, d.Timeout(schema.TimeoutRead))
	if waitErr != nil {
		// Check if the create operation failed because Terraform was prematurely terminated. If it was we can persist the
Contributor

For anyone else reading, it looks like this block is an exact copy of what is done in the *Create function above.

select {
case <-config.context.Done():
	log.Printf("[DEBUG] Persisting %s so this operation can be resumed \n", op.Name)
	if err := d.Set("operation", op.Name); err != nil {
Contributor

Could this result in an infinite loop situation? It seems like a user might be able to get stuck in some situation where the operation never terminates properly and they can't get a terraform plan to succeed. IMO, one retry probably makes sense, and I think that's why the operation field is unset immediately above.

Contributor Author

It looks like the test still passes without this part.

Steps: []resource.TestStep{
	{
		Config:      testAccContainerCluster_failedCreation(clusterName, project.ProjectId),
		ExpectError: regexp.MustCompile(".*timeout while waiting for state to become 'DONE'.*"),
Contributor

Do you need the .*s for these regexes? My understanding is that a match anywhere in the string is enough; the pattern does not have to match the whole string.

Contributor Author

I have removed the .*s.
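
For readers following the regex question, here is a minimal sketch of why the wildcards are redundant. It relies only on Go's standard `regexp` package, whose `MatchString` reports a match anywhere in the input unless the pattern is anchored. The error text in `errMsg` and the helper `matches` are made up for illustration and are not taken from the provider or its logs.

```go
package main

import (
	"fmt"
	"regexp"
)

// errMsg is a made-up error string used only for illustration.
const errMsg = "Error waiting for creating GKE cluster: timeout while waiting for state to become 'DONE' (last state: 'RUNNING')"

// matches reports whether pattern matches anywhere in msg.
func matches(pattern, msg string) bool {
	return regexp.MustCompile(pattern).MatchString(msg)
}

func main() {
	withWildcards := ".*timeout while waiting for state to become 'DONE'.*"
	withoutWildcards := "timeout while waiting for state to become 'DONE'"

	// Both patterns match, because MatchString is unanchored.
	fmt.Println(matches(withWildcards, errMsg), matches(withoutWildcards, errMsg))

	// An explicit anchor is what restricts where the match may start.
	fmt.Println(matches("^timeout while waiting", errMsg))
}
```

The shorter pattern is easier to read and states the same expectation.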

@@ -358,6 +359,136 @@ func BootstrapServicePerimeterProjects(t *testing.T, desiredProjects int) []*clo
	return projects
}

func removeContainerServiceAgentRoleFromContainerEngineRobot(t *testing.T, project *cloudresourcemanager.Project) {
	config := BootstrapConfig(t)
Contributor

Since you are using BootstrapConfig, I don't believe the requests made in this function will get recorded or replayed by the VCR. That means the requests will be made on every replay (i.e. multiple times per PR). You should be able to test this with multiple GCB runs in this PR.

If the requests aren't recorded, it might be worth checking the policy first and only calling "Set" if it needs to be updated. That would also make it more like our other bootstrap functions, so you could potentially use that pattern more explicitly, i.e. BootstrapProjectWithInvalidContainerEngineRobot(). This assumes that you never need to reset this project to its original state.

Contributor Author

I have made this function only set the policy when changes are required.
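
The "check first, set only if needed" pattern discussed here can be sketched as follows. This is an illustration under stated assumptions, not the function from this PR: `binding` and `removeMember` are hypothetical, the member string is a placeholder, and the assignment to `policy` stands in for the real "Set" request.

```go
package main

import "fmt"

// binding is a hypothetical, simplified IAM policy binding.
type binding struct {
	Role    string
	Members []string
}

// removeMember returns the bindings with member removed from role, plus a flag
// reporting whether the member was actually present. The input is not modified.
// A binding left without members after removal is dropped.
func removeMember(bindings []binding, role, member string) ([]binding, bool) {
	changed := false
	out := make([]binding, 0, len(bindings))
	for _, b := range bindings {
		if b.Role != role {
			out = append(out, b)
			continue
		}
		kept := make([]string, 0, len(b.Members))
		for _, m := range b.Members {
			if m != member {
				kept = append(kept, m)
			}
		}
		if len(kept) == len(b.Members) {
			out = append(out, b) // member was not in this binding
			continue
		}
		changed = true
		if len(kept) > 0 {
			out = append(out, binding{Role: b.Role, Members: kept})
		}
	}
	return out, changed
}

func main() {
	// Placeholder role and member values, assumed for illustration.
	const role = "roles/container.serviceAgent"
	const robot = "serviceAccount:robot@example.invalid"

	policy := []binding{{Role: role, Members: []string{robot}}}
	setCalls := 0
	for run := 1; run <= 2; run++ {
		updated, changed := removeMember(policy, role, robot)
		if changed {
			policy = updated // stands in for the "Set" request
			setCalls++
		}
		fmt.Printf("run %d: set calls so far = %d\n", run, setCalls)
	}
}
```

Because the second run finds nothing to remove, it issues no write, which keeps repeated replays from sending the same unrecorded update each time.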

@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

Terraform GA: Diff ( 3 files changed, 239 insertions(+), 3 deletions(-))
Terraform Beta: Diff ( 3 files changed, 239 insertions(+), 3 deletions(-))
TF Validator: Diff ( 2 files changed, 3 insertions(+), 3 deletions(-))

@modular-magician
Collaborator

Tests analytics

Total tests: 2429
Passed tests: 2173
Skipped tests: 254
Failed tests: 2

Action taken

Triggering VCR tests in RECORDING mode for the following tests that failed during VCR:
TestAccContainerCluster_failedCreation|TestAccRegionInstanceGroupManager_stateful

@modular-magician
Collaborator

Tests passed during RECORDING mode:
TestAccContainerCluster_failedCreation [Debug log]

Tests failed during RECORDING mode:
TestAccRegionInstanceGroupManager_stateful[Error message] [Debug log]

Please fix these to complete your PR
View the build log or the debug log for each test

@trodge trodge requested a review from roaks3 January 25, 2023 21:26
Contributor

@roaks3 roaks3 left a comment

If this change will fix it, I would still add "Fixes hashicorp/terraform-provider-google#11431" to the PR description to get the auto-close behavior, but otherwise LGTM.

@rileykarson
Member

@trodge: I've been seeing this pop up on VCR runs through the day. Is it possible that the bootstrap functions mean we need to exclude this from VCR?

ericayyliu pushed a commit to ericayyliu/magic-modules that referenced this pull request Jul 26, 2023
…eCloudPlatform#7121)

* Fix an issue with resuming a failed container cluster creation and add a test

* Do not persist operation when resuming during read.
Development

Successfully merging this pull request may close these issues.

GKE creation error "Error waiting for resuming GKE cluster: Failed to create cluster"
5 participants