
[POC] certs: simplified test scenarios for a race when multiple CertRotationControllers are used #1763

Open · wants to merge 1 commit into base: master
Conversation

p0lyn0mial
Contributor

This PR adds a series of unit tests that demonstrate where the race condition is.

The description of the problem can be found at https://docs.google.com/document/d/1iOy-qYFj2mXnDxLPbkPic7ylNMf1QY6YctnUFjHOj14/edit

@openshift-ci openshift-ci bot requested review from deads2k and soltysh July 12, 2024 14:55

openshift-ci bot commented Jul 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: p0lyn0mial
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Jul 12, 2024

@p0lyn0mial: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/verify | ffa77ba | link | true | /test verify |
| ci/prow/unit | ffa77ba | link | true | /test unit |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@p0lyn0mial
Contributor Author

/cc @vrutkovs @tkashem


```go
secondSigner, signerUpdated, err := target.EnsureSigningCertKeyPair(context.TODO())
require.NoError(t, err)
require.True(t, signerUpdated)
```
Member

Why would we expect the signer to be updated here?


```go
firstSigner, signerUpdated, err := target.EnsureSigningCertKeyPair(context.TODO())
require.NoError(t, err)
require.True(t, signerUpdated)
```
Member

We don't cache the signer here. If proper caching is added, this will no longer generate a different commonName signer:

```go
	signerSecret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "signer",
			Namespace: "ns",
		},
	}
	ensureMetadataUpdate(signerSecret, signer.Owner, signer.AdditionalAnnotations)
	err = setSigningCertKeyPairSecret(signerSecret, signer.Validity)
	require.NoError(t, err)
	coreV1Indexer.Add(signerSecret)
```


```go
firstSecretCommonName := firstSignerSecret.Annotations[CertificateIssuer]
secondSecretCommonName := secondSignerSecret.Annotations[CertificateIssuer]
require.Equal(t, firstSecretCommonName, secondSecretCommonName)
```
Member
@vrutkovs Jul 22, 2024

This relies on the common name including a unix timestamp, so it would pass in the unit test but not in a real run.

Adding `time.Sleep(time.Second * 1)` between the `setSigningCertKeyPairSecret` calls "fixes" the test.

Contributor Author

Creating two different signers with the same common name is possible in a real run.
In fact, this already happened in the e2e test you wrote.

See: #1722 (comment)

Member

Right. What I'm saying is that this test is not really indicative of the fix; the actual solution would prevent this situation from happening, so later on there is no way to fix the test.

Contributor Author

I don't intend to merge this PR. The tests here helped me understand where the race condition is, as described in the linked docs.

Contributor Author

> the actual solution would prevent this situation from happening - so later on there is no way to fix the test

This is not true; switching from apply to update will not change the function that generates the common name.

Member
@vrutkovs Jul 22, 2024

> We don't need to pass an additional RV. The RV is read from the resource. Give it a try.

In that case just using an "optimistic lock" is insufficient.

CABundle uses applyConfigMap, which is also an Update:

```go
actual, err := client.ConfigMaps(required.Namespace).Update(ctx, existingCopy, metav1.UpdateOptions{})
```
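The "optimistic lock" under discussion is the apiserver's resourceVersion precondition on Update. The following toy in-memory store (`toyStore` and `toyObject` are hypothetical, not client-go types) sketches why a stale read is rejected with a conflict instead of silently overwriting a concurrent write:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// toyObject stands in for a Kubernetes resource; only the fields needed to
// illustrate optimistic locking are modeled.
type toyObject struct {
	ResourceVersion string
	Data            string
}

var errConflict = errors.New("conflict: resourceVersion mismatch")

// toyStore mimics the apiserver's optimistic-concurrency check: an update is
// accepted only when the caller's resourceVersion matches the stored one,
// and each accepted write bumps the stored resourceVersion.
type toyStore struct {
	current toyObject
}

func (s *toyStore) update(obj toyObject) error {
	if obj.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	rv, _ := strconv.Atoi(s.current.ResourceVersion)
	obj.ResourceVersion = strconv.Itoa(rv + 1)
	s.current = obj
	return nil
}

func main() {
	store := &toyStore{current: toyObject{ResourceVersion: "1", Data: "bundle-v1"}}

	// Controller A and controller B both read the object at version "1".
	a, b := store.current, store.current

	// A updates first and wins; the stored resourceVersion advances to "2".
	a.Data = "bundle-from-A"
	fmt.Println(store.update(a)) // <nil>

	// B still holds resourceVersion "1", so its update is rejected rather
	// than clobbering A's write - the point of the optimistic lock.
	b.Data = "bundle-from-B"
	fmt.Println(store.update(b)) // conflict: resourceVersion mismatch
}
```

As the comment above notes, this mechanism alone is insufficient here: it prevents silent overwrites, but the losing controller still retries and can regenerate a different signer.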

Contributor Author

The apply function is more like an overwrite; we need to change the controller/struct/type to issue an update for a resource it has currently reconciled.

Member

> we need to change the controller/struct/type to issue an update for a resource it has currently reconciled

This is what it's doing now, isn't it? I don't quite understand what needs fixing, or why the bug resurfaces for this controller only - it sounds like it could have happened for any library-go user, but it didn't?

In any case, we have an alternative solution which has unit tests and passing e2e tests - #1722

Contributor Author

> I don't quite understand what needs fixing and why the bug resurfaces for this controller only

Have a look at the doc linked in the description of this PR. It contains more details about the possible race condition.

> In any case, we have an alternative solution which has unit tests and passing e2e tests - #1722

#1722 is not going to solve the issue because the controllers run across multiple processes.

@vrutkovs why don't you sync with @deads2k or @soltysh on #1722 and ask them for their feedback?

Member

#1722 fixes the thread level and openshift/cluster-kube-apiserver-operator#1669 will fix process level races

```go
_, err = fakeKubeClient.CoreV1().ConfigMaps("ns").Get(context.Background(), "trust-bundle", metav1.GetOptions{})
require.NoError(t, err)
// by not adding the caBundleConfigMap
// to the lister, we simulate the race condition
```
Member
@vrutkovs Jul 22, 2024

This isn't correct. Not adding it to the lister ensures it will be created from scratch, which is different from a "racing update", as the new object doesn't have the necessary metadata.

So this test is showing a race between "controller tries to create a new configmap" and "controller tries to update the configmap", not "two controllers try to update the configmap simultaneously". Of course, in this case it won't be able to preserve the first signer - it's been lost forever in the unit test.
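The create-vs-update race being described can be sketched in miniature. The `bundle` type and `ensure` function below are hypothetical stand-ins for the CA-bundle configmap and `EnsureConfigMapCABundle`'s create-or-update decision, not library-go's API:

```go
package main

import "fmt"

// bundle stands in for the CA bundle configmap; only the signer list
// matters for this sketch.
type bundle struct{ signers []string }

// ensure mimics a create-or-update decision made against the controller's
// lister view (cached) rather than the live object on the apiserver.
func ensure(cached *bundle, signer string) *bundle {
	if cached == nil {
		// Lister miss: the controller takes the "create" path and builds
		// the bundle from scratch, dropping any signers already persisted
		// on the live object - the data loss described in the comment.
		return &bundle{signers: []string{signer}}
	}
	merged := append(append([]string{}, cached.signers...), signer)
	return &bundle{signers: merged}
}

func main() {
	// First reconcile creates the bundle on the apiserver.
	live := ensure(nil, "first-signer")
	_ = live

	// Second reconcile runs before the informer syncs, so its cached view
	// is still nil and it re-creates the bundle, losing the first signer.
	live = ensure(nil, "second-signer")
	fmt.Println(live.signers) // [second-signer]

	// With a synced cache, the second signer would have been merged instead.
	merged := ensure(&bundle{signers: []string{"first-signer"}}, "second-signer")
	fmt.Println(merged.signers) // [first-signer second-signer]
}
```

This is why the test's omission of the configmap from the lister exercises the create-vs-update race rather than two concurrent updates.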


```go
res, err = target.EnsureConfigMapCABundle(context.TODO(), secondSigner)
require.NoError(t, err)
// by not adding the caBundleConfigMap
```
Member

I don't understand what the fix would be here. There is no way to fix EnsureConfigMapCABundle to make it pass, as we intentionally ignore the result of this function call.

@p0lyn0mial
Contributor Author

These tests are simplified. I don't intend to merge this PR.
The tests here helped me understand where the race condition is, as described in the linked docs.
