Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to controller-runtime 0.8.1 #4098

Conversation

charith-elastic
Copy link
Contributor

@charith-elastic charith-elastic commented Jan 7, 2021

This is a large PR mainly due to the controller-runtime Client interface switching to use client.Object instead of runtime.Object and the Reconciler interface changing to pass a context around. Due to these changes, I also took the opportunity to remove the k8s.Client abstraction we were using to ignore the context parameter.

The full list of changes can be seen at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.7.0

Other notable changes are:

  • Upgrade to Kubernetes 1.20 libraries
  • Upgrade to controller-tools 0.4.1

Leader election defaults to configmapleases in 0.7.0 but it is not supported in Kubernetes 1.13. I have opted to retain the configmaps method for now to maintain that support.

There's still some cleanup/optimization work to do due to the switch to client.Object and not ignoring the context parameter in API calls. These are hard to do in an automated fashion so I'll slowly go through the code base and raise PRs to fix them as necessary.

@charith-elastic charith-elastic added >enhancement Enhancement of existing functionality exclude-from-release-notes Exclude this PR from appearing in the release notes labels Jan 7, 2021
@charith-elastic charith-elastic marked this pull request as ready for review January 7, 2021 15:55
@charith-elastic charith-elastic changed the title Upgrade to controller-runtime 0.7.0 Upgrade to controller-runtime 0.8.1 Jan 25, 2021
Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to do some tests to understand the changes related to signals.SetupSignalHandler()

First I noticed that when I change the configuration file it is not always fsnotify.Write or fsnotify.Create which are raised. I added a log message and found out that it can also be CHMOD:

{"log.level":"info","@timestamp":"2021-01-25T09:34:42.859Z","log.logger":"manager","message":"OnConfigChange","service.version":"1.5.0-SNAPSHOT+1b8dd4ab","service.type":"eck","ecs.version":"1.4.0","evt":"\"/conf/..2021_01_25_09_34_42.455914645\": CHMOD"}

But with the proposed code change, even if CREATE is raised the operator does not exit:

{"log.level":"info","@timestamp":"2021-01-25T09:35:51.934Z","log.logger":"manager","message":"OnConfigChange","service.version":"1.5.0-SNAPSHOT+1b8dd4ab","service.type":"eck","ecs.version":"1.4.0","evt":"\"/conf/..2021_01_25_09_35_51.940876528\": CREATE"}
{"log.level":"info","@timestamp":"2021-01-25T09:35:51.934Z","log.logger":"manager","message":"Shutting down to apply updated configuration","service.version":"1.5.0-SNAPSHOT+1b8dd4ab","service.type":"eck","ecs.version":"1.4.0"}

// I guess that the Pod should have been terminated, but it is still running:

{"log.level":"debug","@timestamp":"2021-01-25T09:35:54.646Z","log.logger":"resource","message":"Saving","service.version":"1.5.0-SNAPSHOT+1b8dd4ab","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","configmap_name":"elastic-licensing","license_info":{"Timestamp":"2021-01-25T09:35:54Z","EckLicenseLevel":"basic","TotalManagedMemory":0,"MaxEnterpriseResourceUnits":0,"EnterpriseResourceUnits":0}}
{"log.level":"info","@timestamp":"2021-01-25T09:35:54.646Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.5.0-SNAPSHOT+1b8dd4ab","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}
{"log.level":"debug","@timestamp":"2021-01-25T09:37:54.646Z","log.logger":"resource","message":"Saving","service.version":"1.5.0-SNAPSHOT+1b8dd4ab","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","configmap_name":"elastic-licensing","license_info":{"Timestamp":"2021-01-25T09:37:54Z","EckLicenseLevel":"basic","TotalManagedMemory":0,"MaxEnterpriseResourceUnits":0,"EnterpriseResourceUnits":0}}
{"log.level":"info","@timestamp":"2021-01-25T09:37:54.646Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.5.0-SNAPSHOT+1b8dd4ab","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}
{"log.level":"debug","@timestamp":"2021-01-25T09:39:54.646Z","log.logger":"resource","message":"Saving","service.version":"1.5.0-SNAPSHOT+1b8dd4ab","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","configmap_name":"elastic-licensing","license_info":{"Timestamp":"2021-01-25T09:39:54Z","EckLicenseLevel":"basic","TotalManagedMemory":0,"MaxEnterpriseResourceUnits":0,"EnterpriseResourceUnits":0}}
{"log.level":"info","@timestamp":"2021-01-25T09:39:54.646Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.5.0-SNAPSHOT+1b8dd4ab","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}

I'm still trying to understand why but I wanted to share my findings.

Makefile Show resolved Hide resolved
@barkbay
Copy link
Contributor

barkbay commented Jan 25, 2021

First I noticed that when I change the configuration file it is not always fsnotify.Write or fsnotify.Create which are raised. I added a log message and found out that it can also be CHMOD:

It turns out that it can also be RENAME (I'm testing on GKE):

{"log.level":"info","@timestamp":"2021-01-25T10:20:10.842Z","log.logger":"manager","message":"OnConfigChange","service.version":"1.5.0-SNAPSHOT+598c8d5a","service.type":"eck","ecs.version":"1.4.0","evt":"\"/conf/..data_tmp\": RENAME"}

I waited several minutes, it is the only event I saw when I updated the config file. The operator has not been restarted.
I'm wondering if we should not open an issue to investigate what kind of events are expected and maybe allow all of them, wdyt ?

@charith-elastic
Copy link
Contributor Author

@barkbay I think I found the reason why the operator does not restart on config change. When the config update signal is received, the code waits for an error to be returned from the error channel. Obviously, that never happens. This is the change: d8f633a

@charith-elastic
Copy link
Contributor Author

I think the full PR build was disabled some time ago 😞

Looks like there's another bug in the controller-runtime log package. It only manifests when the tests are run in race mode. I think I found the problem and I will raise a PR to fix it. Since it's only an issue with the logger and doesn't have a real impact on the correctness of the operator, I don't think it's worth holding up this PR based on that. So please continue your review and ignore the PR build failure for the time being.

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm halfway through the review process.
In the meantime I manually started a full e2e session on one of my GKE cluster.

cmd/manager/main.go Show resolved Hide resolved
pkg/utils/k8s/client.go Show resolved Hide resolved

return f(context.Background())
func (fc failingClient) List(ctx context.Context, list client.ObjectList, opts ...client.ListOption) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: parameters can be removed or replaced with _ when not used. That being said since those functions are pretty obvious I'm not sure it's really worth it.

Suggested change
func (fc failingClient) List(ctx context.Context, list client.ObjectList, opts ...client.ListOption) error {
func (fc failingClient) List(_ context.Context, _ client.ObjectList, _ ...client.ListOption) error {

pkg/controller/association/reconciler_test.go Show resolved Hide resolved
pkg/controller/common/license/watches.go Show resolved Hide resolved
pkg/controller/common/service_control.go Show resolved Hide resolved
pkg/controller/elasticsearch/label/label.go Outdated Show resolved Hide resolved
go.mod Outdated Show resolved Hide resolved
pkg/controller/association/reconciler.go Outdated Show resolved Hide resolved
pkg/controller/association/reconciler_test.go Show resolved Hide resolved
pkg/controller/common/license/watches.go Show resolved Hide resolved
Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost lgtm, still need to run some e2e tests and review the observer refactoring.

image

@@ -382,18 +382,70 @@ spec:
spec:
description: Spec is the specification of the service.
properties:
allocateLoadBalancerNodePorts:
description: allocateLoadBalancerNodePorts defines if NodePorts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not from your PR but these line breaks seem to be a bug: kubernetes-sigs/kustomize#947

@@ -27,131 +27,131 @@ func Test_defaultDriver_maybeForceUpgradePods(t *testing.T) {
{
name: "no pods to upgrade",
actualPods: []corev1.Pod{
sset.TestPod{Name: "pod1", StatefulSetName: "ssetA", Ready: false}.Build(),
sset.TestPod{Name: "pod2", StatefulSetName: "ssetB", Ready: true}.Build(),
sset.TestPod{Name: "pod1", StatefulSetName: "ssetA", Ready: false, ResourceVersion: "999"}.Build(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just hardcode ResourceVersion in the Build() function if we only use that value ? I don't feel it brings value to have it here.

	return corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Namespace:       t.Namespace,
			Name:            t.Name,
			Labels:          labels,
			ResourceVersion: "999",
		},
		Status: status,
	}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept it configurable in case the future tests we add would actually need to modify the resource version. For the existing tests, yes, I agree that it can be hard-coded.

pkg/license/reporter.go Outdated Show resolved Hide resolved
pkg/utils/compare/objects.go Outdated Show resolved Hide resolved
pkg/utils/k8s/client.go Outdated Show resolved Hide resolved
pkg/utils/k8s/k8sutils.go Show resolved Hide resolved
Copy link
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I managed to go through everything once. It looks good to me. I think the observer lock refactoring is fine, but maybe a comment would be nice to motivate it.

pkg/controller/remoteca/watches.go Outdated Show resolved Hide resolved
@charith-elastic
Copy link
Contributor Author

The reason the PR build is failing is because the race detector is not happy with a recent change to the controller-runtime: kubernetes-sigs/controller-runtime#1359 (comment)

The way to make the tests pass would be to call log.SetLogger when tests are run. The problem is that because Go executes tests package-by-package, there's no way to globally set the logger at the start of a test run without actually modifying every package. I have opted for a different approach based on an environment variable: d36d8d9

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E2E tests are still running on my cluster 😄

NAMESPACE                  NAME                                                        READY   STATUS      RESTARTS   AGE
e2e-82cmu-elastic-system   e2e-82cmu-operator-0                                        1/1     Running     1          5h19m

I guess it would have already failed if something was really wrong.
So, I think I'm 👍 with those changes.

Nice work @charith-elastic !

.ci/setenvconfig Outdated
@@ -392,6 +402,7 @@ BUILD_NUMBER=$BUILD_NUMBER
E2E_PROVIDER=gke
CLUSTER_NAME=eck-pr-$BUILD_NUMBER
TEST_OPTS=-race
E2E_TEST_LOG_LEVEL=1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! 🤦🏼

// Introduced mainly as a workaround for a controller-runtime bug.
// https://github.com/kubernetes-sigs/controller-runtime/issues/1359#issuecomment-767413330
// However, it is still useful in general to adjust the log level during test runs.
if logLevel, err := strconv.Atoi(os.Getenv(testLogLevelEnvVar)); err == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wonder can we not just use the regular log verbosity setting to work around this issue? --log-verbosity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, that wouldn't work because the tests don't go through the main entrypoint -- where that flag is handled. 😞

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, ok. But don't you have to pass on this additional environment variable through the Docker image build and/or the e2e k8s job definition for this to work in non-local e2e test runs?

@charith-elastic
Copy link
Contributor Author

Jenkins test this please

.ci/setenvconfig Outdated Show resolved Hide resolved
@charith-elastic
Copy link
Contributor Author

Sorry about the commit spam. The PR is now ready to merge (I hope). Please have a look.

Thank you very much for wading through it @barkbay and @pebrc. I know it wasn't easy.

Copy link
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work!

@@ -450,7 +447,7 @@ e2e-generate-xml:

# Verify e2e tests compile with no errors, don't run them
e2e-compile:
ECK_TEST_LOG_LEVEL=$(ECK_TEST_LOG_LEVEL) go test ./test/e2e/... -run=dryrun -tags=$(E2E_TAGS) $(TEST_OPTS) > /dev/null
ECK_TEST_LOG_LEVEL=$(LOG_VERBOSITY) go test ./test/e2e/... -run=dryrun -tags=$(E2E_TAGS) $(TEST_OPTS) > /dev/null
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed out of band this one is probably not necessary.

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still LGTM 👍

@charith-elastic charith-elastic merged commit a0fe450 into elastic:master Jan 27, 2021
@charith-elastic charith-elastic deleted the enhancement/controller-runtime-upgrade branch January 27, 2021 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement Enhancement of existing functionality exclude-from-release-notes Exclude this PR from appearing in the release notes v1.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants