Adopted resource reconciler fails with already exists error message #886

Closed
ryansteakley opened this issue Aug 2, 2021 · 0 comments · Fixed by aws-controllers-k8s/runtime#41
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ryansteakley
Member

Describe the bug
The adopted resource test in the sagemaker-controller repo is very flaky. It frequently fails at this stage of the test: https://github.com/aws-controllers-k8s/sagemaker-controller/blob/871401b98cdf1eacd57486ed1464b8a49730bba3/test/e2e/tests/test_adopt_endpoint.py#L199

The controller logs show errors such as the following during adoption:

2021-07-28T06:36:00.973Z ERROR controller-runtime.controller Reconciler error {"controller": "adoptedresource", "request": "default/adopt-sdk-endpoint-crpyg28l0vtu74sitfd", " error": "endpoints.sagemaker.services.k8s.aws \"sdk-endpoint-crpyg28l0vtu74sitfd\" already exists"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90

2021-07-28T06:35:54.548Z ERROR controller-runtime.controller Reconciler error {"controller": "endpoint", "request": "default/sdk-endpoint-crpyg28l0vtu74sitfd", "error": "Operation cannot be fulfilled on endpoints.sagemaker.services.k8s.aws \"sdk-endpoint-crpyg28l0vtu74sitfd\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90

Steps to reproduce
Run the sagemaker-controller adopted resource test.

Expected outcome
Tests pass and no errors occur.

Environment
SageMaker, EKS, and a local kind cluster.

@ryansteakley ryansteakley added the kind/bug Categorizes issue or PR as related to a bug. label Aug 2, 2021
@RedbackThomson RedbackThomson changed the title from "Adoption Resource tests fail" to "Adopted resource reconciler fails with already exists error message" Aug 3, 2021
ack-bot pushed a commit to aws-controllers-k8s/runtime that referenced this issue Aug 4, 2021
Fixes: aws-controllers-k8s/community#886

The reconciler requeues indefinitely even when no changes have been made to the `spec` or `status`. This happens in both the adoption and resource reconciler types.

The root cause is `controller-runtime` requeuing the resource whenever we patch the `status` subresource; see [this issue](kubernetes-sigs/kubebuilder#618) for details. This bug was introduced for the resource reconciler as part of #39, since we now call `patchResourceStatus` on every reconcile loop.

This fix adds an event filter to each manager, with a predicate that only lets an update event through when the resource's generation has changed. The generation does not change unless the `spec` has been modified, so status-only patches no longer trigger another reconcile.
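
As a rough illustration of the predicate described above (not the code from the actual fix), the filter logic can be sketched in plain Go. The `object` and `updateEvent` types and the `generationChanged` function are hypothetical stand-ins invented for this example; they only model the two metadata fields that matter here.

```go
package main

import "fmt"

// object is a hypothetical stand-in for the metadata of a Kubernetes resource.
type object struct {
	Generation      int64  // per the commit message, changes only when the spec is modified
	ResourceVersion string // changes on every write, including status patches
}

// updateEvent is a hypothetical model of an update notification,
// carrying the object before and after the write.
type updateEvent struct {
	Old, New object
}

// generationChanged reports whether an update event should reach the
// reconciler. Status-only patches leave Generation untouched, so they
// are filtered out and no longer cause another reconcile.
func generationChanged(e updateEvent) bool {
	return e.Old.Generation != e.New.Generation
}

func main() {
	// A status patch: the resource version moves, the generation does not.
	statusPatch := updateEvent{
		Old: object{Generation: 1, ResourceVersion: "100"},
		New: object{Generation: 1, ResourceVersion: "101"},
	}
	// A spec edit: both the resource version and the generation move.
	specEdit := updateEvent{
		Old: object{Generation: 1, ResourceVersion: "101"},
		New: object{Generation: 2, ResourceVersion: "102"},
	}

	fmt.Println("status patch triggers reconcile:", generationChanged(statusPatch))
	fmt.Println("spec edit triggers reconcile:", generationChanged(specEdit))
}
```

In `controller-runtime` itself, a predicate of this kind ships as `predicate.GenerationChangedPredicate`, and the controller builder accepts predicates through `WithEventFilter`. Whether the fix uses exactly those two pieces is an assumption; the commit message only says that an event filter with a generation-changed predicate was added.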