Harden against disruptive loss of apiserver #58
Conversation
Hi @sanchezl. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/test pull-kube-storage-version-migrator-fully-automated-e2e
@sanchezl can you say more on how the storageState and the storageVersionMigration ended up being inconsistent?
manifests/namespace-rbac.yaml
Outdated
@@ -96,3 +96,22 @@ roleRef:
  kind: ClusterRole
  name: storage-version-migration-initializer
  apiGroup: rbac.authorization.k8s.io
+---
The storage-version-migration-migrator role gives the migrator the ability to get, list, update all resources, is that not enough?
@sanchezl fix
I0128 23:00:56.303078 1 round_trippers.go:443] PUT https://10.0.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterroles/edit 403 Forbidden in 39 milliseconds
I0128 23:00:56.303160 1 round_trippers.go:449] Response Headers:
I0128 23:00:56.303170 1 round_trippers.go:452] Audit-Id: aed0307a-49ed-4e0b-9514-92ed370669b7
I0128 23:00:56.303176 1 round_trippers.go:452] Content-Type: application/json
I0128 23:00:56.303182 1 round_trippers.go:452] Date: Tue, 28 Jan 2020 23:00:56 GMT
I0128 23:00:56.303455 1 request.go:1017] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"clusterroles.rbac.authorization.k8s.io \"edit\" is forbidden: user \"system:serviceaccount:kube-system:default\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:kube-system\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:
Can you remove the storage-version-migration-migrator role and fix the storage-version-migration-migrator role binding?
I agree that the migrator does need cluster-admin power. Otherwise it can hit the pasted error when migrating roles/clusterroles, due to RBAC privilege-escalation prevention.
metrics.Metrics.ObserveFailedMigration(resource(m).String())
return err
klog.Errorf("%v: migration failed: %v", m.Name, err)
nit: Can you move this log line to before line 127? I feel that's easier to read.
Done
@@ -93,6 +95,9 @@ func (m *migrator) Run() error {
			Continue: continueToken,
		},
	)
	if errors.IsNotFound(listError) {
This should happen only if the resource is deleted, e.g., the CRD is deleted when the migrator is migrating the CR. This should not be counted as a migration failure. I suggest that we log the error, but return nil.
(The current code isn't handling this correctly either.)
If someone creates a StorageVersionMigration for a CRD that does not exist, the migrator will be stuck here (unable to proceed due to its serial nature) unless we fail the migration.
Got it. Can you put this in a comment?
pkg/trigger/discovery_handler.go
Outdated
switch {
case len(migrations) == 0:
	// corresponding StorageVersionMigration resource missing
	relaunchMigration = true
if storageVersionChanged is false, why do we need to relaunch the migration?
@caesarxuchao StorageState created, but migration creation failed due to disruption.
Got it. I think #58 (comment) is more readable.
pkg/trigger/discovery_handler.go
Outdated
	relaunchMigration = true
case ss.Status.PersistedStorageVersionHashes[0] == migrationv1alpha1.Unknown:
	migration := migrations[0].(*migrationv1alpha1.StorageVersionMigration)
	if controller.HasCondition(migration, migrationv1alpha1.MigrationFailed) {
Can you describe what leads to this situation? The kubemigrator does not give up easily.
Feb 4 19:44:47.629: INFO: resource {migration.k8s.io storagestates} has persisted hashes [Unknown], and current hash 7abAo0yHdNM=
Feb 4 19:44:47.629: INFO: timed out waiting for the condition
- apiVersion: migration.k8s.io/v1alpha1
kind: StorageVersionMigration
metadata:
creationTimestamp: "2020-02-04T19:26:54Z"
generateName: storagestates.migration.k8s.io-
generation: 1
name: storagestates.migration.k8s.io-kk589
resourceVersion: "3574"
selfLink: /apis/migration.k8s.io/v1alpha1/storageversionmigrations/storagestates.migration.k8s.io-kk589
uid: 7dc9f088-aeac-49f8-bbf7-a31a25b55fdb
spec:
resource:
group: migration.k8s.io
resource: storagestates
version: v1alpha1
status:
conditions:
- lastUpdateTime: "2020-02-04T19:30:49Z"
message: '[Put https://10.0.0.1:443/apis/migration.k8s.io/v1alpha1/storagestates/jobs.batch:
unexpected EOF, Put https://10.0.0.1:443/apis/migration.k8s.io/v1alpha1/storagestates/statefulsets.apps:
unexpected EOF]'
status: "True"
type: Failed
- apiVersion: migration.k8s.io/v1alpha1
kind: StorageState
metadata:
creationTimestamp: "2020-02-04T19:26:54Z"
generation: 1
name: storagestates.migration.k8s.io
resourceVersion: "5683"
selfLink: /apis/migration.k8s.io/v1alpha1/storagestates/storagestates.migration.k8s.io
uid: 4301aff1-bb56-4a9f-8562-cac09ce5031a
spec:
resource:
group: migration.k8s.io
resource: storagestates
status:
currentStorageVersionHash: 7abAo0yHdNM=
lastHeartbeatTime: "2020-02-04T19:44:11Z"
persistedStorageVersionHashes:
- Unknown
Got it. I think the "Put https://10.0.0.1:443/apis/migration.k8s.io/v1alpha1/storagestates/jobs.batch: unexpected EOF" message means that the apiserver restarts in the middle of handling the PUT request from the migrator/core. We should let the migrator/core retry in that case, to avoid re-migrating the entire list of Batches.
Apart from that, I agree that we should launch a migration if ss.Status.PersistedStorageVersionHashes != ss.Status.CurrentStorageVersionHash && (there is no pending or running storageMigration). Instead of a switch block nested in the else clause, I would make another variable to capture this state, something like this:
func (mt *MigrationTrigger) processDiscoveryResource(r metav1.APIResource) {
	...
	stale := (getErr == nil && mt.staleStorageState(ss))
	notFound := (getErr != nil && errors.IsNotFound(getErr))
	// migrated checks if currentStorageVersionHash == persistedStorageVersionHashes.
	needMigration := !migrated(ss) && !pendingOrRunningMigrations(ss)
	// If the storage version has changed, we need to restart any running storageMigration.
	storageVersionChanged := (getErr == nil && ss.Status.CurrentStorageVersionHash != r.StorageVersionHash)
	if stale {
		...
	}
	if stale || needMigration || storageVersionChanged || notFound {
		// relaunch...
	}
}
@caesarxuchao Done. PTAL.
Force-pushed from 462c76c to 6f2c7a2.
pkg/controller/kubemigrator.go
Outdated
metrics.Metrics.ObserveFailedMigration(resource(m).String())
return err
return updateErr
return err
pkg/controller/kubemigrator.go
Outdated
	return err
}
_, err = km.updateStatus(m, migrationv1alpha1.MigrationFailed, err.Error())
klog.Errorf("%v: migration failed: %v", m.Name, err)
_, updateErr := km.updateStatus(m, migrationv1alpha1.MigrationFailed, err.Error())
if this is non-nil, call utilruntime.HandleError on it
Force-pushed from 6f2c7a2 to 00c6577.
/test pull-kube-storage-version-migrator-disruptive
Force-pushed from b038ea8 to 1c69f3f.
Some nits, otherwise lgtm.
pkg/trigger/discovery_handler.go
Outdated
stale := (getErr == nil && mt.staleStorageState(ss))
storageVersionChanged := (getErr == nil && ss.Status.CurrentStorageVersionHash != r.StorageVersionHash)
notFound := (getErr != nil && errors.IsNotFound(getErr))
found := getErr == nil || !errors.IsNotFound(getErr)
Can you remove the "|| !errors.IsNotFound(getErr)"? That case is handled by the return above and it's weird to call this condition "found" if it contains error.
@caesarxuchao Done
test/e2e/test-fully-automated.sh
Outdated
@@ -36,6 +36,16 @@ cleanup() {
		return
	fi
	pushd "${MIGRATOR_ROOT}"
+	echo "===== initializer logs"
+	kubectl logs --namespace=kube-system job/initializer || true
This test doesn't launch the initializer, right?
@caesarxuchao Done
test/e2e/test-fully-automated.sh
Outdated
@@ -59,3 +69,4 @@ pushd "${MIGRATOR_ROOT}"
make e2e-test
"${ginkgo}" -v "$@" "${MIGRATOR_ROOT}/test/e2e/e2e.test"
popd

remove the empty line.
@caesarxuchao Done
Force-pushed from 1c69f3f to c3cc678.
/approve
LGTM. Please fix one more nit and squash. Feel free to get others to lgtm if I don't get to it soon.
pkg/migrator/core.go
Outdated
@@ -93,6 +95,10 @@ func (m *migrator) Run() error {
			Continue: continueToken,
		},
	)
	if errors.IsNotFound(listError) {
		// fail this migration, we don't want to get stuck on a migration for a resource that does not exist)
nits:
s/fail/Fail
s/)/.
@caesarxuchao done
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: caesarxuchao, sanchezl. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Force-pushed from c3cc678 to 8740b3a.
/lgtm
Great fix! Thanks.
This fixes the flaky e2e tests. Loss of the apiserver at inopportune moments left the combined StorageState and StorageVersionMigration resources out of sync with regards to the migration trigger.