
Harden against disruptive loss of apiserver #58

Merged: 1 commit merged into kubernetes-sigs:master from the sigs-fix-flaky-tests branch on Mar 19, 2020

Conversation

sanchezl (Contributor)

This fixes the flaky e2e tests. Loss of the apiserver at inopportune moments left the combined StorageState and StorageVersionMigration resources out of sync with regard to the migration trigger.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 18, 2020
k8s-ci-robot (Contributor)

Hi @sanchezl. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 18, 2020
deads2k commented Feb 18, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 18, 2020
sanchezl (Contributor, Author)

/test pull-kube-storage-version-migrator-fully-automated-e2e

caesarxuchao (Member) left a comment

@sanchezl can you say more on how the StorageState and the StorageVersionMigration ended up being inconsistent?

@@ -96,3 +96,22 @@ roleRef:
kind: ClusterRole
name: storage-version-migration-initializer
apiGroup: rbac.authorization.k8s.io
---
Member:

The storage-version-migration-migrator role gives the migrator the ability to get, list, and update all resources; is that not enough?

Reply:

> The storage-version-migration-migrator role gives the migrator the ability to get, list, and update all resources; is that not enough?

@sanchezl fix

sanchezl (Author):

From https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kube-storage-version-migrator/47/pull-kube-storage-version-migrator-manually-launched-e2e/1222290262227685376/build-log.txt

I0128 23:00:56.303078       1 round_trippers.go:443] PUT https://10.0.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterroles/edit 403 Forbidden in 39 milliseconds
I0128 23:00:56.303160       1 round_trippers.go:449] Response Headers:
I0128 23:00:56.303170       1 round_trippers.go:452]     Audit-Id: aed0307a-49ed-4e0b-9514-92ed370669b7
I0128 23:00:56.303176       1 round_trippers.go:452]     Content-Type: application/json
I0128 23:00:56.303182       1 round_trippers.go:452]     Date: Tue, 28 Jan 2020 23:00:56 GMT
I0128 23:00:56.303455       1 request.go:1017] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"clusterroles.rbac.authorization.k8s.io \"edit\" is forbidden: user \"system:serviceaccount:kube-system:default\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:kube-system\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:

Member:

Can you remove the storage-version-migration-migrator role and fix the storage-version-migration-migrator role binding?

I agree that the migrator does need the cluster-admin power. Otherwise it can hit the pasted error when migrating roles/clusterroles due to privilege escalation.
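For illustration only, here is a minimal client-go sketch of what binding the migrator to cluster-admin amounts to. The actual change in this PR lives in the repo's YAML manifests; the ServiceAccount name below is a placeholder, and the Create call assumes client-go v0.18+ (where a context argument is required):

package main

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Bind the migrator's ServiceAccount to cluster-admin so that migrating
	// roles/clusterroles does not trip the RBAC privilege-escalation check.
	binding := &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "storage-version-migration-migrator"},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "ClusterRole",
			Name:     "cluster-admin",
		},
		Subjects: []rbacv1.Subject{{
			Kind:      "ServiceAccount",
			Name:      "migrator", // placeholder: whatever ServiceAccount the migrator deployment uses
			Namespace: "kube-system",
		}},
	}
	if _, err := client.RbacV1().ClusterRoleBindings().Create(context.TODO(), binding, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}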

metrics.Metrics.ObserveFailedMigration(resource(m).String())
return err
klog.Errorf("%v: migration failed: %v", m.Name, err)
Member:

nit: Can you move this log line to before line 127? I feel that's easier to read.

sanchezl (Author):

Done

@@ -93,6 +95,9 @@ func (m *migrator) Run() error {
Continue: continueToken,
},
)
if errors.IsNotFound(listError) {
Member:

This should happen only if the resource is deleted, e.g., the CRD is deleted when the migrator is migrating the CR. This should not be counted as a migration failure. I suggest that we log the error, but return nil.

(The current code isn't handling this correctly either.)

sanchezl (Author):

If someone creates a StorageVersionMigration for a CRD that does not exist, the migrator will be stuck here (unable to proceed due to its serial nature) unless we fail the migration.
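A minimal sketch of that trade-off inside the migrator's list loop, using the listError variable from the diff above (not necessarily the exact code merged here):

if errors.IsNotFound(listError) {
	// Fail the migration instead of retrying forever: the target resource
	// (e.g. a deleted CRD) no longer exists, and because the migrator works
	// serially, blocking here would stall every later migration.
	return listError
}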

Member:

Got it. Can you put this in a comment?

switch {
case len(migrations) == 0:
// corresponding StorageVersionMigration resource missing
relaunchMigration = true
Member:

if storageVersionChanged is false, why do we need to relaunch the migration?

sanchezl (Author):

@caesarxuchao The StorageState was created, but the creation of the corresponding migration failed due to the disruption.

Member:

Got it. I think #58 (comment) is more readable.

relaunchMigration = true
case ss.Status.PersistedStorageVersionHashes[0] == migrationv1alpha1.Unknown:
migration := migrations[0].(*migrationv1alpha1.StorageVersionMigration)
if controller.HasCondition(migration, migrationv1alpha1.MigrationFailed) {
Member:

Can you describe what leads to this situation? The kubemigrator does not give up easily.

sanchezl (Author):

See https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kube-storage-version-migrator/47/pull-kube-storage-version-migrator-disruptive/1224770718261055490/build-log.txt

Feb  4 19:44:47.629: INFO: resource {migration.k8s.io storagestates} has persisted hashes [Unknown], and current hash 7abAo0yHdNM=
Feb  4 19:44:47.629: INFO: timed out waiting for the condition
- apiVersion: migration.k8s.io/v1alpha1
  kind: StorageVersionMigration
  metadata:
    creationTimestamp: "2020-02-04T19:26:54Z"
    generateName: storagestates.migration.k8s.io-
    generation: 1
    name: storagestates.migration.k8s.io-kk589
    resourceVersion: "3574"
    selfLink: /apis/migration.k8s.io/v1alpha1/storageversionmigrations/storagestates.migration.k8s.io-kk589
    uid: 7dc9f088-aeac-49f8-bbf7-a31a25b55fdb
  spec:
    resource:
      group: migration.k8s.io
      resource: storagestates
      version: v1alpha1
  status:
    conditions:
    - lastUpdateTime: "2020-02-04T19:30:49Z"
      message: '[Put https://10.0.0.1:443/apis/migration.k8s.io/v1alpha1/storagestates/jobs.batch:
        unexpected EOF, Put https://10.0.0.1:443/apis/migration.k8s.io/v1alpha1/storagestates/statefulsets.apps:
        unexpected EOF]'
      status: "True"
      type: Failed
- apiVersion: migration.k8s.io/v1alpha1
  kind: StorageState
  metadata:
    creationTimestamp: "2020-02-04T19:26:54Z"
    generation: 1
    name: storagestates.migration.k8s.io
    resourceVersion: "5683"
    selfLink: /apis/migration.k8s.io/v1alpha1/storagestates/storagestates.migration.k8s.io
    uid: 4301aff1-bb56-4a9f-8562-cac09ce5031a
  spec:
    resource:
      group: migration.k8s.io
      resource: storagestates
  status:
    currentStorageVersionHash: 7abAo0yHdNM=
    lastHeartbeatTime: "2020-02-04T19:44:11Z"
    persistedStorageVersionHashes:
    - Unknown

Member:

Got it. I think the "Put https://10.0.0.1:443/apis/migration.k8s.io/v1alpha1/storagestates/jobs.batch: unexpected EOF" message means that the apiserver restarts in the middle of handling the PUT request from the migrator/core. We should let the migrator/core retry in that case, to avoid re-migrating the entire list of Batches.

Apart from that, I agree that we should launch a migration if ss.Status.PersistedStorageVersionHashes != ss.Status.CurrentStorageVersionHash && (there is no pending or running storageMigration). Instead of a switch block nested in the else clause, I would make another variable to capture this state, something like this:

func (mt *MigrationTrigger) processDiscoveryResource(r metav1.APIResource) {
	...
	stale := (getErr == nil && mt.staleStorageState(ss))
	notFound := (getErr != nil && errors.IsNotFound(getErr))
	// migrated checks if currentStorageVersionHash == persistedStorageVersionHashes
	needMigration := !migrated(ss) && !pendingOrRunningMigrations(ss)
	// If storage version has changed, we need to restart any running storageMigration.
	storageVersionChanged := (getErr == nil && ss.Status.CurrentStorageVersionHash != r.StorageVersionHash)
	if stale {
		...
	}
	if stale || needMigration || storageVersionChanged || notFound {
		relaunch...
	}
}
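To make the sketch above concrete, here is one possible reading of its two helper predicates. migrationv1alpha1 and controller are the packages already used in this PR's diffs; the predicate bodies and the MigrationSucceeded condition constant are my assumptions, not necessarily what was merged:

// (imports: the repo's migration.k8s.io/v1alpha1 API package and its controller package)

// migrated reports whether the persisted storage version hashes already match
// the current storage version hash, i.e. there is nothing left to migrate.
func migrated(ss *migrationv1alpha1.StorageState) bool {
	hashes := ss.Status.PersistedStorageVersionHashes
	return len(hashes) == 1 && hashes[0] == ss.Status.CurrentStorageVersionHash
}

// pendingOrRunningMigrations reports whether any StorageVersionMigration for
// this resource has not yet reached a terminal Succeeded or Failed condition.
// (The sketch above passes ss here; taking the migrations list keeps this
// helper self-contained.)
func pendingOrRunningMigrations(migrations []*migrationv1alpha1.StorageVersionMigration) bool {
	for _, m := range migrations {
		if !controller.HasCondition(m, migrationv1alpha1.MigrationSucceeded) &&
			!controller.HasCondition(m, migrationv1alpha1.MigrationFailed) {
			return true
		}
	}
	return false
}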
        

sanchezl (Author):

@caesarxuchao Done. PTAL.

metrics.Metrics.ObserveFailedMigration(resource(m).String())
return err
return updateErr
Review comment:

return err

return err
}
_, err = km.updateStatus(m, migrationv1alpha1.MigrationFailed, err.Error())
klog.Errorf("%v: migration failed: %v", m.Name, err)
_, updateErr := km.updateStatus(m, migrationv1alpha1.MigrationFailed, err.Error())
Review comment:

If this is non-nil, call utilruntime.HandleError.
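A minimal sketch of that suggestion in context; km, m, err, and updateStatus are the names from the diff above, and utilruntime is assumed to be k8s.io/apimachinery/pkg/util/runtime:

// Surface a failed status update without masking the original migration error.
if _, updateErr := km.updateStatus(m, migrationv1alpha1.MigrationFailed, err.Error()); updateErr != nil {
	utilruntime.HandleError(fmt.Errorf("%v: failed to update status: %v", m.Name, updateErr))
}
return err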

sanchezl (Contributor, Author)

/test pull-kube-storage-version-migrator-disruptive

sanchezl force-pushed the sigs-fix-flaky-tests branch 3 times, most recently from b038ea8 to 1c69f3f on March 17, 2020 at 20:56
caesarxuchao (Member) left a comment

Some nits, otherwise lgtm.

stale := (getErr == nil && mt.staleStorageState(ss))
storageVersionChanged := (getErr == nil && ss.Status.CurrentStorageVersionHash != r.StorageVersionHash)
notFound := (getErr != nil && errors.IsNotFound(getErr))
found := getErr == nil || !errors.IsNotFound(getErr)
Member:

Can you remove the "|| !errors.IsNotFound(getErr)"? That case is handled by the return above and it's weird to call this condition "found" if it contains error.
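Presumably the simplification being asked for is just:

// A non-NotFound getErr has already returned above, so at this point getErr
// is either nil or NotFound; "found" reduces to:
found := getErr == nil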

@@ -36,6 +36,16 @@ cleanup() {
return
fi
pushd "${MIGRATOR_ROOT}"
echo "===== initializer logs"
kubectl logs --namespace=kube-system job/initializer || true
Member:

This test doesn't launch the initializer, right?

@@ -59,3 +69,4 @@ pushd "${MIGRATOR_ROOT}"
make e2e-test
"${ginkgo}" -v "$@" "${MIGRATOR_ROOT}/test/e2e/e2e.test"
popd

Member:

remove the empty line.

caesarxuchao (Member) left a comment

/approve

LGTM. Please fix one more nit and squash. Feel free to get others to lgtm if I don't get to it soon.

@@ -93,6 +95,10 @@ func (m *migrator) Run() error {
Continue: continueToken,
},
)
if errors.IsNotFound(listError) {
// fail this migration, we don't want to get stuck on a migration for a resource that does not exist)
Member:

nits:
s/fail/Fail
s/)/.

k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, sanchezl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 19, 2020
deads2k commented Mar 19, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 19, 2020
@k8s-ci-robot k8s-ci-robot merged commit d0b0463 into kubernetes-sigs:master Mar 19, 2020
caesarxuchao (Member)

Great fix! Thanks.
