
Harden against disruptive loss of apiserver #58

Merged: 1 commit merged into kubernetes-sigs:master from the sigs-fix-flaky-tests branch on Mar 19, 2020

Conversation

sanchezl (Contributor)

This fixes the flaky e2e tests. Loss of the apiserver at inopportune moments left the combined StorageState and StorageVersionMigration resources out of sync with regard to the migration trigger.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 18, 2020
k8s-ci-robot (Contributor)

Hi @sanchezl. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 18, 2020
deads2k commented Feb 18, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 18, 2020
sanchezl (Contributor, Author)

/test pull-kube-storage-version-migrator-fully-automated-e2e

caesarxuchao (Member) left a comment

@sanchezl can you say more on how the StorageState and the StorageVersionMigration ended up being inconsistent?

@@ -96,3 +96,22 @@ roleRef:
kind: ClusterRole
name: storage-version-migration-initializer
apiGroup: rbac.authorization.k8s.io
---
Member:

The storage-version-migration-migrator role gives the migrator the ability to get, list, and update all resources; is that not enough?

Reply:

> The storage-version-migration-migrator role gives the migrator the ability to get, list, and update all resources; is that not enough?

@sanchezl fix

sanchezl (Author):

From https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kube-storage-version-migrator/47/pull-kube-storage-version-migrator-manually-launched-e2e/1222290262227685376/build-log.txt

I0128 23:00:56.303078       1 round_trippers.go:443] PUT https://10.0.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterroles/edit 403 Forbidden in 39 milliseconds
I0128 23:00:56.303160       1 round_trippers.go:449] Response Headers:
I0128 23:00:56.303170       1 round_trippers.go:452]     Audit-Id: aed0307a-49ed-4e0b-9514-92ed370669b7
I0128 23:00:56.303176       1 round_trippers.go:452]     Content-Type: application/json
I0128 23:00:56.303182       1 round_trippers.go:452]     Date: Tue, 28 Jan 2020 23:00:56 GMT
I0128 23:00:56.303455       1 request.go:1017] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"clusterroles.rbac.authorization.k8s.io \"edit\" is forbidden: user \"system:serviceaccount:kube-system:default\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:kube-system\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:

Member:

Can you remove the storage-version-migration-migrator role and fix the storage-version-migration-migrator role binding?

I agree that the migrator does need the cluster-admin power. Otherwise it can hit the pasted error when migrating roles/clusterroles due to privilege escalation.
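For illustration only, here is a minimal client-go sketch of what binding the migrator to cluster-admin amounts to. The actual change in this PR lives in the repo's YAML manifests; the ServiceAccount name below is a placeholder, and the Create call assumes client-go v0.18+ (where a context argument is required):

package main

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Bind the migrator's ServiceAccount to cluster-admin so that migrating
	// roles/clusterroles does not trip the RBAC privilege-escalation check.
	binding := &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "storage-version-migration-migrator"},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "ClusterRole",
			Name:     "cluster-admin",
		},
		Subjects: []rbacv1.Subject{{
			Kind:      "ServiceAccount",
			Name:      "migrator", // placeholder: whatever ServiceAccount the migrator deployment uses
			Namespace: "kube-system",
		}},
	}
	if _, err := client.RbacV1().ClusterRoleBindings().Create(context.TODO(), binding, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}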

metrics.Metrics.ObserveFailedMigration(resource(m).String())
return err
klog.Errorf("%v: migration failed: %v", m.Name, err)
Member:

nit: Can you move this log line to before line 127? I feel that's easier to read.

sanchezl (Author):

Done

@@ -93,6 +95,9 @@ func (m *migrator) Run() error {
Continue: continueToken,
},
)
if errors.IsNotFound(listError) {
Member:

This should happen only if the resource is deleted, e.g., the CRD is deleted when the migrator is migrating the CR. This should not be counted as a migration failure. I suggest that we log the error, but return nil.

(The current code isn't handling this correctly either.)

sanchezl (Author):

If someone creates a StorageVersionMigration for a CRD that does not exist, the migrator will be stuck here (unable to proceed due to its serial nature) unless we fail the migration.
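A minimal sketch of that trade-off inside the migrator's list loop, using the listError variable from the diff above (not necessarily the exact code merged here):

if errors.IsNotFound(listError) {
	// Fail the migration instead of retrying forever: the target resource
	// (e.g. a deleted CRD) no longer exists, and because the migrator works
	// serially, blocking here would stall every later migration.
	return listError
}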

Member:

Got it. Can you put this in a comment?

switch {
case len(migrations) == 0:
// corresponding StorageVersionMigration resource missing
relaunchMigration = true
Member:

if storageVersionChanged is false, why do we need to relaunch the migration?

sanchezl (Author):

@caesarxuchao The StorageState was created, but the creation of the corresponding migration failed due to the disruption.

Member:

Got it. I think #58 (comment) is more readable.

relaunchMigration = true
case ss.Status.PersistedStorageVersionHashes[0] == migrationv1alpha1.Unknown:
migration := migrations[0].(*migrationv1alpha1.StorageVersionMigration)
if controller.HasCondition(migration, migrationv1alpha1.MigrationFailed) {
Member:

Can you describe what leads to this situation? The kubemigrator does not give up easily.

sanchezl (Author):

See https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kube-storage-version-migrator/47/pull-kube-storage-version-migrator-disruptive/1224770718261055490/build-log.txt

Feb  4 19:44:47.629: INFO: resource {migration.k8s.io storagestates} has persisted hashes [Unknown], and current hash 7abAo0yHdNM=
Feb  4 19:44:47.629: INFO: timed out waiting for the condition
- apiVersion: migration.k8s.io/v1alpha1
  kind: StorageVersionMigration
  metadata:
    creationTimestamp: "2020-02-04T19:26:54Z"
    generateName: storagestates.migration.k8s.io-
    generation: 1
    name: storagestates.migration.k8s.io-kk589
    resourceVersion: "3574"
    selfLink: /apis/migration.k8s.io/v1alpha1/storageversionmigrations/storagestates.migration.k8s.io-kk589
    uid: 7dc9f088-aeac-49f8-bbf7-a31a25b55fdb
  spec:
    resource:
      group: migration.k8s.io
      resource: storagestates
      version: v1alpha1
  status:
    conditions:
    - lastUpdateTime: "2020-02-04T19:30:49Z"
      message: '[Put https://10.0.0.1:443/apis/migration.k8s.io/v1alpha1/storagestates/jobs.batch:
        unexpected EOF, Put https://10.0.0.1:443/apis/migration.k8s.io/v1alpha1/storagestates/statefulsets.apps:
        unexpected EOF]'
      status: "True"
      type: Failed
- apiVersion: migration.k8s.io/v1alpha1
  kind: StorageState
  metadata:
    creationTimestamp: "2020-02-04T19:26:54Z"
    generation: 1
    name: storagestates.migration.k8s.io
    resourceVersion: "5683"
    selfLink: /apis/migration.k8s.io/v1alpha1/storagestates/storagestates.migration.k8s.io
    uid: 4301aff1-bb56-4a9f-8562-cac09ce5031a
  spec:
    resource:
      group: migration.k8s.io
      resource: storagestates
  status:
    currentStorageVersionHash: 7abAo0yHdNM=
    lastHeartbeatTime: "2020-02-04T19:44:11Z"
    persistedStorageVersionHashes:
    - Unknown

Member:

Got it. I think the "Put https://10.0.0.1:443/apis/migration.k8s.io/v1alpha1/storagestates/jobs.batch: unexpected EOF" message means that the apiserver restarts in the middle of handling the PUT request from the migrator/core. We should let the migrator/core retry in that case, to avoid re-migrating the entire list of Batches.

Apart from that, I agree that we should launch a migration if ss.Status.PersistedStorageVersionHashes != ss.Status.CurrentStorageVersionHash && (there is no pending or running storageMigration). Instead of a switch block nested in the else clause, I would make another variable to capture this state, something like this:

func (mt *MigrationTrigger) processDiscoveryResource(r metav1.APIResource) {
	...
	stale := (getErr == nil && mt.staleStorageState(ss))
	notFound := (getErr != nil && errors.IsNotFound(getErr))
	// migrated checks if currentStorageVersionHash == persistedStorageVersionHashes
	needMigration := !migrated(ss) && !pendingOrRunningMigrations(ss)
	// If storage version has changed, we need to restart any running storageMigration.
	storageVersionChanged := (getErr == nil && ss.Status.CurrentStorageVersionHash != r.StorageVersionHash)
	if stale {
		...
	}
	if stale || needMigration || storageVersionChanged || notFound {
		relaunch...
	}
}
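To make the sketch above concrete, here is one possible reading of its two helper predicates. migrationv1alpha1 and controller are the packages already used in this PR's diffs; the predicate bodies and the MigrationSucceeded condition constant are my assumptions, not necessarily what was merged:

// (imports: the repo's migration.k8s.io/v1alpha1 API package and its controller package)

// migrated reports whether the persisted storage version hashes already match
// the current storage version hash, i.e. there is nothing left to migrate.
func migrated(ss *migrationv1alpha1.StorageState) bool {
	hashes := ss.Status.PersistedStorageVersionHashes
	return len(hashes) == 1 && hashes[0] == ss.Status.CurrentStorageVersionHash
}

// pendingOrRunningMigrations reports whether any StorageVersionMigration for
// this resource has not yet reached a terminal Succeeded or Failed condition.
// (The sketch above passes ss here; taking the migrations list keeps this
// helper self-contained.)
func pendingOrRunningMigrations(migrations []*migrationv1alpha1.StorageVersionMigration) bool {
	for _, m := range migrations {
		if !controller.HasCondition(m, migrationv1alpha1.MigrationSucceeded) &&
			!controller.HasCondition(m, migrationv1alpha1.MigrationFailed) {
			return true
		}
	}
	return false
}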
        

sanchezl (Author):

@caesarxuchao Done. PTAL.

metrics.Metrics.ObserveFailedMigration(resource(m).String())
return err
return updateErr
Review comment:

return err

return err
}
_, err = km.updateStatus(m, migrationv1alpha1.MigrationFailed, err.Error())
klog.Errorf("%v: migration failed: %v", m.Name, err)
_, updateErr := km.updateStatus(m, migrationv1alpha1.MigrationFailed, err.Error())
Review comment:

If this is non-nil, call utilruntime.HandleError.
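A minimal sketch of that suggestion in context; km, m, err, and updateStatus are the names from the diff above, and utilruntime is assumed to be k8s.io/apimachinery/pkg/util/runtime:

// Surface a failed status update without masking the original migration error.
if _, updateErr := km.updateStatus(m, migrationv1alpha1.MigrationFailed, err.Error()); updateErr != nil {
	utilruntime.HandleError(fmt.Errorf("%v: failed to update status: %v", m.Name, updateErr))
}
return err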

sanchezl (Contributor, Author)

/test pull-kube-storage-version-migrator-disruptive

sanchezl force-pushed the sigs-fix-flaky-tests branch 3 times, most recently from b038ea8 to 1c69f3f on March 17, 2020 at 20:56
caesarxuchao (Member) left a comment

Some nits, otherwise lgtm.

stale := (getErr == nil && mt.staleStorageState(ss))
storageVersionChanged := (getErr == nil && ss.Status.CurrentStorageVersionHash != r.StorageVersionHash)
notFound := (getErr != nil && errors.IsNotFound(getErr))
found := getErr == nil || !errors.IsNotFound(getErr)
Member:

Can you remove the "|| !errors.IsNotFound(getErr)"? That case is handled by the return above and it's weird to call this condition "found" if it contains error.
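Presumably the simplification being asked for is just:

// A non-NotFound getErr has already returned above, so at this point getErr
// is either nil or NotFound; "found" reduces to:
found := getErr == nil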

@@ -36,6 +36,16 @@ cleanup() {
return
fi
pushd "${MIGRATOR_ROOT}"
echo "===== initializer logs"
kubectl logs --namespace=kube-system job/initializer || true
Member:

This test doesn't launch the initializer, right?

@@ -59,3 +69,4 @@ pushd "${MIGRATOR_ROOT}"
make e2e-test
"${ginkgo}" -v "$@" "${MIGRATOR_ROOT}/test/e2e/e2e.test"
popd

Member:

remove the empty line.

caesarxuchao (Member) left a comment

/approve

LGTM. Please fix one more nit and squash. Feel free to get others to lgtm if I don't get to it soon.

@@ -93,6 +95,10 @@ func (m *migrator) Run() error {
Continue: continueToken,
},
)
if errors.IsNotFound(listError) {
// fail this migration, we don't want to get stuck on a migration for a resource that does not exist)
Member:

nits:
s/fail/Fail
s/)/.

k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, sanchezl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 19, 2020
deads2k commented Mar 19, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 19, 2020
@k8s-ci-robot k8s-ci-robot merged commit d0b0463 into kubernetes-sigs:master Mar 19, 2020
caesarxuchao (Member)

Great fix! Thanks.
