B/R e2e add velero restic DC workaround, describe, namespace events, wait for CSI snapshot to be ready, IsDCReady wait for builds, azure-rg #654

Merged May 2, 2022 (34 commits)

@kaovilai kaovilai commented Apr 29, 2022

Adds several enhancements to the backup/restore (B/R) e2e test suite:

  • Restic DC (DeploymentConfig) restore workaround
  • Print namespace events to detect install failures
  • Add B/R describe output after the backup/restore is done
  • Wait for the CSI snapshot to be ready after backing up, before deleting the workload and restoring
  • Update IsDCReady to wait for BuildConfig builds
  • Fix the label used to get Velero pods
  • Failure logs from backup/restore are now retrieved via downloadrequest
  • Fix Bug: Azure E2E using -rg resource group #658

Restic DC restore workaround
Allows us to add data verification in the future. Currently we only verify that the app is running and responsive.
To verify data, we would need to add a step such as PreBackupDataEntry (or similar) after PreBackupVerification, which checks that the app is running.

Namespace Events
This is useful for debugging pods that fail to start or get stuck.
Example of an event that could come up before #650 is merged, if the secret contains a carriage return. It would be hard to dig this out of the artifacts.

Event: Error: couldn't find key access_key in Secret openshift-adp/oadp-ts-example-velero-1-aws-registry-secret, Src: kubelet, Reason: Failed
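
A minimal sketch of how namespace events might be rendered into log lines like the one above. The `nsEvent` struct and `formatEvent` helper are hypothetical names used for illustration only; the real suite reads events from the cluster, which is not shown here.

```go
package main

import "fmt"

// nsEvent is a hypothetical, trimmed-down view of a Kubernetes event.
// It is an assumption for this sketch, not the type used by the suite.
type nsEvent struct {
	Message string
	Type    string
	Count   int
	Source  string
	Reason  string
}

// formatEvent renders one event as a single log line.
func formatEvent(e nsEvent) string {
	return fmt.Sprintf("Event: %s, Type: %s, Count: %d, Src: %s, Reason: %s",
		e.Message, e.Type, e.Count, e.Source, e.Reason)
}

func main() {
	// Example values are made up for the demonstration.
	events := []nsEvent{
		{Message: "Error: couldn't find key access_key in Secret", Type: "Warning", Count: 3, Source: "kubelet", Reason: "Failed"},
	}
	for _, e := range events {
		fmt.Println(formatEvent(e))
	}
}
```

Printing every event in the namespace when an install fails keeps the cause next to the failing test in the e2e log instead of buried in artifacts.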

B/R describe
Helps diagnose B/R failures when there is nothing in the restore logs. This may be removed after vmware-tanzu/velero#4743.

Wait for CSI snapshot to be ready (Uncovered by Namespace events enhancements)
On restore, the snapshot may not yet be ready to be used as the DataSource for a PVC, as shown at #654 (comment)

Event: failed to provision volume with StorageClass "gp2-csi": error getting handle for DataSource Type VolumeSnapshot by Name velero-mysql-l5sz5: snapshot velero-mysql-l5sz5 is not Ready, Type: Warning, Count: 10, Src: {PersistentVolumeClaim mysql-persistent mysql 72d98c8b-10ff-4a32-a319-88bed69e1227 v1 34225 }, Reason: ProvisioningFailed

We should wait for the CSI snapshot to be ready before uninstalling the application.

Example output

2022/05/01 16:19:06 Backup for case mysql-e2e succeeded
2022/05/01 16:19:13 waiting for volume snapshot velero-mysql-6s8sm to be ready
...
2022/05/01 16:20:44 waiting for volume snapshot velero-mysql-6s8sm to be ready
2022/05/01 16:20:44 Uninstalling app for case mysql-e2e

Update IsDCReady to wait for BuildConfig.
We should wait for all builds to complete before considering the DC ready for the backup, disaster simulation, and restore steps.
Fixes #646

Example output

2022/05/01 16:25:56 Installing application for case parks-e2e
2022/05/01 16:25:57 Build is not ready: restify-1
...
2022/05/01 16:26:52 Build is not ready: restify-1
2022/05/01 16:27:02 Running pre-backup function for case parks-e2e
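
A rough sketch of the readiness rule described above: the workload only counts as ready once its replicas are up and no build is still in progress. The `build` struct and both helpers are hypothetical simplifications, not the actual IsDCReady code; the only assumption borrowed from OpenShift is that a finished build reports the phase `Complete`.

```go
package main

import "fmt"

// build is a hypothetical, simplified stand-in for an OpenShift Build.
type build struct {
	Name  string
	Phase string
}

// pendingBuilds returns the names of builds whose phase is not "Complete".
func pendingBuilds(builds []build) []string {
	var pending []string
	for _, b := range builds {
		if b.Phase != "Complete" {
			pending = append(pending, b.Name)
		}
	}
	return pending
}

// readyForBackup reports whether all desired replicas are ready and
// every build has completed. It logs each build that is still pending.
func readyForBackup(readyReplicas, desiredReplicas int, builds []build) bool {
	pending := pendingBuilds(builds)
	for _, name := range pending {
		fmt.Printf("Build is not ready: %s\n", name)
	}
	return readyReplicas == desiredReplicas && len(pending) == 0
}

func main() {
	builds := []build{{Name: "restify-1", Phase: "Running"}}
	fmt.Println("ready:", readyForBackup(1, 1, builds))

	builds[0].Phase = "Complete"
	fmt.Println("ready:", readyForBackup(1, 1, builds))
}
```

In a test suite this check would typically be polled until it returns true or a timeout expires.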

Failure logs from backup/restore are now done via downloadrequest
This prevents pod errors that are unrelated to the backup or restore from showing up.
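
As an illustration only, once the backup or restore log text has been downloaded, failure lines could be picked out with a small filter like the one below. How the log is fetched through a DownloadRequest is not shown, and the matching rule (any line containing the substring `error`) is an assumption for this sketch; `errorLines` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// errorLines returns the log lines that contain the substring "error".
// The matching rule is an assumption made for this sketch.
func errorLines(logs string) []string {
	var out []string
	for _, line := range strings.Split(logs, "\n") {
		if strings.Contains(line, "error") {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Made-up log text for the demonstration.
	logs := "level=info msg=\"Backing up item\"\n" +
		"level=error msg=\"Error backing up item\"\n"
	for _, line := range errorLines(logs) {
		fmt.Println(line)
	}
}
```

Because the input is the log of one specific backup or restore, errors from other pods never reach the filter.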

kaovilai commented Apr 30, 2022

Welp. Looks like I opened a new can of worms here
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oadp-operator/654/pull-ci-openshift-oadp-operator-master-4.9-operator-e2e-aws/1520234660829859840#1:build-log.txt%3A362

Event: Error creating: pods "mysql-6bb6964964-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount, provider "velero-privileged": Forbidden: not usable by user or serviceaccount], Type: Warning, Count: 6, Src: {ReplicaSet mysql-persistent mysql-6bb6964964 39a26745-ee02-4441-be87-7e694f07100b apps/v1 34184 }, Reason: FailedCreate
Event: failed to provision volume with StorageClass "gp2-csi": error getting handle for DataSource Type VolumeSnapshot by Name velero-mysql-l5sz5: snapshot velero-mysql-l5sz5 is not Ready, Type: Warning, Count: 10, Src: {PersistentVolumeClaim mysql-persistent mysql 72d98c8b-10ff-4a32-a319-88bed69e1227 v1 34225 }, Reason: ProvisioningFailed

Fixing "error getting handle for DataSource Type VolumeSnapshot by Name velero-mysql-l5sz5: snapshot velero-mysql-l5sz5 is not Ready" in #657

kaovilai commented Apr 30, 2022

Browsing the Azure container for the backup shows a resource group issue coming from #582:

time="2022-04-30T03:54:36Z" level=error msg="Error backing up item" backup=openshift-adp/mysql-e2e-0cd8e63e-c839-11ec-88cf-0a580a81136e error="error getting volume info: rpc error: code = Unknown desc = compute.DisksClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code=\"ResourceGroupNotFound\" Message=\"Resource group '-rg' could not be found.\"" logSource="pkg/backup/backup.go:417" name=mysql-6bb6964964-kbwxj

It looks like

CI_AZURE_RESOURCE_GROUP="${CI_AZURE_RESOURCE_GROUP}-rg"; \

is not getting the value from $AZURE_RESOURCE_FILE, so the variable ends up as just "-rg".
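
To make the failure mode concrete, here is a small Go sketch of a guard that refuses to build a resource group name from an empty base. It is not the fix applied in this PR (the actual change to the CI setup is not shown in this thread); `resourceGroupFor` and the sample base name are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// resourceGroupFor appends the "-rg" suffix to a base name.
// It returns an error for an empty base instead of producing "-rg".
func resourceGroupFor(base string) (string, error) {
	base = strings.TrimSpace(base)
	if base == "" {
		return "", errors.New("resource group base name is empty; refusing to use \"-rg\"")
	}
	return base + "-rg", nil
}

func main() {
	// "ci-op-example" is a made-up base name for the demonstration.
	for _, base := range []string{"", "ci-op-example"} {
		rg, err := resourceGroupFor(base)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("resource group:", rg)
	}
}
```

Failing early on an empty value surfaces the misconfiguration before Velero reports ResourceGroupNotFound during the backup.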

kaovilai commented May 1, 2022

/retest

@kaovilai kaovilai changed the title from "Print namespace events when installing apps fail" to "B/R Suite Test adds restic DC workaround, namespace events, B/R describe, wait for CSI snapshot to be ready" May 1, 2022
@kaovilai kaovilai force-pushed the namespace_events_app_suite_test branch from 60c9620 to 7b8ff54 May 1, 2022 15:38
@kaovilai kaovilai changed the title from "B/R Suite Test adds restic DC workaround, namespace events, B/R describe, wait for CSI snapshot to be ready" to "B/R Suite test restic DC workaround, namespace events, describe, wait for CSI snapshot to be ready, IsDCReady wait for builds" May 1, 2022
@kaovilai kaovilai force-pushed the namespace_events_app_suite_test branch from c74bdc2 to e1cd727 May 1, 2022 22:48
kaovilai commented May 1, 2022

Once all tests except Azure pass, I will merge #659 into this PR and close that one out to verify. I also want to see whether the Azure resource group errors (in the Velero backup logs) now show up properly. The Velero deployment label change caused failure logs not to show up in the e2e logs.

kaovilai commented May 2, 2022

The resource group error (#658) is now output in the e2e log: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oadp-operator/654/pull-ci-openshift-oadp-operator-master-4.10-operator-e2e-azure/1520908535485960192#1:build-log.txt%3A400

  Expected
      <[]string | len:2, cap:2>: [
          "time=\"2022-05-02T00:34:54Z\" level=info msg=\"1 errors encountered backup up item\" backup=openshift-adp/mysql-e2e-800070e1-c9af-11ec-9321-0a580a832067 logSource=\"pkg/backup/backup.go:413\" name=mysql-59b8bbcdc8-h9g7d",
          "time=\"2022-05-02T00:34:54Z\" level=error msg=\"Error backing up item\" backup=openshift-adp/mysql-e2e-800070e1-c9af-11ec-9321-0a580a832067 error=\"error getting volume info: rpc error: code = Unknown desc = compute.DisksClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code=\\\"ResourceGroupNotFound\\\" Message=\\\"Resource group '-rg' could not be found.\\\"\" logSource=\"pkg/backup/backup.go:417\" name=mysql-59b8bbcdc8-h9g7d",
      ]
  to equal
      <[]string | len:0, cap:0>: []
  In [It] at: /go/src/github.com/openshift/oadp-operator/tests/e2e/backup_restore_suite_test.go:141

kaovilai commented May 2, 2022

@kaovilai: all tests passed!

@openshift-ci openshift-ci bot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) May 2, 2022
@kaovilai kaovilai force-pushed the namespace_events_app_suite_test branch from a6ec56c to 641bd1f May 2, 2022 16:02
@openshift-ci openshift-ci bot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) May 2, 2022
@kaovilai kaovilai changed the title from "B/R Suite test restic DC workaround, namespace events, describe, wait for CSI snapshot to be ready, IsDCReady wait for builds" to "B/R e2e add velero restic DC workaround, describe, namespace events, wait for CSI snapshot to be ready, IsDCReady wait for builds, azure-rg" May 2, 2022

@deepakraj1997 deepakraj1997 left a comment

LGTM

openshift-ci bot commented May 2, 2022

@kaovilai: all tests passed!

@shubham-pampattiwar shubham-pampattiwar left a comment

VISACK !

@kaovilai kaovilai merged commit 0068184 into openshift:master May 2, 2022
Successfully merging this pull request may close these issues.

  • Bug: Azure E2E using -rg resource group
  • Bug: parks-app crashes after being restored when hitting the api
3 participants