Skip to content
This repository has been archived by the owner on Jun 19, 2022. It is now read-only.

StackDriver metrics e2e test does not work in PROW #317

Closed
nachocano opened this issue Oct 3, 2019 · 30 comments
Closed

StackDriver metrics e2e test does not work in PROW #317

nachocano opened this issue Oct 3, 2019 · 30 comments
Assignees
Labels
area/observability area/test-and-release Test infrastructure, tests or release priority/2 Nice to have feature but doesn't block current release defined by release/*
Milestone

Comments

@nachocano
Copy link
Member

  • I can consistently execute it locally and the test passes. In PROW, it doesn't. Needs further investigation...

GOOGLE_APPLICATION_CREDENTIALS=<path_to_json> E2E_PROJECT_ID=<id> go test --tags=e2e ./test/e2e/... -run ^TestStorageStackDriverMetrics$ -count 1
ok github.com/google/knative-gcp/test/e2e 156.576s ? github.com/google/knative-gcp/test/e2e/metrics [no test files]

@nachocano
Copy link
Member Author

/assign

@nachocano
Copy link
Member Author

Log from PROW

--- FAIL: TestStorageStackDriverMetrics (144.18s)
lifecycle.go:54: namespace is : "test-storage-stack-driver-metrics-nglsxtiv"
test_storage.go:278: Last term message => {"success":true}
test_storage.go:303: sleeping 1m0s to make sure metrics were pushed to stackdriver
test_storage.go:336: Filter expression: metric.type="custom.googleapis.com/cloud.run/source/event_count" metric.label.event_source="//storage.googleapis.com/buckets/storage-e2e-test-knative-boskos-17" metric.label.namespace_name="test-storage-stack-driver-metrics-nglsxtiv" resource.type="global" metric.label.resource_group="storages.events.cloud.run" metric.label.event_type="google.storage.object.finalize" metric.label.name="storage-e2e-test-knative-boskos-17-storage" metric.label.response_code="500" metric.label.response_code_class="5xx"
test_storage.go:346: failed to iterate over result: rpc error: code = NotFound desc = The metric referenced by the provided filter is unknown. Check the metric name and labels.
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1d23950]

@nachocano
Copy link
Member Author

/cc @chizhg

@chizhg
Copy link
Member

chizhg commented Oct 3, 2019

Can this be relevant?

@chizhg
Copy link
Member

chizhg commented Oct 3, 2019

/assign @chaodaiG
Chao is oncall this week and he's willing to help :-)

@knative-prow-robot
Copy link
Contributor

@chizhg: GitHub didn't allow me to assign the following users: chaodaiG.

Note that only google members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @chaodaiG
Chao is oncall this week and he's willing to help :-)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chizhg
Copy link
Member

chizhg commented Oct 3, 2019

/cc @chaodaiG

@chaodaiG
Copy link
Contributor

chaodaiG commented Oct 4, 2019

This test failed the same way on my local desktop as in Prow, this is my first time running e2e tests from knative-gcp on my machine. The only difference is I invoked the test same way as in Prow, which is our standard way of testing:
./test/e2e-tests.sh --gcp-project [MYPROJECT]

Please check e2e-tests.sh and e2e-common.sh to see if there is any setup missing for running this test

@adrcunha
Copy link
Contributor

adrcunha commented Oct 4, 2019

Initial post clearly shows that the test was running against an already existing cluster. Tests running on Prow (or through e2e-tests.sh) create a new, clean cluster by default. Definitely that sounds like the cluster used to run the tests in the initial post has some extra configuration(s) that are not present in the test clusters created by the E2E test framework.

@nachocano
Copy link
Member Author

Please check e2e-tests.sh and e2e-common.sh to see if there is any setup missing for running this test

I've added some things regarding giving the service account editor permissions for monitoring... I've tried few other things in this WIP https://github.com/google/knative-gcp/pull/318/files, but still the same error.... I don't think it's a permission issue, as if I don't have the permissions, I would be getting a different error...

Do you have access to the pantheon UI to see if we can see the metrics there?

@nachocano
Copy link
Member Author

hmmm, this might be it... monitoring is not enabled perhaps?

gcloud services enable monitoring

@chaodaiG
Copy link
Contributor

chaodaiG commented Oct 4, 2019

Seems like the case:

if (( ! IS_PROW )); then
echo "Set up ServiceAccount for Pub/Sub Admin"
gcloud services enable pubsub.googleapis.com

@chaodaiG
Copy link
Contributor

chaodaiG commented Oct 4, 2019

we probably should add gcloud services enable monitoring right above the code in my previous comment

@nachocano
Copy link
Member Author

we probably should add gcloud services enable monitoring right above the code in my previous comment

yes! added it and running it in #320
let's see if that makes it work

@nachocano
Copy link
Member Author

hmmm still no luck...

@chaodaiG
Copy link
Contributor

chaodaiG commented Oct 4, 2019

I switched to your branch in #320 and ran it again(with gcloud services enable monitoring change) and it passed, it might be your refactoring as it didn't pass on the branch from #318(I may need to double check though). One thing I noticed is that the metrics were sent to stackdriver: https://pantheon.corp.google.com/logs/viewer?project=knative-boskos-36&folder&organizationId&minLogLevel=0&expandAll=false&timestamp=2019-10-04T17:07:51.997000000Z&customFacets=&limitCustomFacetWidth=true&dateRangeStart=2019-10-04T16:11:08.660Z&interval=PT1H&resource=audited_resource&advancedFilter=%22%2F%2Fstorage.googleapis.com%2Fbuckets%2Fstorage-e2e-test-knative-boskos-36%22&scrollTimestamp=2019-10-04T16:55:32.068502370Z&dateRangeEnd=2019-10-04T17:11:08.660Z

EDIT: The log above was from the failed run in Prow from PR #320

@chizhg
Copy link
Member

chizhg commented Oct 4, 2019

we probably should add gcloud services enable monitoring right above the code in my previous comment

BTW, monitoring is already enabled on boskos project, so we do not need to enable it if running on Prow.

@chaodaiG
Copy link
Contributor

chaodaiG commented Oct 4, 2019

One of the possibility is that our service account might not have enough permission for querying Stackdriver logs. @chizhg , any idea how to check this?

@nachocano
Copy link
Member Author

One of the possibility is that our service account might not have enough permission for querying Stackdriver logs. @chizhg , any idea how to check this?

I removed the permissions in my local cluster, and the error message is different. I added monitoring editor permissions to the service account in the sh scripts

@nachocano
Copy link
Member Author

I switched to your branch in #320 and ran it again(with gcloud services enable monitoring change) and it passed, it might be your refactoring as it didn't pass on the branch from #318(I may need to double check though). One thing I noticed is that the metrics were sent to stackdriver: https://pantheon.corp.google.com/logs/viewer?project=knative-boskos-36&folder&organizationId&minLogLevel=0&expandAll=false&timestamp=2019-10-04T17:07:51.997000000Z&customFacets=&limitCustomFacetWidth=true&dateRangeStart=2019-10-04T16:11:08.660Z&interval=PT1H&resource=audited_resource&advancedFilter=%22%2F%2Fstorage.googleapis.com%2Fbuckets%2Fstorage-e2e-test-knative-boskos-36%22&scrollTimestamp=2019-10-04T16:55:32.068502370Z&dateRangeEnd=2019-10-04T17:11:08.660Z

EDIT: The log above was from the failed run in Prow from PR #320

In #320 I don't have some of the changes of #318... Let me enable monitoring in #318, which is a simpler change, and check again

@chizhg
Copy link
Member

chizhg commented Oct 4, 2019

One of the possibility is that our service account might not have enough permission for querying Stackdriver logs. @chizhg , any idea how to check this?

Editor should have enough permissions to query Stackdriver logs, see https://cloud.google.com/logging/docs/access-control

@chaodaiG
Copy link
Contributor

chaodaiG commented Oct 4, 2019

I switched to your branch in #320 and ran it again(with gcloud services enable monitoring change) and it passed, it might be your refactoring as it didn't pass on the branch from #318(I may need to double check though). One thing I noticed is that the metrics were sent to stackdriver: https://pantheon.corp.google.com/logs/viewer?project=knative-boskos-36&folder&organizationId&minLogLevel=0&expandAll=false&timestamp=2019-10-04T17:07:51.997000000Z&customFacets=&limitCustomFacetWidth=true&dateRangeStart=2019-10-04T16:11:08.660Z&interval=PT1H&resource=audited_resource&advancedFilter=%22%2F%2Fstorage.googleapis.com%2Fbuckets%2Fstorage-e2e-test-knative-boskos-36%22&scrollTimestamp=2019-10-04T16:55:32.068502370Z&dateRangeEnd=2019-10-04T17:11:08.660Z
EDIT: The log above was from the failed run in Prow from PR #320

In #320 I don't have some of the changes of #318... Let me enable monitoring in #318, which is a simpler change, and check again

I just checked out your branch from #318, and noticed that the enabling monitoring line was added inside the block which only excecute when running locally, it passed on my machine and I have left a comment in #318, please update and try again

@nachocano
Copy link
Member Author

giving it another try! thanks for the catch, it was my bad

@akashrv akashrv added priority/2 Nice to have feature but doesn't block current release defined by release/* release/1 labels Oct 7, 2019
@nachocano
Copy link
Member Author

still doesn't work in PROW... moving this up... @chizhg or @chaodaiG let me know if you guys have sometime this week to sit together and check the UI in the test cluster to see if the metric is actually being reported...

@adrcunha
Copy link
Contributor

adrcunha commented Oct 8, 2019

You should have project viewer access to the boskos project used by your test.

@nachocano nachocano assigned zargarpur and unassigned nachocano Mar 16, 2020
@nachocano
Copy link
Member Author

@chizhg we will need your help with this... Would you be able to sync with @zargarpur sometime soonish? This blocks all our E2E tests for metrics

@chizhg
Copy link
Member

chizhg commented Mar 17, 2020

Sure, I should have some time to help later this week.

@nachocano nachocano added this to the v0.14.0-M2 milestone Mar 18, 2020
@nachocano
Copy link
Member Author

We should sync with serving folks , especially @yanweiguo, as he has been adding E2E tests on their side, and may have encountered this before

@bskaplan
Copy link
Contributor

/assign

@grantr grantr modified the milestones: v0.14.0-M2, v0.14.0-M3 Apr 1, 2020
@zargarpur zargarpur removed their assignment Apr 1, 2020
@grantr grantr modified the milestones: v0.14.0-M3, v0.15.0-M2, Backlog Apr 15, 2020
@ericlem ericlem added priority/2 Nice to have feature but doesn't block current release defined by release/* and removed priority/1 Blocks current release defined by release/* label or blocks current milestone release/1 labels Apr 30, 2020
@ericlem ericlem removed the release/1 label May 14, 2020
@zargarpur
Copy link
Contributor

To follow up on this, I've tested out running some E2E tests for GCP Broker metrics in another PR and it was successfully able to assert that they exist, so I don't think this is a PROW issue (anymore at least)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/observability area/test-and-release Test infrastructure, tests or release priority/2 Nice to have feature but doesn't block current release defined by release/*
Projects
None yet