Update to OpenCensus version 0.22.3 #1446

Merged
merged 4 commits into from
Apr 16, 2020

Conversation

aLekSer
Collaborator

@aLekSer aLekSer commented Apr 2, 2020

This should fix the Prometheus error messages raised when exporting Allocator metrics with an empty tag list.

Takes over PR #893
Part of #1330
Closes #892

Fixes this error, which appeared periodically in the logs:

textPayload: "2020/02/07 15:14:14 Failed to export to Prometheus: inconsistent label cardinality: expected 1 label values but got 0 in []string(nil)
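
For reference, a minimal sketch of the kind of check that yields a message of this shape when a metric is exported with fewer label values than declared label names. It is an illustration only: `checkCardinality` and the label name `fleet_name` are hypothetical and are not taken from the Prometheus client, OpenCensus, or Agones.

```go
package main

import "fmt"

// checkCardinality is a hypothetical helper, not a library function.
// It returns an error when the number of label values differs from the
// number of declared label names.
func checkCardinality(labelNames, labelValues []string) error {
	if len(labelNames) != len(labelValues) {
		return fmt.Errorf(
			"inconsistent label cardinality: expected %d label values but got %d in %#v",
			len(labelNames), len(labelValues), labelValues)
	}
	return nil
}

func main() {
	// "fleet_name" is an illustrative label name.
	names := []string{"fleet_name"}

	// A nil slice of values mirrors the empty tag list described above.
	fmt.Println(checkCardinality(names, nil))

	// Supplying one value per declared name passes the check.
	fmt.Println(checkCardinality(names, []string{"none"}))
}
```

The second call shows why a placeholder value per tag key avoids the mismatch: the number of values then matches the number of declared names.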

@agones-bot
Collaborator

Build Failed 😱

Build Id: 756bedfa-a170-43fe-81ba-cc0b82d3eccd

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 886a4d3f-701d-439a-b665-90959e7607a7

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 2a354681-7c89-4083-954a-3733796c5ea1

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: cb45e65e-87e9-4f40-a0f7-8e1ede5fe740

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1446/head:pr_1446 && git checkout pr_1446
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.5.0-82fd75e

@aLekSer aLekSer marked this pull request as ready for review April 3, 2020 10:02
@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 05af2bc5-7f24-48e6-b9d8-27cf385c4e7f

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1446/head:pr_1446 && git checkout pr_1446
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.5.0-88fddd5

@aLekSer
Collaborator Author

aLekSer commented Apr 3, 2020

Tested this change with Grafana and Prometheus.
The inconsistent label cardinality error no longer appears in the logs.
#1330 (comment)

@markmandel
Member

@roberthbailey @pooneh-m should we aim to merge this in this release, or hold it until next release to give it time to bake? (Since RC is next Tuesday)

I'm inclined to push this to the next release, just to give it time to be tested over time. WDYT?

@aLekSer
Collaborator Author

aLekSer commented Apr 3, 2020

Putting this on hold. The test on Stackdriver failed:

2020/04/03 16:11:56 Failed to export to Stackdriver: rpc error: code = InvalidArgument desc = Field timeSeries[17].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| does not have at least one entry.

@aLekSer
Collaborator Author

aLekSer commented Apr 3, 2020

Strangely enough, I fixed a similar issue a while ago:
#554
Testing an update of the Stackdriver exporter:

-       contrib.go.opencensus.io/exporter/stackdriver v0.8.0
+       contrib.go.opencensus.io/exporter/stackdriver v0.13.1

@roberthbailey
Member

@aLekSer - can you split the vendor changes into a separate commit from the changes to pkg/ to make it easier to see how much impact there is on our code?

@markmandel - at first glance this seems to be making non-trivial code changes to our metric exporting, which makes me tend to agree to push it to the next release to give it time to bake. On the other hand, how much extra testing does that actually get us? How much testing of this code path happens during our e2es? And would anyone test a pre-release build to verify the functionality on top of what the author and reviewer do? If not, and it isn't destabilizing, then that would argue for getting it in now, since we would only get feedback on how well it works once it's released.

@aLekSer
Collaborator Author

aLekSer commented Apr 3, 2020

A similar error is also mentioned here:
census-ecosystem/opencensus-go-exporter-stackdriver#234
@roberthbailey I am going to split these two changes properly (./pkg/ and go get) and will try to do it inside this branch.

@markmandel markmandel added feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) area/operations Installation, updating, metrics etc kind/cleanup Refactoring code, fixing up documentation, etc labels Apr 7, 2020
@markmandel
Member

Just taking a look - at first glance LGTM, @cyriltovena do you have time to double check this, since it was originally your code?

@cyriltovena
Collaborator

I'll run through it quickly now

tag.Insert(keyFleetName, "none"),
tag.Insert(keyNodeName, "none"),
tag.Insert(keyStatus, "none"),
tag.Insert(keyMultiCluster, ""),
Collaborator

I'm guessing this is now allowed to be empty?

Collaborator Author

Insert() should add this tag: https://github.com/census-instrumentation/opencensus-go/blob/master/tag/map.go#L107:6
I will check tomorrow how this applies to these metrics, and will probably add a test for it.

Collaborator Author

@cyriltovena This change breaks the Stackdriver exporter, so I re-added "none" for now. I think this change could be done separately.

Collaborator

@cyriltovena cyriltovena left a comment

LGTM. I guess you tested this with our dashboards to make sure they still work?

Really glad we don't have to do all those hacky moves anymore and that most of the issues are fixed.

I guess this buys us some time before we have to move to OpenTelemetry. 😁

@cyriltovena
Collaborator

[10 screenshots attached]

@cyriltovena
Collaborator

cyriltovena commented Apr 11, 2020

The only one that does not work is the go-client cache, but I assume I did not need the cache for this tryout? The code looks unchanged, and the metrics come from the k8s client.

@aLekSer
Collaborator Author

aLekSer commented Apr 12, 2020

Thanks so much, @cyriltovena, for providing the screenshots. I still need to fix the Stackdriver error message that remains in this PR:

2020/04/03 16:11:56 Failed to export to Stackdriver: rpc error: code = InvalidArgument desc = Field timeSeries[17].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| does not have at least one entry.

This should fix Prometheus error messages on exporting with empty tags
list in Allocator Metrics.
@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 8101da27-e52d-4d2d-9fc4-7482c1ff117a

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1446/head:pr_1446 && git checkout pr_1446
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.5.0-e4c126a

@aLekSer
Collaborator Author

aLekSer commented Apr 12, 2020

Even with "none" in the tag values, export to Stackdriver is still not working.
Example of Stackdriver output for version 1.4.0 as a reference point:
Stackdriver output

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: ae78ffcf-f440-4cc4-82d9-f9d130542c0f

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1446/head:pr_1446 && git checkout pr_1446
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.5.0-5147a06

@aLekSer
Collaborator Author

aLekSer commented Apr 13, 2020

I debugged the OpenCensus Stackdriver exporter code a bit and added more logging:

E 2020-04-13T07:28:37.571709353Z 2020/04/13 07:28:37 Failed to export this to Stackdriver 0: gameservers_count
E 2020-04-13T07:28:37.571744133Z 2020/04/13 07:28:37 Failed to export this to Stackdriver 1: k8s_client_workqueue_unfinished_work_seconds
E 2020-04-13T07:28:37.571749105Z 2020/04/13 07:28:37 Failed to export this to Stackdriver 2: fleets_replicas_count
E 2020-04-13T07:28:37.571752888Z 2020/04/13 07:28:37 Failed to export this to Stackdriver 3: k8s_client_workqueue_work_duration_seconds
E 2020-04-13T07:28:37.571796974Z 2020/04/13 07:28:37 Failed to export to Stackdriver: rpc error: code = InvalidArgument desc = Field timeSeries[8].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| does not have at least one entry.

Stackdriver export is also not working with the most recent Stackdriver exporter.

However, the Stackdriver error goes away only when I comment out the whole file: https://github.com/googleforgames/agones/blob/master/pkg/metrics/kubernetes_client.go

I need to understand which metric, or group of metrics, leads to this situation, because an exporter error such as timeSeries[8] does not help identify the faulty metric.

@aLekSer
Collaborator Author

aLekSer commented Apr 13, 2020

Narrowing it down: the k8s.io/client-go/util/workqueue metrics (workqueue.SetProvider(c)) are failing with the Stackdriver exporter.

@aLekSer
Collaborator Author

aLekSer commented Apr 13, 2020

I was able to fix uploading through the Stackdriver exporter by changing the aggregation parameter from view.Distribution(0) or view.Distribution() (both cause an error like the one above) to something like:

	runtime.Must(view.Register(&view.View{
		Name:        "k8s_client_workqueue_latency_seconds",
		Measure:     workQueueLatencyStats,
		Description: "How long an item stays in the work queue.",
		Aggregation: view.Distribution(0.0001),
		TagKeys:     []tag.Key{keyQueueName},
	}))

Going to put this in a separate PR.

Here is a good example of how the metric above is registered elsewhere. It uses an aggregation with a number of buckets rather than view.Distribution(0), which yields only 2 buckets:

return measureView(wp.Latency, view.Distribution(BucketsNBy10(1e-08, 10)...))

https://github.com/knative/pkg/blob/4e57475bc87c1aba3e7fcc0207b593dc7186eb83/metrics/workqueue.go#L78
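
A minimal sketch of generating several bucket bounds, in the spirit of the call above. `bucketsByTen` and `positiveBounds` are hypothetical helpers written for this illustration; they are not the knative or OpenCensus implementations. Treating a zero bound as unusable is an assumption about the cause of the Stackdriver error that this thread does not confirm.

```go
package main

import "fmt"

// bucketsByTen is a hypothetical helper. It returns n bounds starting at
// low, each ten times the previous one.
func bucketsByTen(low float64, n int) []float64 {
	if n <= 0 {
		return nil
	}
	bounds := make([]float64, 0, n)
	for b := low; len(bounds) < n; b *= 10 {
		bounds = append(bounds, b)
	}
	return bounds
}

// positiveBounds keeps only bounds greater than zero. Dropping zero bounds
// is an assumption made for this sketch.
func positiveBounds(bounds []float64) []float64 {
	var kept []float64
	for _, b := range bounds {
		if b > 0 {
			kept = append(kept, b)
		}
	}
	return kept
}

func main() {
	// Number of usable bounds left for a lone zero bound.
	fmt.Println(len(positiveBounds([]float64{0})))

	// Number of usable bounds left for the value tried in this thread.
	fmt.Println(len(positiveBounds([]float64{0.0001})))

	// Ten bounds starting at 1e-08, each ten times the previous one.
	fmt.Println(bucketsByTen(1e-08, 10))
}
```

Under that assumption, a lone 0 leaves no bound at all, which would be consistent with the "does not have at least one entry" message, while any positive bound leaves at least one.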

@markmandel markmandel removed the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Apr 14, 2020
@markmandel
Member

How is this PR looking? I think it's good to merge, but wanted to check first.

@aLekSer
Collaborator Author

aLekSer commented Apr 16, 2020

This PR will fix the errors in the Prometheus exporter, and I will produce a separate PR to fix the Stackdriver exporter by passing multiple parameters to view.Distribution(..). (Even using 0.0001 instead of 0 fixes Stackdriver.)

Member

@markmandel markmandel left a comment

This PR will fix the errors in the Prometheus exporter, and I will produce a separate PR to fix the Stackdriver exporter by passing multiple parameters to view.Distribution(..). (Even using 0.0001 instead of 0 fixes Stackdriver.)

SG! Approving now!

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aLekSer, cyriltovena, markmandel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cyriltovena,markmandel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-robot

New changes are detected. LGTM label has been removed.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: b5bdda16-c908-423d-bfff-08ad5dbda2be

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1446/head:pr_1446 && git checkout pr_1446
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.6.0-1a98bf6

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: d2b54fcf-3a46-4f08-b89d-7963dbdfb256

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1446/head:pr_1446 && git checkout pr_1446
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.6.0-b06d6d2

@markmandel markmandel merged commit 3d22c68 into googleforgames:master Apr 16, 2020
@markmandel markmandel added this to the 1.6.0 milestone Apr 16, 2020
ilkercelikyilmaz pushed a commit to ilkercelikyilmaz/agones that referenced this pull request Oct 23, 2020
* Update to OpenCensus version 0.22.3

This should fix Prometheus error messages on exporting with empty tags
list in Allocator Metrics.

* Updates to the Metrics

Fix tests.

Co-authored-by: Mark Mandel <markmandel@google.com>
Labels
approved area/operations Installation, updating, metrics etc kind/cleanup Refactoring code, fixing up documentation, etc size/XXL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update to opencensus v0.22
6 participants