
Add optional anonymous Tempo usage reporting #1481

Merged
merged 37 commits from usageStats into grafana:main on Jul 29, 2022

Conversation

zalegrala
Contributor

@zalegrala zalegrala commented Jun 9, 2022

What this PR does:

Here we implement an approach from the Loki squad for sending anonymous usage information to Grafana Labs to better understand the uses in the wild.

Fixes https://github.com/grafana/tempo-squad/issues/81
Related grafana/loki#5062

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@zalegrala zalegrala force-pushed the usageStats branch 2 times, most recently from 03da46a to 5d634ba on June 9, 2022
@zalegrala zalegrala changed the title Add Tempo usage stats Add optional anonymous Tempo usage reporting Jun 9, 2022
@zalegrala zalegrala marked this pull request as ready for review June 28, 2022 13:04
Member

@joe-elliott joe-elliott left a comment


some thoughts

Resolved review threads on: cmd/tempo/app/app.go, cmd/tempo/app/modules.go, docs/tempo/website/configuration/_index.md, go.mod, modules/distributor/receiver/shim.go
case "opencensus":
receiverOpencensusStats.Set(1)
case "kafka":
receiverKafkaStats.Set(1)
Member


default, log an error?

is there a way to associate a string label with this "stat"? so instead of a bunch of individual stats we could say receiverStats.Type(<name>).Set(1) or something?

Contributor

@mdisibio mdisibio Jun 30, 2022


is there a way to associate a string label with this "stat"? so instead of a bunch of individual stats we could say receiverStats.Type().Set(1) or something?

Discussed offline a bit, and the idea here is to centralize the stats more concretely in the usagestats package. Instead of calling something like usagestats.NewInt("feature_enabled_search"), int.Set(1) throughout the code base, this could be usagestats.SetFeatureEnabledSearch(1). The SetFeatureEnabledSearch method would call NewInt/Set.

Couple different ways to do this, but that's the idea.

Thoughts?

Contributor Author


The current design of the usagestats package is such that it doesn't care at all about what stats are being set throughout the code base. Creating some helper methods in the package might help readability on the implementation side, but I think the separation of concerns here is somewhat nice. If variable names with a Set() method are ugly to look at, we could also make package-local functions like setFeatureEnabledSearch() that would make the necessary calls.

With a package variable like featureEnabledSearch, calling featureEnabledSearch.Set(1) feels almost as readable as usagestats.SetFeatureEnabledSearch(), the difference being that the usagestats package now needs modification for each stat we want to include. Is there another advantage of moving the stats into the usagestats package that I'm not thinking of?

Contributor Author


Just playing with it a little: to move this noise out of the New function, we could use something like recordConfigBoolStat(cfg.AuthEnabled, statAuthEnabled). Additionally, we could move all the stats into a stats file within the packages that implement them.

Member


I'm fine with either way. I like @mdisibio's suggestion to consolidate the stats because it will help users of the software quickly see what we are reporting. I don't consider this a blocker.

Contributor Author


Fair enough. I see that it also might prevent sharing the package in the future, but I don't know how much to hold on to that idea.

Contributor Author

@zalegrala zalegrala Jul 13, 2022


I think we could move all the variables to modules.go if we want one place to look. We mostly need access to the config, which could be done with a helper on the Server. How does that sound? I think that might also help smooth out a dependency loop in backend creation when trying to implement backend.New() for shared use.

Resolved review threads on: pkg/usagestats/config.go, pkg/usagestats/reporter.go, tempodb/backend/raw.go
@zalegrala
Contributor Author

That's great, thanks for the feedback.

Contributor

@annanay25 annanay25 left a comment


Some concerns as discussed offline:

  1. I'm a little skeptical of adding a dependency on the backend storage to the usage reporting module, not only because of the complexity of running the module but also because an unexpected failure in the backend might result in a component crash. I believe we could hold the token in the memberlist kv for components, in which case it will only ever be recreated if all components are wiped and restarted, at which point we might as well call it a new cluster. (Maybe we can go ahead for now but ease the dependency at some point.)
  2. With the additional dependency on the backend, components like the distributor, which initially had no dependency on the backend, will have to be configured with the read token from GCS/S3/Azure. This is a breaking change and needs to be communicated clearly and updated in all docs.
  3. Schema changes: I'm not sure what happens if we change the usage report schema. Does an old payload get rejected with a 400? Or do we accept it with a 200 OK but drop it internally? How hard is it to change the associated dashboards?
  4. The backoff to send data and to wait for the token to be created should definitely be made configurable. In a large cluster with 100 components, it can result in pretty spiky memberlist traffic if all components query the kv store every second.

Resolved review thread on: pkg/usagestats/reporter.go
@zalegrala
Contributor Author

zalegrala commented Jul 6, 2022

Thanks for the review @annanay25.

  1. You are probably right that we should call it a new cluster if all components go down. I don't have strong feelings about it, but persistent storage for cluster identification seems somewhat nice to me, since all the trace data would persist after all systems were shut down. I also think about power outages, which are probably rare in the environments where we are most likely deployed. I can test without storage and start refactoring if we have strong concerns about the distributor having access to the backend.
  2. Good callout on the breaking change here. I'll make sure to update the docs if we choose to keep this dependency.
  3. This has come up a few times, but I'm not sure where to communicate this in docs. The actual JSON payload that gets sent has a few rigid items about the build information of the binary, so we would need to keep that structure in order to keep the BigQuery data useful long term. Additionally, there is a map[string]interface{} metrics section of the payload that we have full control over. When we go to query, if the string keys there change, then dashboards will need to be updated. Keep in mind that the speed at which those change is the speed at which folks upgrade, so ideally this remains somewhat stable. We can add new metrics easily enough. Nothing server-side receiving this payload will reject it based on the schema.
  4. (a) We can make the backoff for token retrieval configurable, but note that only the ingesters currently query memberlist for the data, so the load on memberlist is somewhat reduced. That is the current form, though; if we change things so that the backend is never used for the cluster ID, the load on memberlist would increase.
     (b) As for the data-sending interval, I'm slightly of two minds. On one hand, a configuration option seems good. On the other, hardcoding it means users can't change the load placed on our side receiving the payloads, since the interval at which reports are sent would be constant in the binary.

Contributor

@knylander-grafana knylander-grafana left a comment


Made a minor suggestion to the text.

@@ -17,6 +17,7 @@ This document explains the configuration options for Tempo as well as the detail
- [storage](#storage)
- [memberlist](#memberlist)
- [overrides](#overrides)
- [usage-report](#usage-report)
Contributor


nit: this is placed below search in the actual description

Comment on lines 21 to 23
f.DurationVar(&cfg.Backoff.MaxBackoff, util.PrefixConfig(prefix, "backoff.max_backoff"), time.Minute, "maximum time to back off retry")
f.DurationVar(&cfg.Backoff.MinBackoff, util.PrefixConfig(prefix, "backoff.min_backoff"), time.Second, "minimum time to back off retry")
f.IntVar(&cfg.Backoff.MaxRetries, util.PrefixConfig(prefix, "backoff.max_retries"), 0, "maximum number of times to retry")
Contributor


We should replace this with cfg.Backoff.RegisterFlagsWithPrefix("usage-report", f)

// RegisterFlagsWithPrefix for Config.
func (cfg *Config) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
	f.DurationVar(&cfg.MinBackoff, prefix+".backoff-min-period", 100*time.Millisecond, "Minimum delay when backing off.")
	f.DurationVar(&cfg.MaxBackoff, prefix+".backoff-max-period", 10*time.Second, "Maximum delay when backing off.")
	f.IntVar(&cfg.MaxRetries, prefix+".backoff-retries", 10, "Number of times to backoff and retry before failing.")
}

And then we can override the default if needed. But the flag names should be consistent

Contributor Author


This has been updated. How's that?

r, err := NewReporter(Config{Leader: true, Enabled: true}, kv.Config{
	Store: "",
}, objectClient, objectClient, log.NewLogfmtLogger(os.Stdout), prometheus.NewPedanticRegistry())
require.NoError(t, err)
Contributor


Shouldn't we error on a wrong k/v store value?

Contributor Author


I don't think so, because only the ingesters will require the k/v store. This will get checked when the ingester is started during the call to running().

Contributor

@annanay25 annanay25 left a comment


Left a final few comments but approving to unblock once done!

@zalegrala zalegrala merged commit a72a095 into grafana:main Jul 29, 2022
@zalegrala zalegrala deleted the usageStats branch July 29, 2022 15:48