question: what is rate_by_service? #3031

Closed
fommil opened this issue Feb 13, 2019 · 13 comments
Labels
team/agent-apm trace-agent

Comments


fommil commented Feb 13, 2019

The /v0.4/traces endpoint returns a JSON object containing a rate_by_service map.

Is there further documentation about what this is? It is unclear what "rate" means, if the client should act upon it, or what the unit is.

A line comment says "the recommended sampling rates for all services". Could you please provide more actionable information for a client author? e.g. what happens if the client ignores it?

(Some ambiguous interpretations I have are: if this drops below 1.0, the client should begin to sample/drop their traces before sending, or if the rate drops the client should adjust the frequency of submissions, or perhaps this is simply informational to let clients know how many traces are being rejected due to invalid content)

@fommil fommil changed the title from "Question:" to "Question: what is rate_by_services?" on Feb 13, 2019
@olivielpeau olivielpeau added the team/agent-apm trace-agent label Feb 13, 2019
Contributor

gbbr commented Feb 18, 2019

Hi Sam!

I'd be curious to hear what you are working on and why you aren't using one of our own tracers. We've not really spent time documenting our API endpoints given that they are still in the v0.x stage, but I'll try to give you an explanation here and provide example references in our own Go tracer.

The map is used to determine the behavior of the priority sampler, which uses these rates to set the sampling priority. Clients are expected to keep these rates and use them after the agent sends them.
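
As a rough illustration of "keeping" the rates, the sketch below decodes a response body and stores the map for later use. Only the rate_by_service field name and the key format come from this thread; the agentResponse and rateStore types, the choice to keep the old map when the field is absent, and the sample values are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// agentResponse models only the field discussed in this thread.
type agentResponse struct {
	RateByService map[string]float64 `json:"rate_by_service"`
}

// rateStore is a hypothetical holder for the most recently received rates.
type rateStore struct {
	mu    sync.RWMutex
	rates map[string]float64
}

// update decodes a response body and replaces the stored map.
// Assumption: a body without rate_by_service leaves the previous map in place.
func (s *rateStore) update(body []byte) error {
	var resp agentResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return err
	}
	if resp.RateByService == nil {
		return nil
	}
	s.mu.Lock()
	s.rates = resp.RateByService
	s.mu.Unlock()
	return nil
}

// snapshot returns a copy of the stored map, or nil if none was received yet.
func (s *rateStore) snapshot() map[string]float64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.rates == nil {
		return nil
	}
	out := make(map[string]float64, len(s.rates))
	for k, v := range s.rates {
		out[k] = v
	}
	return out
}

func main() {
	var store rateStore
	// Illustrative body; the rate values are made up.
	body := []byte(`{"rate_by_service":{"service:,env:":1,"service:web,env:prod":0.5}}`)
	if err := store.update(body); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(store.snapshot())
}
```

A client would call update after each successful submission, so that later traces are sampled with the latest rates.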

Using the map

Some simple rules for using the rates map:

  • The keys in the map are of the form service:<service_name>,env:<env_name>
    • For a trace with the service web and the environment (span tag named env) prod, it will use the rate from the key service:web,env:prod, when found.
    • For a trace with the service sql and no environment, it will use the rate found in the key service:sql,env:, when found.
    • If no entry is found in the map for a trace, it will use the default rate from the key service:,env:, which should be there in all responses from the agent.
  • When the client doesn't yet have a map of rates, it uses the rate 1 for all traces.
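
The lookup rules above can be sketched as a small function. The name rateFor and the sample rates are hypothetical, and falling back to 1 when even the default key is missing is an assumption; the thread only says the default key should always be present.

```go
package main

import "fmt"

// defaultKey is the catch-all entry described in the rules above.
const defaultKey = "service:,env:"

// rateFor returns the sampling rate for a trace with the given service and env.
func rateFor(rates map[string]float64, service, env string) float64 {
	if rates == nil {
		return 1 // no map received yet: use rate 1 for all traces
	}
	if r, ok := rates["service:"+service+",env:"+env]; ok {
		return r
	}
	if r, ok := rates[defaultKey]; ok {
		return r
	}
	return 1 // assumption: behave as if no map exists when the default key is missing
}

func main() {
	// Made-up rates for illustration.
	rates := map[string]float64{
		"service:,env:":        0.8,
		"service:web,env:prod": 0.5,
		"service:sql,env:":     0.25,
	}
	fmt.Println(rateFor(rates, "web", "prod"))      // exact match
	fmt.Println(rateFor(rates, "sql", ""))          // service without an environment
	fmt.Println(rateFor(rates, "cache", "staging")) // falls back to the default key
	fmt.Println(rateFor(nil, "web", "prod"))        // no map yet
}
```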

Using the rate

The formula used to sample by rate in all Datadog systems is consistent and is expected to be the same everywhere.

The sampling formula looks like this:

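import "math"

const knuthFactor = uint64(1111111111111111111)

// sampledByRate reports whether the trace with the given traceID should be
// sampled at the given rate. The multiplication wraps around on uint64
// overflow, which spreads trace IDs across the whole uint64 range.
func sampledByRate(traceID uint64, rate float64) bool {
	if rate < 1 {
		return traceID*knuthFactor < uint64(rate*math.MaxUint64)
	}
	return true
}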
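
To get a feel for what the rate means in this formula, the sketch below runs it over pseudo-random trace IDs and reports the share that is kept. The keptFraction helper, the seed and the sample size are arbitrary choices for the demonstration, not part of any tracer.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

const knuthFactor = uint64(1111111111111111111)

// sampledByRate is the formula quoted above.
func sampledByRate(traceID uint64, rate float64) bool {
	if rate < 1 {
		return traceID*knuthFactor < uint64(rate*math.MaxUint64)
	}
	return true
}

// keptFraction reports the share of n pseudo-random trace IDs that
// sampledByRate keeps at the given rate (hypothetical helper).
func keptFraction(rate float64, n int, seed int64) float64 {
	r := rand.New(rand.NewSource(seed))
	kept := 0
	for i := 0; i < n; i++ {
		if sampledByRate(r.Uint64(), rate) {
			kept++
		}
	}
	return float64(kept) / float64(n)
}

func main() {
	for _, rate := range []float64{0, 0.3, 0.5, 1} {
		fmt.Printf("rate=%.1f kept=%.3f\n", rate, keptFraction(rate, 100000, 1))
	}
}
```

The kept share should land close to the rate itself. Because the decision depends only on the trace ID and the rate, any service that sees the same trace ID with the same rate reaches the same decision.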

Applying the sampling decision

Based on the value returned by the rate sampler, sampling priority tags are applied:

  • The span metric _sampling_priority_v1 will hold the value float64(1.0) if the trace was sampled.
  • The span metric _sampling_priority_v1 will hold the value float64(0) if the trace was not sampled.
  • The span metric _sampling_priority_rate_v1 will contain the rate at which the trace was sampled (or not).

These decisions also need to be propagated in HTTP requests as X-Datadog-Sampling-Priority.


You may check the implementation in our Go tracer as an example.

@gbbr gbbr changed the title from "Question: what is rate_by_services?" to "question: what is rate_by_service?" on Feb 18, 2019
Author

fommil commented Feb 18, 2019

Thanks @gbbr

"Clients are expected to keep these rates and use them after the agent sends them" is the bit I am most interested in. Could you please provide more actionable information for a client author? e.g. what happens if the client ignores it?

We are commercial customers of Datadog and we had a discussion with one of your "Customer Success Managers", where it was decided that we needed to build our own client for Haskell, because Datadog does not provide one.

Contributor

gbbr commented Feb 18, 2019

Nothing will happen if you ignore it. But setting the sampling priority metric on traces (and propagating it cross-wire) is important. If you don't, you'll end up seeing only pieces of distributed traces because the agent (and backend) will sample them differently. The rates map provides a good guide as to how much interest should be put into a specific service and is backed by algorithms that aim to guarantee a good variety of traces will end up in the application. I don't see why you'd want to ignore it.

If you choose not to use the rate sampling to apply priority sampling, you can choose your own logic and that's no problem (to my knowledge), but definitely don't set a sampling priority of 1 (auto keep) or 2 (user keep) on all traces, because that will quickly result in you reaching your account quota limits.

Does this help?

Author

fommil commented Feb 18, 2019

thanks @gbbr ... I think there is still an assumption that the words "sample" and "rate" are known quantities, and it is unclear whether this is reporting actual or desired values.

Let's get to some specifics:

  • what is the unit of the "rate"? e.g. is it number of traces per second, spans per second, submissions per PUT request, PUT requests per second, or something else?
  • is the rate reporting on current values or desired values?

As I understand it, from reading the source code of this repository, the agent is enforcing that the value is [0.0, 1.0] so I don't know how a value of 2 is possible. Nor do I understand why you are saying "definitely don't set priority sampling of..." since the rate_by_service is a value returned to us by the agent, and we are writing the client.

Contributor

gbbr commented Feb 18, 2019

> I think there is still an assumption that the words "sample" and "rate" are known quantities

Sorry about that. That is many times the case when talking to someone who's been in a domain for too long, even though I try to avoid it.

> what is the unit of the "rate"

It's the rate parameter passed to the formula described in my previous comment in the "Using the rate" section. I'm a bit confused as to which part is unclear. It is the specific rate that needs to be applied in the formula when sampling a trace having the given service and environment.

> As I understand it, from reading the source code of this repository, the agent is enforcing that the value is [0.0, 1.0] so I don't know how a value of 2 is possible. Nor do I understand why you are saying "definitely don't set priority sampling of..." since the rate_by_service is a value returned to us by the agent, and we are writing the client.

I've created some confusion here. The sampling rate can only be in the range [0.0, 1.0], as you say. For example, 1.0 means 100% sampling and 0.3 means 30% sampling. The other value, which is the sampling priority, can be in the range [-2, 2] as shown here.

The sampling rate: determines the rate at which a trace is sampled.
The sampling priority: reflects the sampling decision after the rate has been applied.

Feel free to reach out to me on our public Slack, it might be easier.

Author

fommil commented Feb 18, 2019

thanks again @gbbr . To be clear, I'm not talking about sampling priorities at all, I'm only interested in the rate_by_service field in the response from the agent to the client.

With that in mind, let's return to the specifics:

  • what is the unit of the "rate"? e.g. is it number of traces per second, number of spans per second, bytes submitted per second, spans/traces per PUT request, PUT requests per second, percentage of valid or retained spans, or something else? I have read your comment in detail, and the answer to this question is still not clear to me.
  • is the "rate" reporting on current values, or is it requesting desired values? e.g. by encouraging the client to either increase or decrease its rate of submissions.

Author

fommil commented Feb 20, 2019

I think I'm just going to take this away

> Nothing will happen if you ignore it.

and ignore the field. Thanks anyways.

@fommil fommil closed this as completed Feb 20, 2019
Contributor

gbbr commented Feb 20, 2019

Dear Sam,

The best and easiest way to write your own tracer is to take inspiration from an existing one. The Go and Python tracers are good choices.

And yes, it'll be fine to ignore it for now.

Author

fommil commented Feb 20, 2019

thanks @gbbr I really don't think I need to read the other examples, the v0.3/traces API is very simple and I read this datadog-agent code to understand how all the fields are validated (btw, a 10 minute maximum span length may cause problems for some of our longer running processes, you may wish to make that configurable, and some of the other limitations such as 2000-01-01 seem very arbitrary to me and could very well impact test scenarios. What constitutes a valid text or tag field is far too complex to reproduce on the client side, I'll have to pick a stricter subset when encoding it).

To implement the v0.4 API, all that I need to know is a technical definition of what "rate" means and how to act upon it. Given that "no action" is fine, understanding what it actually means seems to be irrelevant. I should point out that I still don't know if "rate" means:

  • number of traces per second
  • number of spans per second
  • bytes submitted per second
  • spans or traces per PUT request
  • PUT requests per second
  • percentage of valid or retained spans
  • or something else

There is an implicit assumption that this is obvious. If the meaning is in this list, I'd very much appreciate it if you could point it out. From a client implementer point of view, I can assure you that it is not obvious.

Contributor

gbbr commented Feb 20, 2019

It is none of that. Rate is just a number that clients are meant to use with the formula I've described above to mark a trace with a sampling priority in order to consistently get full distributed traces.

Clients are meant to decide whether a trace will be kept or not (in other words "sampled or not") by using that formula and the rate recommended by the agent.

If you look at how the formula works, you'll soon understand that a rate of 1 means 100% and a rate of 0.5 means 50% chance of being sampled.

I'm sorry but I'm not able to tell where the misunderstanding is and which part it is that you don't understand.

Contributor

gbbr commented Feb 20, 2019

> how to act upon it

My very first comment is a detailed step-by-step guide on how to use the endpoint. I'm lost as to which part is confusing.

Author

fommil commented Feb 20, 2019

I'll jump on your slack as you suggested @gbbr ... perhaps this is better in realtime, although I was hoping to keep an audit of the conversation for our records.

Author

fommil commented Feb 20, 2019

Conclusion from slack:

"rate" is a number in the range [0.0, 1.0] indicating the fraction of traces that the agent would like to be kept for a given service (0.0 meaning drop everything, 1.0 meaning keep everything).

Clients should act upon rate by setting a _sampling_priority_v1 field in metrics of the root span, which is an enum [-1, 0, 1, 2] indicating: -1) that the agent should drop the trace, 0) the agent may drop the trace, 1) the agent should try to keep the trace, or 2) the agent must keep the trace (subject to account limits).
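
The four priority values in this conclusion could be modelled as below. The numeric values and their meanings follow the conclusion above, and "auto keep" and "user keep" are the terms used earlier in the thread for 1 and 2. The Go identifiers, including the "reject" names for -1 and 0, are hypothetical.

```go
package main

import "fmt"

// samplingPriority is the value stored in the _sampling_priority_v1 metric.
type samplingPriority int

const (
	priorityUserReject samplingPriority = -1 // the agent should drop the trace
	priorityAutoReject samplingPriority = 0  // the agent may drop the trace
	priorityAutoKeep   samplingPriority = 1  // the agent should try to keep the trace
	priorityUserKeep   samplingPriority = 2  // the agent must keep the trace (subject to account limits)
)

// fromSampler maps the rate sampler's decision to a priority.
// Only the two "auto" values are produced by the rate-based decision.
func fromSampler(sampled bool) samplingPriority {
	if sampled {
		return priorityAutoKeep
	}
	return priorityAutoReject
}

// describe returns the meaning given in the conclusion above.
func (p samplingPriority) describe() string {
	switch p {
	case priorityUserReject:
		return "agent should drop the trace"
	case priorityAutoReject:
		return "agent may drop the trace"
	case priorityAutoKeep:
		return "agent should try to keep the trace"
	case priorityUserKeep:
		return "agent must keep the trace (subject to account limits)"
	default:
		return "unknown priority"
	}
}

func main() {
	all := []samplingPriority{priorityUserReject, priorityAutoReject, priorityAutoKeep, priorityUserKeep}
	for _, p := range all {
		fmt.Printf("%2d: %s\n", int(p), p.describe())
	}
	fmt.Println("sampled ->", int(fromSampler(true)), "not sampled ->", int(fromSampler(false)))
}
```

As noted earlier in the thread, setting 1 or 2 on every trace would quickly exhaust account quota, which is why the rate-based decision only chooses between 0 and 1.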
