question: what is rate_by_service? #3031

fommil · 2019-02-13T10:28:12Z

The /v0.4/traces endpoint returns a JSON object containing a rate_by_service map.

Is there further documentation about what this is? It is unclear what "rate" means, if the client should act upon it, or what the unit is.

A line comment says "the recommended sampling rates for all services". Could you please provide more actionable information for a client author? e.g. what happens if the client ignores it?

(Some ambiguous interpretations I have are: if this drops below 1.0, the client should begin to sample/drop their traces before sending, or if the rate drops the client should adjust the frequency of submissions, or perhaps this is simply informational to let clients know how many traces are being rejected due to invalid content)


gbbr · 2019-02-18T11:49:59Z

Hi Sam!

I'd be curious to hear what you are working on and why you aren't using one of our own tracers. We've not really spent time documenting our API endpoints given that they are still in the v0.x stage, but I'll try to give you an explanation here and provide example references in our own Go tracer.

The map is used to determine the behavior of the priority sampler, which uses these rates to set the sampling priority. Clients are expected to keep these rates and use them after the agent sends them.

Using the map

Some simple rules for using the rates map:

- The keys in the map are of the form service:<service_name>,env:<env_name>.
  - For a trace with the service web and the environment (span tag named env) prod, the client uses the rate from the key service:web,env:prod, when found.
  - For a trace with the service sql and no environment, it uses the rate found in the key service:sql,env:, when found.
  - If no entry is found in the map for a trace, it uses the default rate from the key service:,env:, which should be present in all responses from the agent.
- When the client doesn't yet have a map of rates, it uses the rate 1 for all traces.
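
The lookup rules above can be sketched roughly as follows. This is only an illustration: the function name rateFor and the sample rates are made up for the example, and the final fallback to 1 when even the default key is missing is an assumption rather than documented behaviour.

```go
package main

import "fmt"

// defaultKey is the fallback entry that the agent is expected to always send.
const defaultKey = "service:,env:"

// rateFor (hypothetical helper) returns the sampling rate for a trace with the
// given service and environment. rates is the last rate_by_service map received
// from the agent; it may be nil if no response has been received yet.
func rateFor(rates map[string]float64, service, env string) float64 {
	// Lookups on a nil map are safe in Go and simply report "not found".
	if r, ok := rates["service:"+service+",env:"+env]; ok {
		return r
	}
	if r, ok := rates[defaultKey]; ok {
		return r
	}
	// No map yet (or, by assumption, no default entry): keep everything.
	return 1
}

func main() {
	rates := map[string]float64{
		"service:web,env:prod": 0.25,
		"service:sql,env:":     0.5,
		"service:,env:":        0.75,
	}
	fmt.Println(rateFor(rates, "web", "prod"))    // exact match
	fmt.Println(rateFor(rates, "sql", ""))        // service without environment
	fmt.Println(rateFor(rates, "api", "staging")) // falls back to the default key
	fmt.Println(rateFor(nil, "web", "prod"))      // no map received yet
}
```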

Using the rate

The formula used to sample by rate in all Datadog systems is consistent and is expected to be the same everywhere.

The sampling formula looks like this:

const knuthFactor = uint64(1111111111111111111)

// sampledByRate verifies if the trace with the given traceID should be sampled.
func sampledByRate(traceID uint64, rate float64) bool {
	if rate < 1 {
		return traceID*knuthFactor < uint64(rate*math.MaxUint64)
	}
	return true
}
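
To get a feel for how this behaves, here is a small self-contained check that repeats the function above and counts how many pseudo-random trace IDs are kept at a few rates. The helper keptFraction, the seed and the sample size are made up for the example; it assumes trace IDs are spread uniformly over the 64-bit range.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

const knuthFactor = uint64(1111111111111111111)

// sampledByRate is copied from the snippet above.
func sampledByRate(traceID uint64, rate float64) bool {
	if rate < 1 {
		return traceID*knuthFactor < uint64(rate*math.MaxUint64)
	}
	return true
}

// keptFraction (hypothetical helper) draws n pseudo-random trace IDs and
// returns the fraction for which sampledByRate reports true.
func keptFraction(rate float64, n int, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed))
	kept := 0
	for i := 0; i < n; i++ {
		if sampledByRate(rng.Uint64(), rate) {
			kept++
		}
	}
	return float64(kept) / float64(n)
}

func main() {
	for _, rate := range []float64{0.1, 0.5, 1.0} {
		fmt.Printf("rate=%.1f kept fraction=%.3f\n", rate, keptFraction(rate, 100000, 1))
	}
	// The decision depends only on the trace ID and the rate, so every service
	// that sees the same trace ID with the same rate reaches the same answer.
	fmt.Println(sampledByRate(12345, 0.5) == sampledByRate(12345, 0.5))
}
```

The kept fraction should land close to the rate, and the last line illustrates why the same formula is expected everywhere: the result is a pure function of the trace ID.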

Applying the sampling decision

Based on the value returned by the rate sampler, sampling priority tags are applied:

The span metric _sampling_priority_v1 will hold the value float64(1.0) if the trace was sampled.
The span metric _sampling_priority_v1 will hold the value float64(0) if the trace was not sampled.
The span metric _sampling_priority_rate_v1 will contain the rate at which the trace was sampled (or not).

These decisions also need to be propagated in HTTP requests as X-Datadog-Sampling-Priority.
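
As a rough sketch of the two steps above: the span type, applyPriority and inject below are hypothetical names invented for the example, and sending the header value as a plain integer string is an assumption. Only the metric names and the header name are taken from this comment.

```go
package main

import (
	"fmt"
	"math"
	"net/http"
	"strconv"
)

const knuthFactor = uint64(1111111111111111111)

// sampledByRate is copied from the "Using the rate" section.
func sampledByRate(traceID uint64, rate float64) bool {
	if rate < 1 {
		return traceID*knuthFactor < uint64(rate*math.MaxUint64)
	}
	return true
}

// span is a hypothetical, minimal span type used only for illustration.
type span struct {
	TraceID uint64
	Metrics map[string]float64
}

// applyPriority records the sampling decision and the rate on the root span
// and returns the priority (1 if sampled, 0 otherwise).
func applyPriority(root *span, rate float64) int {
	priority := 0
	if sampledByRate(root.TraceID, rate) {
		priority = 1
	}
	root.Metrics["_sampling_priority_v1"] = float64(priority)
	root.Metrics["_sampling_priority_rate_v1"] = rate
	return priority
}

// inject copies the decision onto an outgoing HTTP request so that downstream
// services can reuse it. The integer-string encoding is an assumption.
func inject(req *http.Request, priority int) {
	req.Header.Set("X-Datadog-Sampling-Priority", strconv.Itoa(priority))
}

func main() {
	root := &span{TraceID: 42, Metrics: map[string]float64{}}
	priority := applyPriority(root, 1.0)

	// The request is only built, never sent.
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		panic(err)
	}
	inject(req, priority)
	fmt.Println(root.Metrics, req.Header.Get("X-Datadog-Sampling-Priority"))
}
```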

You may check the implementation in our Go tracer as an example.

fommil · 2019-02-18T12:00:19Z

Thanks @gbbr

"Clients are expected to keep these rates and use them after the agent sends them" is the bit I am most interested in. Could you please provide more actionable information for a client author? e.g. what happens if the client ignores it?

We are commercial customers of datadog and we had a discussion with one of your "Customer Success Managers", where it was decided that we needed to build our own client for Haskell, because datadog does not provide one.

gbbr · 2019-02-18T12:17:53Z

Nothing will happen if you ignore it. But setting the sampling priority metric on traces (and propagating it cross-wire) is important. If you don't, you'll end up seeing only pieces of distributed traces because the agent (and backend) will sample them differently. The rates map provides a good guide as to how much interest should be put into a specific service and is backed by algorithms that aim to guarantee a good variety of traces will end up in the application. I don't see why you'd want to ignore it.

If you choose not to use the rate sampling to apply priority sampling, you can choose your own logic and that's no problem (to my knowledge), but definitely don't set a sampling priority of 1 (auto keep) or 2 (user keep) on all traces, because that will quickly result in you reaching your account quota limits.

Does this help?

fommil · 2019-02-18T13:35:20Z

thanks @gbbr ... I think there is still an assumption that the words "sample" and "rate" are known quantities, and it is unclear whether this is reporting actual or desired values.

Let's get to some specifics:

what is the unit of the "rate". e.g. is it number of traces per second, spans per second, submissions per PUT request, PUT requests per second, or something else?
is the rate reporting on current values or desired values?

As I understand it, from reading the source code of this repository, the agent is enforcing that the value is [0.0, 1.0] so I don't know how a value of 2 is possible. Nor do I understand why you are saying "definitely don't set priority sampling of..." since the rate_by_service is a value returned to us by the agent, and we are writing the client.

gbbr · 2019-02-18T13:54:08Z

> I think there is still an assumption that the words "sample" and "rate" are known quantities

Sorry about that. That is often the case when talking to someone who's been in a domain for too long, even though I try to avoid it.

> what is the unit of the "rate"

It's the parameter of rate given to the formula described in my previous comment in the "Using the rate" section. I'm a bit confused as to which part is unclear. It is the specific rate that needs to be applied in the formula when sampling a trace having the given service and environment.

> As I understand it, from reading the source code of this repository, the agent is enforcing that the value is [0.0, 1.0] so I don't know how a value of 2 is possible. Nor do I understand why you are saying "definitely don't set priority sampling of..." since the rate_by_service is a value returned to us by the agent, and we are writing the client.

I've caused some confusion here. The sampling rate can only be in the range [0.0, 1.0], as you say. For example, 1.0 means 100% sampling and 0.3 means 30% sampling. The other value, the sampling priority, can be in [-2, 2] as shown here.

The sampling rate: determines the rate at which a trace is sampled.
The sampling priority: reflects the sampling decision after the rate has been applied.

Feel free to reach out to me on our public Slack, it might be easier.

fommil · 2019-02-18T14:03:16Z

thanks again @gbbr . To be clear, I'm not talking about sampling priorities at all, I'm only interested in the rate_by_service field in the response from the agent to the client.

With that in mind, let's return to the specifics:

what is the unit of the "rate"? e.g. is it number of traces per second, number of spans per second, bytes submitted per second, spans/traces per PUT request, PUT requests per second, percentage of valid or retained spans, or something else? I have read your comment in detail, and the answer to this question is still not clear to me.
is the "rate" reporting on current values, or is it requesting desired values? e.g. by encouraging the client to either increase or decrease its rate of submissions.

fommil · 2019-02-20T09:16:20Z

I think I'm just going to take this away

> Nothing will happen if you ignore it.

and ignore the field. Thanks anyways.

gbbr · 2019-02-20T09:20:33Z

Dear Sam,

The best and easiest way to write your own tracer is to take inspiration from an existing one. The Go and Python ones are good choices.

And yes, it'll be fine to ignore it for now.

fommil · 2019-02-20T09:38:49Z

thanks @gbbr I really don't think I need to read the other examples, the v0.3/traces API is very simple and I read this datadog-agent code to understand how all the fields are validated (btw, a 10 minute maximum span length may cause problems for some of our longer running processes, you may wish to make that configurable, and some of the other limitations such as 2000-01-01 seem very arbitrary to me and could very well impact test scenarios. What constitutes a valid text or tag field is far too complex to reproduce on the client side, I'll have to pick a stricter subset when encoding it).

To implement the v0.4 API, all that I need to know is a technical definition of what "rate" means and how to act upon it. Given that "no action" is fine, understanding what it actually means seems to be irrelevant. I should point out that I still don't know if "rate" means:

number of traces per second
number of spans per second
bytes submitted per second
spans or traces per PUT request
PUT requests per second
percentage of valid or retained spans
or something else

There is an implicit assumption that this is obvious. If the meaning is in this list, I'd very much appreciate it if you could point it out. From a client implementer point of view, I can assure you that it is not obvious.

gbbr · 2019-02-20T09:48:34Z

It is none of that. Rate is just a number that clients are meant to use with the formula I've described above to mark a trace with a sampling priority in order to consistently get full distributed traces.

Clients are meant to decide whether a trace will be kept or not (in other words "sampled or not") by using that formula and the rate recommended by the agent.

If you look at how the formula works, you'll soon understand that a rate of 1 means 100% and a rate of 0.5 means 50% chance of being sampled.
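
One way to spell out that reasoning, assuming trace IDs t are drawn uniformly from the 64-bit range (an assumption about the client, not something guaranteed by the formula itself): the uint64 multiplication wraps modulo 2^64, and because the constant K is odd, multiplying by it is a one-to-one mapping on 64-bit values, so the product is uniformly distributed as well. The comparison then keeps roughly a fraction r of all trace IDs:

```latex
\[
\mathrm{keep}(t, r) \iff (t \cdot K) \bmod 2^{64} < \lfloor r \cdot 2^{64} \rfloor,
\qquad K = 1111111111111111111,\quad 0 \le r < 1
\]
\[
\Pr[\mathrm{keep}] = \frac{\lfloor r \cdot 2^{64} \rfloor}{2^{64}} \approx r,
\qquad \text{e.g. } r = 0.5 \Rightarrow \Pr[\mathrm{keep}] = 0.5
\]
```

The threshold is written as r times 2^64 because math.MaxUint64 rounds to 2^64 when converted to float64; for r of 1 or more the function returns true without comparing.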

I'm sorry but I'm not able to tell where the misunderstanding is and which part it is that you don't understand.

gbbr · 2019-02-20T09:52:26Z

> how to act upon it

My very first comment is a detailed step-by-step guide on how to use the endpoint. I'm lost as to which part is confusing.

fommil · 2019-02-20T10:01:50Z

I'll jump on your slack as you suggested @gbbr ... perhaps this is better in realtime, although I was hoping to keep an audit of the conversation for our records.

fommil · 2019-02-20T10:48:55Z

Conclusion from slack:

"rate" is a number in the range [0.0, 1.0] indicating the fraction of traces that the agent would like to be kept for a given service (0.0 meaning dropping everything, 1.0 meaning keeping everything).

Clients should act upon rate by setting a _sampling_priority_v1 field in metrics of the root span, which is an enum [-1, 0, 1, 2] indicating: -1) that the agent should drop the trace, 0) the agent may drop the trace, 1) the agent should try to keep the trace, or 2) the agent must keep the trace (subject to account limits).
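
For the record, a minimal sketch of that enum in Go. The type, constant and function names are hypothetical, and treating -1 and 2 as explicit user decisions is an assumption based on the "user keep" wording earlier in the thread; only the numeric values and their meanings come from the summary above.

```go
package main

import "fmt"

// Priority is the value stored in the _sampling_priority_v1 metric of the root span.
type Priority int

const (
	UserReject Priority = -1 // the agent should drop the trace
	AutoReject Priority = 0  // the agent may drop the trace
	AutoKeep   Priority = 1  // the agent should try to keep the trace
	UserKeep   Priority = 2  // the agent must keep the trace (subject to account limits)
)

// decide returns the priority for a trace. manual, when non-nil, is an explicit
// keep/drop request and (by assumption) overrides the rate-based decision.
func decide(sampledByRate bool, manual *bool) Priority {
	if manual != nil {
		if *manual {
			return UserKeep
		}
		return UserReject
	}
	if sampledByRate {
		return AutoKeep
	}
	return AutoReject
}

func main() {
	keep := true
	fmt.Println(decide(true, nil), decide(false, nil), decide(false, &keep))
}
```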

fommil changed the title ~~Question:~~ Question: what is rate_by_services? Feb 13, 2019

olivielpeau added the team/agent-apm trace-agent label Feb 13, 2019

gbbr changed the title ~~Question: what is rate_by_services?~~ question: what is rate_by_service? Feb 18, 2019

fommil closed this as completed Feb 20, 2019

fommil mentioned this issue Mar 26, 2019

question: how to send correct payloads to trace-agent #3207

Closed

duncanpharvey mentioned this issue Jul 24, 2024

[Serverless Mini Agent] Run in Azure Spring Apps DataDog/libdatadog#547

Merged
