
Adding Partitioning to Datadog Sink #88

Merged 1 commit into main on Jan 9, 2023

Conversation

@Areson (Collaborator) commented Jan 3, 2023

Added options for compression and partitioning to the Datadog sink when the API method is used. When the sink receives an error indicating the payload is too large, it attempts to calculate a partition size that satisfies the size limit and re-sends the metrics. If it exhausts its options and still cannot send the metrics, the attempt fails.

Added the option to turn on compression for the API, which allows much larger payloads by default.

@Areson requested a review from daniel-nichter on January 3, 2023 20:57
@Areson force-pushed the ioberst-dd-sink-message-too-large branch from e69b2bd to 711ad7f on January 3, 2023 20:57

## API Considerations

When using the API directly, the sink may run into issues when sending large payloads, as the Datadog API [restricts the payload to a maximum of 500Kib](https://docs.datadoghq.com/api/latest/metrics/). The DogStatsD option avoids this because it always sends metrics individually. By default, the API option automatically adjusts the size of the payloads it sends when it encounters an error indicating a payload is too large. You can also explicitly set the number of metrics to send at once with `api-partition-size` and set `api-auto-partition` to `false`. You can optionally turn on compression for the API, in which case Datadog accepts any payload whose uncompressed size is less than 5Mib.
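For reference, a minimal sketch of how these options could look in a sink config. The nesting under `sinks:` mirrors the config excerpt reviewed below; the values shown, and the assumption that the datadog options sit directly under the sink entry, are illustrative only:

```yaml
sinks:
  datadog:
    metric-translator: ""
    metric-prefix: ""
    dogstatsd-host: ""
    api-use-compression: "true"   # accept payloads up to 5 MiB uncompressed
    api-auto-partition: "false"   # turn off automatic payload splitting
    api-partition-size: "500"     # send at most 500 metrics per API request
```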
@daniel-nichter (Member) commented:
500 KiB -- bytes, not bits, I presume?

@Areson force-pushed the ioberst-dd-sink-message-too-large branch 2 times, most recently from e1aba3b to 80d352e on January 4, 2023 21:12
@@ -32,4 +32,5 @@ sinks:
metric-translator: ""
metric-prefix: ""
dogstatsd-host: ""
```
api-use-compression: "true"
@daniel-nichter (Member) commented:
api-compression (or api-compress if you want to mimic the DD client). Blip doesn't use words like "enable", "use", etc. The only such on/off type word is the consistent prefix disable-....

sink/datadog.go Outdated
@@ -104,6 +119,9 @@ func NewDatadog(monitorId string, opts, tags map[string]string, httpClient *http
}
d.prefix = v

case "api-use-compression":
d.compress = strings.ToLower(v) == "true"
@daniel-nichter (Member) commented Jan 5, 2023.
sink/datadog.go Outdated
return pErr
}

if err != nil {
@daniel-nichter (Member) commented:
This needs to be handled before checking r.StatusCode because err may be a low-level error (e.g. a network issue), in which case there's no *http.Response. Afaik, if err is nil, then there must be an *http.Response.
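A minimal sketch of the ordering being asked for here. The client-call shape, interface, and function names are assumptions, not the sink's actual code:

```go
package sink

import (
	"context"
	"net/http"
)

// metricsSubmitter is an assumed shape: many Datadog client methods return
// (result, *http.Response, error).
type metricsSubmitter interface {
	SubmitMetrics(ctx context.Context, payload []byte) (interface{}, *http.Response, error)
}

// send sketches the ordering: handle err before ever reading r.StatusCode,
// because a low-level failure (e.g. a network issue) can leave r nil.
func send(ctx context.Context, api metricsSubmitter, payload []byte) error {
	_, r, err := api.SubmitMetrics(ctx, payload)
	if err != nil {
		if r != nil && r.StatusCode == http.StatusRequestEntityTooLarge {
			// Payload too large: the sink would repartition and retry here.
			return err
		}
		return err // any other error, including ones with no *http.Response at all
	}
	return nil
}
```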

sink/datadog.go Outdated
monitorId: monitorId,
tags: tagList,
resources: resources,
partitionSize: math.MaxInt32,
@daniel-nichter (Member) commented:
Let's clarify "partition size" for future engineers who'll have to understand this code. The first confusing thing is partition (size of) what? We've got MAX_PAYLOAD_SIZE which is clear (max payload sizes are a common thing). So what's not clear is that partitionSize is actually something like "metrics per Send request", if I'm not mistaken? Then that plus starting with MaxInt32 makes it seems like we'll have to partition a lot (when needed) because 2^32= ~4 billion, yet we only deal with metrics in the 10s or 1000s.

@Areson (Collaborator, Author) replied:
I can update the name. My thought with starting at MaxInt32 was that we begin with no partitioning of the metrics. We only start partitioning if we get an error response back from Datadog indicating the payload is too large. With compression on, that is much less likely to happen in general. MaxInt32 is just a nice default instead of arbitrarily picking a value I thought was big enough.
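A sketch of the clearer framing the thread lands on. The field name maxMetricsPerRequest matches the later excerpts in this PR; the struct name, constructor, and comments are illustrative assumptions:

```go
package sink

import "math"

// datadogSink is an illustrative stand-in for the sink struct; only the field
// discussed in this thread is shown.
type datadogSink struct {
	// maxMetricsPerRequest caps how many metrics go into one API request.
	// It starts at math.MaxInt32, meaning "no partitioning", and is lowered
	// only after Datadog rejects a payload as too large.
	maxMetricsPerRequest int
}

func newDatadogSink() *datadogSink {
	return &datadogSink{maxMetricsPerRequest: math.MaxInt32}
}
```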

sink/datadog.go Outdated
}

return err
// Figure out how many metrics can fit inside the max payload, but pad it slightly to control for headers, etc.
estPartitionSize := (MAX_PAYLOAD_SIZE - 1000) / size
@daniel-nichter (Member) commented:
Here's another example of where partition size is confusing: the comment says "how many metrics", but it's calculating a value in bytes (based on MAX_PAYLOAD_SIZE and the current payload size). I don't see where the code does something like "N metrics = B bytes, so the new partition size is some fraction of N".

sink/datadog.go Outdated

if estPartitionSize >= s.partitionSize {
// The new partition size is greater than what we have now, so let's reduce it by a small factor to make sure we can send metrics properly
estPartitionSize = int(float32(s.partitionSize) * .9)
@daniel-nichter (Member) commented:
This comment/code will need some clarification for future engineers. What does "properly" mean? And why 0.9? Maybe something like:

// Reduce partition size by 10% as a heuristic/guess for finding the maximum partition size
// (metrics per send) that'll fit within the DD payload size limit
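Putting the two points above together, a sketch of the recalculation expressed as metrics-per-request, with the 10% reduction spelled out. The function name, signature, constant value, and the reading of `size` as the serialized size of one sampled metric are assumptions for illustration:

```go
package sink

import "fmt"

// MAX_PAYLOAD_SIZE is the Datadog payload cap in bytes; the value here is an
// assumption for the sketch (the sink defines its own constant).
const MAX_PAYLOAD_SIZE = 512 * 1024

// nextMaxMetricsPerRequest estimates a new metrics-per-request limit after a
// "payload too large" error. sampleSize is the serialized size in bytes of one
// sampled metric; current is the limit that just produced the oversized payload.
func nextMaxMetricsPerRequest(current, sampleSize int) (int, error) {
	if sampleSize < 1 {
		sampleSize = 1 // guard against divide-by-zero in this sketch
	}

	// Pad the cap slightly to leave room for headers, the JSON envelope, etc.,
	// then divide by the per-metric size to get metrics per request.
	est := (MAX_PAYLOAD_SIZE - 1000) / sampleSize

	if est >= current {
		// The estimate is no smaller than the limit that just failed, so reduce
		// the current limit by 10% as a heuristic for finding one that fits.
		est = int(float32(current) * 0.9)
	}
	if est < 1 {
		return 0, fmt.Errorf("cannot partition metrics to fit within %d bytes", MAX_PAYLOAD_SIZE)
	}
	return est, nil
}
```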

sink/datadog.go Outdated
}

blip.Debug("error sending data points to Datadog: %s", err)
blip.Debug("error sending data points to Datadog: Http Response - %s", r)
@daniel-nichter (Member) commented:
Combine these.

@daniel-nichter (Member) commented:
Oh actually, you need a return nil here to handle the other error case so you can return nil below. I.e., as the code is written now, it's mixing boilerplates, which is confusing. I.e., the boilerplate

err := f()
if err != nil {
    return err
}

isn't happening.

sink/datadog.go Outdated
blip.Debug("error sending data points to Datadog: Http Response - %s", r)
}

return err
@daniel-nichter (Member) commented:
return nil because the if above already checked that it's not nil, so an explicit return nil is idiomatic to show that func has reached a successful end/return.

@daniel-nichter (Member) commented:
By contrast, this kind of return err is used when a func returns the last return val of its last call without checking that return value.
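Taken together, the comments in this thread suggest a shape roughly like the following sketch. The dataPoint type and the submit and resendPartitioned helpers are hypothetical stand-ins, continuing the sink struct sketched above:

```go
// Sketch only: every failure branch returns explicitly, so the success path
// ends with an explicit, idiomatic `return nil`.
func (s *datadogSink) sendAPI(ctx context.Context, dp []dataPoint) error {
	r, err := s.submit(ctx, dp) // hypothetical wrapper around the Datadog client call
	if err != nil {
		if r != nil && r.StatusCode == http.StatusRequestEntityTooLarge {
			// Payload too large: shrink metrics-per-request and retry in batches.
			return s.resendPartitioned(ctx, dp) // hypothetical helper
		}
		// Any other error: one combined debug line, then return the error.
		blip.Debug("error sending data points to Datadog: %s (response: %+v)", err, r)
		return err
	}
	return nil
}
```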

sink/datadog.go Outdated
oldMaxMetricsPerRequest := s.maxMetricsPerRequest

// Sample one of the metrics and try calculating a new maxMetricsPerRequest
if pErr := s.updateMaxMetricsPerRequest(dp[0], len(dp)); pErr != nil {
@daniel-nichter (Member) commented:
Don't use pErr here in a block scope and then again later in another (outer) block scope. They have the same specific name but are different variables, which contradicts the normal expectation that any specifically-named var is the same var wherever that specific name occurs. (This is not true for nonspecifically named vars like the ubiquitous err in Go, and other classics like i in loops, etc.)
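For example (the variable name sizeErr is illustrative):

```go
// Illustrative: give each block-scoped error a distinct, descriptive name
// instead of reusing pErr in different scopes.
if sizeErr := s.updateMaxMetricsPerRequest(dp[0], len(dp)); sizeErr != nil {
	return sizeErr
}
```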

test/mock/tr.go Outdated
package mock

type Tr struct {
TranslateF func(domain, metric string) string
@daniel-nichter (Member) commented:
Strike the trailing F so mock field name = func name. (Other mocks in Blip should follow this convention.)

@Areson (Collaborator, Author) replied:
It looks like the other mocks favor using Func rather than just F. Match that then?

@daniel-nichter (Member) commented:
Correction (as you noted on Slack): "Func" suffix.
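A sketch of the mock with the agreed "Func" suffix; the Translate method and its signature are assumed from the field's type:

```go
package mock

// Tr is a mock metric translator. The function field uses the "Func" suffix to
// match the other mocks in Blip, and the method simply delegates to it.
type Tr struct {
	TranslateFunc func(domain, metric string) string
}

func (t Tr) Translate(domain, metric string) string {
	return t.TranslateFunc(domain, metric)
}
```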

@daniel-nichter (Member) commented:
Squash commits please. GitHub does this thing where it generates release notes from git commits, which is super handy when doing releases. So we'll try to squash commits in the project to make cutting releases a little easier.

@Areson force-pushed the ioberst-dd-sink-message-too-large branch from 5bf3a93 to dfbc000 on January 9, 2023 16:25
Added request partitioning to the Datadog sink when the API method is used. When the sink receives an error indicating the payload is too large, it attempts to calculate a partition size that satisfies the size limit and re-sends the metrics. If it exhausts its options and still cannot send the metrics, the attempt fails.

Added an option to manage compression for the API (on by default).
@Areson force-pushed the ioberst-dd-sink-message-too-large branch from dfbc000 to a99ec0d on January 9, 2023 16:28
@Areson merged commit 4f58ee1 into main on Jan 9, 2023
@Areson deleted the ioberst-dd-sink-message-too-large branch on January 9, 2023 16:39