-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Partitioning to Datadog Sink #88
Conversation
e69b2bd
to
711ad7f
Compare
docs/v1.0/sinks/datadog.md
Outdated
|
||
## API Considerations | ||
|
||
When using the API directly the sink may run into issues when sending large payloads, as the Datadog API [restricts the payload to a maximum of 500Kib](https://docs.datadoghq.com/api/latest/metrics/). The DogStatsD option avoids this as it always sends metrics individually. By default the API option will attempt to automatically adjust the size of the payloads sent when it encounters an error related to a payload being too large. You can also explicitly set the number of metrics to send at once using `api-partition-size` and setting `api-auto-partition` to `false`. You can optionally turn on compression for the API, in which case Datadog will accept any payloads whose uncomressed size is less than 5Mib. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
500 KiB
-- bytes, not bits, I presume?
e1aba3b
to
80d352e
Compare
docs/v1.0/sinks/datadog.md
Outdated
@@ -32,4 +32,5 @@ sinks: | |||
metric-translator: "" | |||
metric-prefix: "" | |||
dogstatsd-host: "" | |||
``` | |||
api-use-compression: "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
api-compression
(or api-compress
if you want to mimic the DD client). Blip doesn't use words like "enable", "use", etc. The only such on/off type word is the consistent prefix disable-...
.
sink/datadog.go
Outdated
@@ -104,6 +119,9 @@ func NewDatadog(monitorId string, opts, tags map[string]string, httpClient *http | |||
} | |||
d.prefix = v | |||
|
|||
case "api-use-compression": | |||
d.compress = strings.ToLower(v) == "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d.compress = blip.Bool(v)
https://pkg.go.dev/github.com/cashapp/blip#Bool
sink/datadog.go
Outdated
return pErr | ||
} | ||
|
||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be handled before checking r.StatusCode
because err
will be a low-level error (e.g. network issue), so there's no *http.Response
. Afaik, if err is not nil, then there must be an *http.Response
.
sink/datadog.go
Outdated
monitorId: monitorId, | ||
tags: tagList, | ||
resources: resources, | ||
partitionSize: math.MaxInt32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's clarify "partition size" for future engineers who'll have to understand this code. The first confusing thing is partition (size of) what? We've got MAX_PAYLOAD_SIZE
which is clear (max payload sizes are a common thing). So what's not clear is that partitionSize
is actually something like "metrics per Send request", if I'm not mistaken? Then that plus starting with MaxInt32 makes it seems like we'll have to partition a lot (when needed) because 2^32= ~4 billion, yet we only deal with metrics in the 10s or 1000s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can update the name. My thought with starting with MaxInt32 was that we begin with no partitioning of the metrics. We only start to do that if we happen to run into the case where we get a error response back from Datadog indicating the payload is too large. With compression on that is much less likely to happen in general. MaxInt32 is just a nice default to set this to instead of just arbitrarily picking a value I thought was big enough.
sink/datadog.go
Outdated
} | ||
|
||
return err | ||
// Figure out how many metrics can fit inside the max payload, but pad it slightly to control for headers, etc. | ||
estPartitionSize := (MAX_PAYLOAD_SIZE - 1000) / size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's another example of where partition size is confusing: comment says "how many metrics", but it's calculating a value in bytes (based off MAX_PAYLOAD_SIZE and current payload size). I don't see where code does something like "N metrics = B bytes, so new partition size is something fraction of N".
sink/datadog.go
Outdated
|
||
if estPartitionSize >= s.partitionSize { | ||
// The new partition size is greater than what we have now, so let's reduce it by a small factor to make sure we can send metrics properly | ||
estPartitionSize = int(float32(s.partitionSize) * .9) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment/code will need some clarification for future engineers. What does "properly" mean? And why 0.9? Maybe something like:
// Reduce partition size by 10% as a heuristic/guess for finding the maximum partition size
// (metrics per send) that'll fit within the DD payload size limit
sink/datadog.go
Outdated
} | ||
|
||
blip.Debug("error sending data points to Datadog: %s", err) | ||
blip.Debug("error sending data points to Datadog: Http Response - %s", r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combine these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh actually, you need a return nil
here to handle other error so you can return nil
below. I.e. as the code is written now, it's mixing boilerplates, which is confusing. I.e. the bloilerplat,
err := f()
if err = nil {
return err
}
isn't happening.
sink/datadog.go
Outdated
blip.Debug("error sending data points to Datadog: Http Response - %s", r) | ||
} | ||
|
||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil
because the if
above already checked that it's not nil, so an explicit return nil
is idiomatic to show that func has reached a successful end/return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By contrast, this kind of return err
is used when a func returns the last return val of its last call without checking that return value.
sink/datadog.go
Outdated
oldMaxMetricsPerRequest := s.maxMetricsPerRequest | ||
|
||
// Sample one of the metrics and try caculating a new maxMetricsPerRequest | ||
if pErr := s.updateMaxMetricsPerRequest(dp[0], len(dp)); pErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use pErr
here in a block scope but then again later in another (outer) block scope. They have same special name but are different, which contradicts the normal expectation that any specifically-named var is the same var wherever the specific name occurs. (This is not true for nonspecifically named vars like the ubiquitous err
in Go, and other classics like i
in loops, etc.)
test/mock/tr.go
Outdated
package mock | ||
|
||
type Tr struct { | ||
TranslateF func(domain, metric string) string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strike the trailing F
so mock field name = func name. (Other mocks in Blip should follow this convention.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the other mocks favor using Func
rather than just F
. Match that then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction (as you noted on Slack): "Func" suffix.
Squash commits please. GitHub does this thing where it generates release notes from git commits, which is super handy when doing releases. So we'll try to squash commits in the project to make cutting releases a little easier. |
5bf3a93
to
dfbc000
Compare
Added request partitioning to the Datadog sink when the API method is used. When receiving an error related to the payload size being too large the sink will attempt to calculate a possible size that will satisfy the size requirements and re-send the metrics. In the event that it exhausts options and cannot send the metrics the attempt will fail. Added am option manage compression for the API (on by default).
dfbc000
to
a99ec0d
Compare
Added options for compression and partitioning to the Datadog sink when the API method is used. When receiving an error related to the payload size being too large the sink will attempt to calculate a possible size that will satisfy the size requirements and re-send the metrics. In the event that it exhausts options and cannot send the metrics the attempt will fail.
Added the option to turn on compression for the API, which allows much larger payloads by default.