Metrics API shows incorrect results #8

Closed
vadasambar opened this issue Feb 9, 2022 · 19 comments

Comments

@vadasambar
Collaborator

vadasambar commented Feb 9, 2022

Problem

Datadog metrics API does not show me the correct results.
I have observed the following cases:

  1. Sometimes I get no data from the metrics API, but I can see the data in the UI when I execute the same query with the from_date and to_date from the API response (screenshots: nodata-code, nodata-ui).
  2. Sometimes the data I get from the metrics API does not match the data in the UI when I execute the same query with the from_date and to_date from the API response (screenshots: wrongdata-code, wrongdata-ui).
  3. Sometimes I get the correct data.

Cases 1 and 2 are easily reproducible if you execute the attached code a couple of times. I have set the query time frame to a 24s duration, but if I increase it to, say, 30s, I get data which is close to the data I see in the UI, or I hit case 1 (i.e., no data when I can see the data in the UI).

Please check the attached zip file which contains the code and the screenshots.

I am using a local minikube cluster (runs in a VM). I have installed datadog using

helm install my-datadog-operator datadog/datadog-operator

My current Datadog chart version is datadog-operator-0.7.8.

Code

go.mod
module datadog-api

go 1.16

require github.com/DataDog/datadog-api-client-go v1.6.0

go.sum
cloud.google.com/go v0.34.0/go.mod h1:aQUYkXzVsufM+DwF1aE+0xfcU+56JwCaLick0ClmMTw=
github.com/DataDog/datadog-api-client-go v1.6.0 h1:ccMzM4vw37/8ww9VKKydWMrI+xEs0uE13O5mkG9Ny/8=
github.com/DataDog/datadog-api-client-go v1.6.0/go.mod h1:QzaQF1cDO1/BIQG1fz14VrY+6RECUGkiwzDCtVbfP5c=
github.com/golang/protobuf v1.2.0 h1:P3YflyNX/ehuJFLhxviNdFxQPkGK5cDcApsge1SqnvM=
github.com/golang/protobuf v1.2.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
golang.org/x/net v0.0.0-20180724234803-3673e40ba225/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20190108225652-1e06a53dbb7e h1:bRhVy7zSSasaqNksaRZiA5EEI+Ei4I1nO5Jh72wfHlg=
golang.org/x/net v0.0.0-20190108225652-1e06a53dbb7e/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/oauth2 v0.0.0-20200107190931-bf48bf16ab8d h1:TzXSXBo42m9gQenoE3b9BGiEpg5IG2JkU5FkPIawgtw=
golang.org/x/oauth2 v0.0.0-20200107190931-bf48bf16ab8d/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw=
golang.org/x/sync v0.0.0-20181221193216-37e7f081c4d4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20210220032951-036812b2e83c h1:5KslGYwFpkhGh+Q16bwMP3cOontH8FOep7tGV86Y7SQ=
golang.org/x/sync v0.0.0-20210220032951-036812b2e83c/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
google.golang.org/appengine v1.4.0 h1:/wp5JvzpHIxhs/dumFmF7BXTf3Z+dd4uXta4kVyO508=
google.golang.org/appengine v1.4.0/go.mod h1:xpcJRLb0r/rnEns0DIKYYv+WjYCduHsrkT7/EB5XEv4=

main.go
// Query timeseries points returns "OK" response

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"time"

	datadog "github.com/DataDog/datadog-api-client-go/api/v1/datadog"
)

func main() {
	ctx := datadog.NewDefaultContext(context.Background())
	configuration := datadog.NewConfiguration()
	apiClient := datadog.NewAPIClient(configuration)

	// Query window: starts now, ends 24 seconds later, and is queried
	// immediately after it ends (this is what triggers the mismatch).
	from := time.Now().Unix()
	fmt.Println("from", from)
	time.Sleep(24 * time.Second)
	to := time.Now().Unix()
	fmt.Println("to", to)
	resp, r, err := apiClient.MetricsApi.QueryMetrics(ctx, from, to, "avg:system.load.1{*}.rollup(avg, 24)")

	if err != nil {
		fmt.Fprintf(os.Stderr, "Error when calling `MetricsApi.QueryMetrics`: %v\n", err)
		fmt.Fprintf(os.Stderr, "Full HTTP response: %v\n", r)
		os.Exit(1) // resp is not meaningful after an error
	}

	responseContent, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Fprintf(os.Stdout, "Response from `MetricsApi.QueryMetrics`:\n%s\n", responseContent)
}

How to run

$ export DD_API_KEY="<your-datadog-api-key>" DD_APP_KEY="<your-datadog-app-key>" DD_SITE="<your-datadog-site>"
$ go run main.go

Fix

@vadasambar
Collaborator Author

vadasambar commented Feb 21, 2022

📝

The default collection interval for all Datadog standard integrations is 15 seconds.

https://docs.datadoghq.com/getting_started/integrations/#collection-interval

Find below a summary of Datadog data collection, resolution, and retention:

Product category | Source | Collection Methods | Collection interval | Minimum Resolution | Default Retention
Database Monitoring | Query Metrics | Datadog Agent + enabled integrations | 10 seconds | 1 second | 3 months

https://docs.datadoghq.com/developers/guide/data-collection-resolution-retention/

Not sure how Query Metrics differs from the default collection interval for all Datadog standard integrations here 🤔

Thread on datadog slack: https://datadoghq.slack.com/archives/C3BM8SGKZ/p1645426510859019

@vadasambar
Collaborator Author

Vincenzo Rajo (February 08, 2022 12:31)

Hi Suraj,

Thanks for reaching out today! My name is Vincenzo and I'm part of the Solutions leadership team at Datadog.

I’ve taken a first look at your case and, based on your requirements, have routed it to one of my colleagues. You should expect to hear from our team once an initial investigation has been done.

Kind Regards,

Vincenzo Rajo - Solutions Engineering Team Lead APAC

Datadog Support Platform
Datadog Documentation

Suraj Banakar (February 11, 2022 11:39)

Hi Vincenzo,

Thank you for the response.

Did the team get a chance to look at it yet? If what I am talking about turns out to be a real issue, I think it might impact other customers as well.

Thank you,
Suraj

@vadasambar
Collaborator Author

Vincenzo Rajo (February 11, 2022 11:57)

Hi Suraj,

Your request has been routed to our team of Solutions Engineers to review. Our apologies for the delay you've encountered.

Thanks in advance for your understanding and patience.

Kind Regards,

Vincenzo Rajo - Solutions Engineering Team Lead APAC

Datadog Support Platform
Datadog Documentation

Suraj Banakar (February 16, 2022 12:26)

Hi Vincenzo,

Any updates on this?

Regards,
Suraj

@vadasambar
Collaborator Author

Suraj Banakar (February 21, 2022 13:16)

Not sure if this is related at all

The default collection interval for all Datadog standard integrations is 15 seconds.

https://docs.datadoghq.com/getting_started/integrations/#collection-interval

Find below a summary of Datadog data collection, resolution, and retention:
Product category | Source | Collection Methods | Collection interval | Minimum Resolution | Default Retention
Database Monitoring | Query Metrics | Datadog Agent + enabled integrations | 10 seconds | 1 second | 3 months

https://docs.datadoghq.com/developers/guide/data-collection-resolution-retention/

Not sure how Query Metrics differs from the default collection interval for all Datadog standard integrations here 🤔

Thread on datadog slack: https://datadoghq.slack.com/archives/C3BM8SGKZ/p1645426510859019

Suraj Banakar (February 23, 2022 12:32)

Hi Vincenzo,
I would really appreciate an update on this ticket.

Thank you,
Suraj

@vadasambar
Collaborator Author

vadasambar commented Mar 2, 2022

image

It has been 22 days since I created the support ticket. I haven't received any meaningful response from Datadog support which addresses the problem I have outlined in this issue. I am giving up on this for now because I don't think I am going to get a response. If I do get a response, I will post it as a comment on this issue.

If you face an issue where the metrics being displayed in Datadog don't match with the results in Keptn, please create a new issue, link this issue and mention me. Sorry for the inconvenience and thank you for understanding. 🙇

@vadasambar
Collaborator Author

Received a response!

Dustin Rothschild (April 14, 2022 04:53)

Hi Suraj,

Thanks so much for reaching out, and my apologies for the delay here. My name is Dustin and I'm part of the technical solutions team here at Datadog, I'll be assisting you further with this.

To begin, it sounds like you're seeing different values for the same metric query when queried in the UI vs. the API, is this correct? If so, would you mind sending a link over to the notebook you're referencing here? I was able to view your screenshots in your original correspondence, however, was unable to replicate this on my end.

However, if you're viewing different data for the same query in a notebook vs. API, this could be related to default interpolation linear.
image

Essentially, default interpolation linear works by taking the avg of point A and point C to plot a value for point B. This "draws a straight line" between the two points. As an example, let's say we have the following timeseries:

[3, N/A, 6]

fill(linear) would do:

(3 + 6) / 2 = 4.5

Let me know if this information is helpful and otherwise, I believe a link to a notebook you're referencing for this will be very helpful for further investigation here. Thanks again for your help and patience with working through this and I hope you're having a great week.

Kind regards,

Dustin Rothschild | Solutions Engineer | Datadog

Datadog Support Platform
Datadog Documentation

@vadasambar
Collaborator Author

Suraj Banakar (April 19, 2022 12:02)

Hi Dustin,

Thank you for reaching out.

So the problem happens in the following scenario:

  1. My K8s cluster is emitting metrics
  2. I query the latest metrics from the API by setting toDate to the current date/time and fromDate to the current date/time minus 15-30s
  3. API result is different from the UI
  4. Wait 30s and run the API query (aka the code which does this) again and you see the same result as the UI

If you do the above, you should be able to reproduce the issue. I just used the default system.load.1 query that you get when you create a new notebook. You can try that.

Is the default interpolation applied for both the UI and API? If it's say applied on the UI and not in the API there might be some correlation but I don't see how that would explain 4 (maybe I'm missing something?).

Regards,
Suraj

On Mon, Apr 18, 2022 at 8:36 PM, Datadog Support <support@datadog.zendesk.com> wrote:

@vadasambar
Collaborator Author

Dustin Rothschild (April 21, 2022 01:11)

Thanks for the follow up and apologies for the delay here as I was out sick for a few days. While investigating this further, I believe this is related to time aggregation and the .rollup(24) being placed here with interpolation.

In your previous file wrongdata-code.png, this query is over 24 seconds from Monday, February 7, 2022 10:27:26 PM (US - West Time) to Monday, February 7, 2022 10:27:50 PM
image
When we look at this in a notebook without .rollup(24), we can see that there is a value of 0.18 (as is shown in the output above in your API call), and 15 seconds later, there is a value of 0.14 as shown here.
image
image
When applying the .rollup(24) in your original query, this changes to 0.16 as you noted in your original correspondence.
image
The average of 0.18 and 0.14 is 0.16. Therefore, it appears that this .rollup() is grouping these two values together and taking their average, whereas in your API call the time range ends at 10:27:50, so the 0.14 value cannot be rolled up. Let me know if this makes more sense or if you have any further questions / requests regarding this issue, and I hope you're having a great week.

Thank you for your time,

Dustin Rothschild | Solutions Engineer | Datadog

Datadog Support Platform
Datadog Documentation

@vadasambar
Collaborator Author

Suraj Banakar (April 21, 2022 14:23)

Hey Dustin,
Hope you are feeling better.

I am still confused about this to be honest.
I am confused about how interpolation plays a role here:

  1. I tell the API to get me the value of `avg:system.load.1{*}.rollup(24)` with fromDate and toDate = fromDate+24s. Since there is only 1 datapoint in this range, I get 0.18 as the value
  2. I check the UI with the same query and the same fromDate and toDate but I see a different value until some time (30s) passes and the API starts reflecting the same value

My understanding is, no matter how interpolation works, if it works the same for both UI and API I shouldn't have to wait 30s for the API to start reflecting the values I see in the UI properly.

If getting on a short call might help speed up our interaction, let me know!

@vadasambar
Collaborator Author

Dustin Rothschild (April 28, 2022 22:51)

Hi Suraj,

Thanks again for your patience and your kind words. I discussed this further with our team experts and confirmed that our API actually has 1 second granularity which is why you're seeing 0.18 here at this one second interval instead of 0.16 in the UI with the .rollup(24). My apologies for my previous correspondence. Let me know if this information is more helpful or if you have any further questions / requests regarding this issue and I hope you're having a great week.

Thank you for your time,

Dustin Rothschild | Solutions Engineer | Datadog

Datadog Support Platform
Datadog Documentation

@vadasambar
Collaborator Author

Suraj Banakar (April 29, 2022 11:51)

Hi Dustin,
Thank you for the response and reaching out to your team.

I think your response still doesn't answer my question around the delay. Maybe I am not explaining it well enough. My problem is really the delay in the metrics API.
Example:

  1. Time right now: 12:00:00, avg:system.load.1{*} is 0.5 for example
  2. 15 seconds pass (<- this can be any amount of time)
  3. Time right now: 12:00:15, avg:system.load.1{*} is 0.8 for example (let's assume there were no data points after 0.5)
  4. I query the metrics API at 12:00:15 for avg:system.load.1{*}.rollup(15) from 12:00:00 to 12:00:15. This should give me (0.5 + 0.8) / 2 = 0.65 as the result (correct me if I am wrong. Even if my calculation is wrong, I think it doesn't matter because the problem is something else)
  5. At 12:00:15 instead of 0.65 (or whatever the correct value is), I get 0.56 (a lesser value). When I run the same query in datadog notebook with the same start and end time, I get 0.65 (i.e., the correct value)
  6. I retry the query with the metrics API for 12:00:00 to 12:00:15 at 12:00:20 (i.e., 5 seconds later), I get a result like 0.6 (closer to the correct value). UI shows 0.65 correctly
  7. I retry the query with the metrics API for 12:00:00 to 12:00:15 at 12:00:30 (5+ 10 = 15 seconds later), I get a result like 0.63 (closer to the correct value). UI shows 0.65 correctly
  8. I keep retrying the query with the metrics API until I get the correct value. I usually get correct value 30s after 12:00:15

What I want to know is, is this 30s delay expected and does it apply to all other metrics as well? i.e., metrics other than system.load.1 because I tried the scenario mentioned in 1-8 for a couple of other metrics and I saw the same problem there as well.

Thank you for your time and have a great weekend,
Suraj

@vadasambar
Collaborator Author

Had a conversation about this issue in the Keptn Slack. Dynatrace and Sumo Logic also have a similar problem, which makes me think keeping the delay might be the right way to do things.

Relevant excerpts from the slack thread:

Suraj Banakar

Context: I am trying a PoC for retrieving the result of a custom query from Sumo Logic API. This is for Sumo Logic SLI integration. Here's what the PoC does in a nutshell:

var customQuery = "<custom-query-here>"
var from int64 = time.Now().Unix()
time.Sleep(time.Second * 60)
var to int64 = time.Now().Unix()
result, _ := querySumoLogicAPI(customQuery)

I did a similar PoC for Datadog. The problem I have faced in both the cases is, result is often 0 or a value which is not the same as what is shown in the UI. I need to wait for some time before I start getting the correct result.

time.Sleep(time.Second * 30)
result, _ := querySumoLogicAPI(customQuery)

When I was writing the integration for datadog I thought it was a Datadog problem but now I am seeing a similar problem in Sumo Logic as well.
I just wanted to ask if the community has faced a similar problem with our other SLI integrations which query a remote SLI provider, e.g., Dynatrace or SignalFX. I checked the code for the Dynatrace SLI integration but I don't see any wait/sleep or exponential backoff being used anywhere.

Andreas Grabner

Hi there. I think this is a problem of ALL tools. In Dynatrace we do actually wait: it's 2 minutes. But that wait is only enforced if you are querying data that is "younger" than 2 minutes. So if you query the last 10 minutes, then we would wait 2 minutes to make sure Dynatrace has all data available.
I think some type of wait is just necessary because none of those observability tools will have "instant data" available. Hope this helps.

Christian Kreuzberger

Not all tools 😛 We don't face this problem with prometheus-service. But as Andi pointed out, the problem exists in many tools, including Dynatrace. I will share a code snippet with dynatrace-service with you.
https://github.com/keptn-contrib/dynatrace-service/blob/ab120e8373e03c8a5f2c7803c179ff6b7bfb71d7/internal/sli/get_sli_triggered_event_handler.go#L69-L116
Mind that this code might be outdated, and just represents the state Andi and I had it in 2020 😄

Suraj Banakar

Thank you Christian! Looks like we are defining the delay here in the code now.

@vadasambar
Collaborator Author

Fatine Bentires (May 09, 2022 15:19)

Hi Suraj,

Thanks for your patience while we have been investigating this.

My name is Fatine, I’m a Tier 2 Metrics Product Specialist who has been working closely with engineering on your issue. This is a quick note to let you know that I'll be handling this ticket and all communications directly moving forward to streamline our investigation process, and get your issue resolved as quickly as possible.

All the best,

Fatine | Solutions Engineer, T2 - Metrics @ Datadog

Datadog Support Platform
Datadog Documentation

@vadasambar
Collaborator Author

Fatine Bentires (May 12, 2022 18:30)

Hey Suraj,

Following up on your request. Thanks for your patience!

My understanding is that you need to wait ~30seconds before seeing the same result in the API's output as what you get in the UI. Would you mind letting us know if you are experiencing the same behaviour without adding .rollup() to your API query? Having the output of your API script as well as a link to the corresponding value in the Notebook would be really appreciated, and will help us with further investigation.

Looking forward to hearing from you!

All the best,

Fatine | Solutions Engineer, T2 - Metrics @ Datadog

Datadog Support Platform
Datadog Documentation

@vadasambar
Collaborator Author

Fatine Bentires (May 16, 2022 12:46)

Hi Suraj,

Following our conversation, I'm reaching out to see if you received my last message and if you still need assistance regarding this case. Please feel free to reach out again!

Best regards,

Fatine | Solutions Engineer, T2 - Metrics @ Datadog

Datadog Support Platform
Datadog Documentation

@vadasambar
Collaborator Author

Fatine Bentires (May 18, 2022 14:09)

Hi Suraj,

I haven't heard back from you so I am going to close your ticket for now.
However, if you would like to open it back up, please reply to this.

Best regards,

Fatine | Solutions Engineer, T2 - Metrics @ Datadog

Datadog Support Platform
Datadog Documentation

@vadasambar
Collaborator Author

vadasambar commented Nov 22, 2022

Suraj Banakar (May 19, 2022 12:39)

Hi Fatine,
Sorry for the late reply.

My understanding is that you need to wait ~30seconds before seeing the same result in the API's output as what you get in the UI.

This basically answers my question. Thank you.

I am fine with closing this ticket.

Regards,
Suraj

@vadasambar
Collaborator Author

Fatine Bentires (May 25, 2022 13:29)

Hey Suraj,

Apologies for the delay!

Since your request is fulfilled, I am going to close your ticket for now.
However, if you would like to open it back up, please reply to this.

Best regards,

Fatine | Solutions Engineer, T2 - Metrics @ Datadog

Datadog Support Platform
Datadog Documentation

@vadasambar
Collaborator Author

This confirms

I have set the query time frame to a 24s duration, but if I increase it to, say, 30s, I get data which is close to the data I see in the UI, or I hit case 1 (i.e., no data when I can see the data in the UI)

30s+ delay seems like the right solution to fix this problem.
