Add Datadog contrib for monitoring purpose #2434

thisiscab · 2018-05-30T17:47:53Z

Description

Datadog is a tool that lets you send metrics, build dashboards from them,
and add alerting on specific behaviors.

Adding this contrib will allow users of this tool to log their pipeline
information to Datadog.

Motivation and Context

Based on the status change of a task, we log that information to Datadog
with the parameters that were used to run that specific task.

This allows us to easily create dashboards to visualize pipeline health. For
example, we can be notified via Datadog if a task has failed, or we can
graph the execution time of a specific task over a period of time.

The implementation idea was strongly based on the stale PR
#2044.

Have you tested this? If so, how?

We've been using this contrib for multiple months now (maybe a year?) at Glossier. It is our main point of reference for the health of our pipeline.

thisiscab · 2018-05-30T18:06:25Z

Interesting, I was certain that I had run the spec suite locally to make sure that all was well.
I'll make sure that the problems are resolved ASAP!

dlstadther

Is there any way to add tests of this contrib module?

thisiscab · 2018-05-30T18:35:52Z

@dlstadther I'll take a look at how we could test this contrib without it being painful. Thanks for your input!

thisiscab · 2018-06-06T15:38:43Z

@dlstadther Hey! I've added a bunch of fun tests to make sure everything is working, but I'm having a bunch of trouble running the spec suite locally. When I run each new test individually, everything works, but I want to make sure that running the whole suite works fine.

Travis seems to always fail on the docs job, and the error displayed is not something I know how to fix. Do you have suggestions?
Travis also seems to have timed out on some specs, and I don't have the power to re-run the suite. What should I do in that situation? Would you like me to rebase to trigger a rebuild, or shall I ask you to explicitly re-run the spec suite?

Let me know how to proceed further! :)

dlstadther · 2018-06-06T20:57:23Z

I've not seen that doc failure before.

I've restarted the build and will review again after the build completes. (Thanks for your efforts!)

dlstadther · 2018-06-06T21:04:00Z

Docs tests are still failing with the same error.

Are you able to build the docs locally? `tox -e docs`

thisiscab · 2018-06-07T15:58:01Z

@dlstadther Docs are running fine locally:

`tox -e docs` yields

_____________________________________________________ summary _________________________________________________________
  docs: commands succeeded
  congratulations :)

Also, when I run locally the tests that seem to hang on Travis, they all succeed. Not sure what this is all about :/

I think it's worth investigating deeper.

dlstadther · 2018-06-07T18:27:42Z

@Tarrasch Have you experienced this sort of Travis doc failure or test exit before?

Tarrasch · 2018-06-16T16:55:16Z

No idea, I restarted an older build to know if it's due to changes in this PR or not. Maybe Travis just had a bad day?

dlstadther · 2018-07-09T11:28:15Z

@cabouffard Could you resolve the conflicts? Then let's see if the build can be fixed!

thisiscab · 2018-07-09T16:43:13Z

@dlstadther I've tried many different things, but I can't figure out why the specs, with the addition of this PR, keep hanging on Travis.

I also can't figure out why the docs keep failing on Travis while I can run it on my local machine without any problems.

When I try to run tox -e py27-nonhdfs on my local machine, I have a bunch of different specs that fail due to many different kinds of errors. Here is a gist that shows what is failing locally: https://gist.github.com/cabouffard/5e20a4cf787752cb0caebdbc467e9ac0.

Fortunately, those specs aren't failing on Travis, but I can't reproduce Travis' behavior locally. I would need help on this.

dlstadther · 2018-07-09T17:34:37Z

This is bizarre... Nothing is standing out to me as an issue and yet there is clearly a problem

thisiscab · 2018-07-10T15:02:10Z

@dlstadther What would you suggest as the next step? Can you investigate the issue on your side, or can any other collaborator jump in and give us their point of view on the matter?

Thanks! :)

thisiscab · 2018-07-20T14:12:22Z

@dlstadther Hey, what's the latest on this one? :)

dlstadther · 2018-07-20T15:03:21Z

So sorry @cabouffard - been kinda busy and unintentionally neglected this. Hoping to have some more time next week - I'll set myself a reminder! Thanks for your patience :)

dlstadther · 2018-07-20T18:19:13Z

(Found a little time today to look into this briefly)

@cabouffard I've downloaded this branch locally and tried building docs (tox -e docs) on 2.7.15 and 3.7.0.

2.7.15 failed with the import error.
3.7.0 was successful.

(Spitballing here a bit...) I'm inclined to suspect the `from enum import Enum` import, which only exists in the standard library on Python 3.4+, as the source of the issue here, with the error trickling down to fail other imports.

I've had personal experience with unittest where I've received misleading errors due to package import errors.

Mind looking into a pre-3.4 solution for Enum and seeing if that resolves the issues here?

Thanks!
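
For readers following the thread, one way to keep a module importable on Python 2.7 is a guarded import, roughly as sketched below. This is only an illustration under assumptions, not necessarily the change that was made in this PR: the fallback `Enum` class, the `MetricsCollectors` members, and the `describe` helper are all hypothetical names.

```python
# Sketch only: a guarded import so the module still loads on Python < 3.4.
# The fallback class and the member names below are hypothetical.
try:
    from enum import Enum  # standard library on Python 3.4+
except ImportError:
    class Enum(object):
        """Minimal stand-in: subclasses simply expose plain class attributes."""


class MetricsCollectors(Enum):
    none = 1
    datadog = 2


def describe(collector):
    # Comparing against the class attribute works both for real Enum members
    # and for the plain-integer attributes of the fallback class.
    if collector == MetricsCollectors.datadog:
        return "datadog"
    return "none"
```

The trade-off is that the fallback gives up Enum features such as iteration and name lookup, so it only suits code that compares members directly.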

thisiscab · 2018-08-02T16:34:08Z

It has been my turn to be busy! Very clever find about the Enum, I would have expected a better error message. Thanks for taking the time to help me out on this! I'll make the additional changes right now! :)

The original implementation was made when 0.16.0 was the latest version. Since then, there have been a few improvements and bug fixes made to the library that we should be using. Reading through the release log, there shouldn't be any feature-breaking changes, so we should be good to update it!

There were multiple problems that needed to be solved in order to get the specs green again. Each spec passed when run individually, but when run under tox as a group, some would pass and others would fail depending on the tox environment. It came to my attention that the time function of this file was creating an issue with other specs because we were not tearing it down as expected. Also, using setTime within the setUp group had side effects with unexpected behaviors. Then, the way that the task_id and task_family were named was also causing problems with the same specs that were failing before. I'm unsure why this would be the case, but changing only one of them still fails, while changing both turns the specs green. Finally, the last spec would always fail because setTime was called AFTER the task had actually been run, which would always cause the execution time to be greater than 0. My understanding of all of this is still a bit fuzzy, but hey, now the spec suite passes.
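
The kind of cross-test leakage described above can be illustrated with a small, self-contained sketch. It does not reproduce the actual test files or the `setTime` helper from this PR; the test class names are hypothetical, and it assumes Python 3's `unittest.mock`. The idea is to freeze the clock before the code under test runs, and to register the teardown with `addCleanup` so the real `time.time` is always restored.

```python
import time
import unittest
from unittest import mock


class ExecutionTimeTest(unittest.TestCase):
    def setUp(self):
        # Patch time.time for this test only. addCleanup restores the real
        # function even if the test fails, so other tests are not affected.
        patcher = mock.patch("time.time", return_value=1000.0)
        self.fake_time = patcher.start()
        self.addCleanup(patcher.stop)

    def test_execution_time_is_zero_when_clock_is_frozen(self):
        started = time.time()   # the clock is frozen BEFORE the "task" runs
        finished = time.time()
        self.assertEqual(finished - started, 0)


class UnrelatedTest(unittest.TestCase):
    def test_real_clock_is_back(self):
        # If the patch leaked out of the previous test case, time.time
        # would still be a mock object here.
        self.assertNotIsInstance(time.time, mock.Mock)
```

Freezing the clock after the task has already run would leave a real, non-zero elapsed time in the measurement, which matches the last failure described above.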

thisiscab · 2018-12-17T19:51:31Z

@dlstadther Sorry for all the force pushes, I've had trouble with my local environment so I decided to test my commits directly on Travis :)

Now that the specs are all passing, I've tested this branch on our development system and all seems to work accordingly. I think we can finally 🚢 it!

Thank you so much for the help!

dfeldstarsky · 2018-12-17T20:32:15Z

Hey yall - just wanna say that we're really excited to be able to use this soon! Good luck getting it past the finish line, can't wait to try to measure our pipeline w/ this integration.

dlstadther · 2018-12-17T20:33:32Z

Thanks for the long, hard work @cabouffard !

thisiscab · 2018-12-17T20:37:46Z

🎉

thisiscab force-pushed the feature/add-datadog-contrib branch from 5ed031c to 3d4f9d2 on May 30, 2018 18:23

dlstadther reviewed May 30, 2018

thisiscab force-pushed the feature/add-datadog-contrib branch 4 times, most recently from a87b877 to b5de03b on June 4, 2018 20:11

thisiscab force-pushed the feature/add-datadog-contrib branch from 0ff975f to 700172d on June 7, 2018 15:49

thisiscab force-pushed the feature/add-datadog-contrib branch 2 times, most recently from e4b6c0c to 0d30791 on July 9, 2018 16:39

dlstadther reviewed Jul 20, 2018

luigi/metrics.py

dlstadther reviewed Jul 20, 2018

tox.ini (outdated)

dlstadther mentioned this pull request Jul 31, 2018

Visible parameter for luigi.Parameters #2278

Merged

thisiscab force-pushed the feature/add-datadog-contrib branch from 0d30791 to e09888c on August 2, 2018 16:47

thisiscab requested a review from Tarrasch as a code owner August 2, 2018 16:47

thisiscab added 9 commits December 13, 2018 10:19

Refactor MetricsCollectors in Scheduler

14c1ebb

I've also added a few tests to ensure that the implementation works well.

Add polish + tests around metrics on task state

bacae6e

This ensures that the proper metrics collection calls are made when they are expected. We've also removed a few `@RPC_METHOD`s that weren't actually used or required.

Add tests related to the Datadog contrib

7668cf0

This makes sure that we're properly dispatching API and STATSD calls with the proper parameter values to Datadog. This doesn't test every possible parameter configuration.
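
As a rough illustration of this style of test (not the tests added in this commit), a collector can be handed a mock client so that the test asserts on the exact call it dispatches. Everything here is hypothetical: `SketchCollector`, the metric name `luigi.task.failed`, and the tag format are assumptions made for the example, and no real Datadog client is involved.

```python
import unittest
from unittest import mock


class SketchCollector(object):
    """Hypothetical collector that forwards task events to an injected client."""

    def __init__(self, statsd_client, default_tags=None):
        self._statsd = statsd_client
        self._default_tags = list(default_tags or [])

    def handle_task_failed(self, task_family):
        # Combine the configured tags with a per-task tag, then emit a counter.
        tags = self._default_tags + ["task_name:{}".format(task_family)]
        self._statsd.increment("luigi.task.failed", tags=tags)


class SketchCollectorTest(unittest.TestCase):
    def test_failed_task_increments_counter_with_tags(self):
        statsd = mock.Mock()
        collector = SketchCollector(statsd, default_tags=["environment:test"])

        collector.handle_task_failed("MyTask")

        # The assertion pins down both the metric name and the tag values.
        statsd.increment.assert_called_once_with(
            "luigi.task.failed",
            tags=["environment:test", "task_name:MyTask"],
        )
```

Injecting the client keeps the test independent of network access and of the installed Datadog library version.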

Improve configuration documentation with new Datadog contrib

6556b3c

This adds a few extra lines of documentation for the configuration so that users can find all the settings they can tweak for each contrib in one place, instead of having to go through each individual contrib file.

Change metrics collection getter to class method

a67f2af

Previously, the getter wasn't a class method and wouldn't work as expected. In order to ensure that the output is what we expect, we've added more tests.
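
A minimal sketch of why the getter needs to be a class method, assuming a registry-style class: the names `MetricsCollectors`, `NoMetricsCollector`, and `DatadogSketchCollector` below are illustrative stand-ins, not the code from this commit. Without `@classmethod`, calling `MetricsCollectors.get("datadog")` on the class would bind the string to the first parameter and fail with a missing-argument error.

```python
class NoMetricsCollector(object):
    """Hypothetical collector that ignores every event."""


class DatadogSketchCollector(object):
    """Hypothetical stand-in for a Datadog-backed collector."""


class MetricsCollectors(object):
    none = "none"
    datadog = "datadog"

    @classmethod
    def get(cls, which):
        # As a classmethod, this can be called as MetricsCollectors.get(...)
        # without creating an instance first.
        if which == cls.none:
            return NoMetricsCollector()
        if which == cls.datadog:
            return DatadogSketchCollector()
        raise ValueError("unknown metrics collector: {!r}".format(which))
```

Raising on an unknown value is a choice made for this sketch; a real implementation might fall back to a default collector instead.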

Refactor the datadog_metric tests

7756b37

thisiscab force-pushed the feature/add-datadog-contrib branch from 84498cf to 7756b37 on December 13, 2018 15:20

dlstadther reviewed Dec 13, 2018

luigi/metrics.py (outdated)

thisiscab force-pushed the feature/add-datadog-contrib branch from 98ad6dc to a7df8f0 on December 14, 2018 16:02

thisiscab added 2 commits December 14, 2018 11:22

Abstract MetricsCollector class

a7f76fc

This forces anyone who subclasses this class to implement its methods.
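
The effect of an abstract base class can be sketched as follows. This is an illustration written for Python 3 (`abc.ABC`), not the code from this commit; the handler method names and the two subclasses are assumptions made for the example.

```python
import abc


class MetricsCollector(abc.ABC):
    """Abstract interface: every concrete collector must define these hooks."""

    @abc.abstractmethod
    def handle_task_started(self, task):
        pass

    @abc.abstractmethod
    def handle_task_failed(self, task):
        pass

    @abc.abstractmethod
    def handle_task_done(self, task):
        pass


class IncompleteCollector(MetricsCollector):
    # Only one of the three hooks is implemented, so instantiating this
    # class raises TypeError.
    def handle_task_started(self, task):
        pass


class CountingCollector(MetricsCollector):
    """Hypothetical complete collector that just counts finished tasks."""

    def __init__(self):
        self.done = 0

    def handle_task_started(self, task):
        pass

    def handle_task_failed(self, task):
        pass

    def handle_task_done(self, task):
        self.done += 1
```

The error surfaces when the incomplete class is instantiated, rather than later when a missing hook is first called.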

Kwargs on the _send_event call

5ba5917

This allows for less-strict function calls.

thisiscab force-pushed the feature/add-datadog-contrib branch from a14df58 to 5ba5917 on December 14, 2018 16:23

Fix metrics collector

d387b99

thisiscab force-pushed the feature/add-datadog-contrib branch from 3e04070 to d387b99 on December 14, 2018 18:44

Change DataDog scheduler_api_tests

e876d9c

thisiscab force-pushed the feature/add-datadog-contrib branch from 7d9077f to e876d9c on December 17, 2018 19:26

dlstadther reviewed Dec 17, 2018

luigi/contrib/datadog_metric.py

Change default_tags of DatadogMetricsCollector to a property

d43c478

The underlying configuration of the Datadog metrics collector is a property, so it makes more sense that it's also a property when used within the class itself.

dlstadther approved these changes Dec 17, 2018

dlstadther merged commit b2a5759 into spotify:master Dec 17, 2018

victoriaalee mentioned this pull request Jan 17, 2019

Add Prometheus contrib for monitoring purpose #2628

Merged

thisiscab deleted the feature/add-datadog-contrib branch January 25, 2019 20:00

tophers42 mentioned this pull request Jul 12, 2019

Datadog metrics collector sends all params as tags #2738

Closed
