Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datadog Integration (#3407) #3619

Conversation

natemollica-nm
Copy link
Contributor

Backport

This cherry-picked PR has been manually generated from #3407 to be assessed for backporting as automatic cherry-picking using the label failed.

The below text is copied from the body of the original PR.


Changes proposed in this PR

PR Resubmitted from Branch vice Fork for Automation Testing

  • Expose several metrics specific features on consul-k8s to include:
  • Introduce a means to ease the integration and operation of integrating with Datadog Agent metrics collection via fail-safe helm override value parameters. Overrides are intended to allow operators to easily configure 1 of the 3 following methods of monitoring Consul with Datadog on Kubernetes:
    • DogStatsD via one of either "UDP" or "UDS" transport protocols
    • OpenMetrics via Datadogs Autodiscovery feature to scrape the /v1/agent/metrics?format=prometheus endpoint
    • Datadog + Consul Integration Feature standard checks:
      • Serf events and member flaps
      • The Raft protocol
      • DNS performance
      • API Endpoints Health Checks:
        • /v1/agent/metrics?format=prometheus
        • /v1/agent/self
        • /v1/status/leader
        • /v1/status/peers
        • /v1/catalog/services
        • /v1/health/service
        • /v1/health/state/any
        • /v1/coordinate/datacenters
        • /v1/coordinate/nodes
  • Introduces server-acl-init token creation for OpenMetrics and Datadog Consul Integration check methods allowing default minimal acl token permission generation for Datadog agent usage as necessary.

How I've tested this PR

  • New ACL Token Testing as outline in the CONTRIBUTING.md steps.
  • Deployment and testing of local consul-dev (main) and consul-k8s-control-plane-dev (datadog-integration branch) images on k3d test cluster for each scenario. Test repository here.
  • Verification of helm templating for new value overrides added as instructed in CONTRIBUTING.md steps. bats ./charts/consul/test/unit --jobs 8 - ran successfully for all tests.

How I expect reviewers to test this PR

  • Assess the need for additional unit testing creation and verification.
  • If possible:
    • Reach out with any question/concerns or reasons for PR push-back.
    • Verification of fail-safe interlocks between the 3 methods of integration mentioned above.
    • Verification of ACL policy implementation.

Checklist


Overview of commits

* datadog-integration: updated consul-server agent telemetry-config.json with dd specific items as well as additional missing VM based options, unit tests, dd unix socket integration, dd agent acl token generation, deployment override failsafes

* datadog-integration: updated consul-server agent telemetry-config.json with dd specific items as well as additional missing VM based options, unit tests, dd unix socket integration, dd agent acl token generation | final initial-push

* changelog entry update

* datadog-integration: updated consul-server agent server.config (enable_debug) and telemetry.config update | enable_debug to server.config

* curt pr review changes (minus extraConfig templating verification changes)

* global.metrics.AgentMetrics -> global.metrics.enableAgentMetrics

* dogstatsd and otlp mutually exclusive verification checks

* breaking changes now incorporated into consul.validateExtraConfig helper template function as precheck

* extraConfig hash updates post merge conflict update

* fix helpers.tpl consul.extraConfig from merge --> /consul/tmp/extra-config/extra-from-values.json | add labels to rolebinding for datadog secrets

* update changelog .txt to match new PR number

* updated server-statefulset.yaml to correct ad.datadoghq.com/consul.logs annotation to valid single quote string

* fix helpers.tpl consul.extraConfig from merge --> /consul/tmp/extra-config/extra-from-values.json | add labels to rolebinding for datadog secrets

* fix helpers.tpl consul.extraConfig from merge --> /consul/tmp/extra-config/extra-from-values.json | add labels to rolebinding for datadog secrets

* update UDP dogstatsdPort behavior to exclude including a port value if using a kube service address (as determined by user overrides)

* update _helpers.tpl consul.ValidateDatadogConfiguration func to account for using 'https' as protocol => should fail

* update server-statefulset.yaml to exclude prometheus.io annotations if enabling datadog openmetrics method for consul server metrics scrape. conflict present with http vs https that breaks openemtrics scrape on consul

* update server-statefulset.yaml to exclude prometheus.io annotations if enabling datadog openmetrics method for consul server metrics scrape. conflict present with http vs https that breaks openemtrics scrape on consul

* correct otlp protocol helpers.tpl check to lower-case the protocol to match the open-telemetry-deployment.yaml behavior

* fix server-acl-init command_test.go for datadog token policy - datacenter should have been dc1

* add in server-statefulset bats test for extraConfig validation testing
@natemollica-nm natemollica-nm self-assigned this Feb 12, 2024
@natemollica-nm natemollica-nm added the pr/no-backport signals that a PR will not contain a backport label label Feb 12, 2024
Copy link
Contributor

@curtbushko curtbushko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Nate! It looks like changes from #3000 have leaked into your PR. Those should be removed as we thought they would be too disruptive for upgrades.

@@ -56,7 +57,12 @@ data:
"enabled": true
},
{{- end }}
"server": true
"server": true,
"leave_on_terminate": true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: These extra 'leave_on_terminate' and 'autopilot' settings should be removed as they were deemed destructive.

We need to check the other backports as anything from #3000 should not be in release/1.3.x, release/1.2.x and release/1.1.x (1.4.x is fine)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected as recommended by reverting back to release/1.3.x branch version of affected files.

$ git checkout 'release/1.3.x' -- charts/consul/templates/server-config-configmap.yaml

Re-applied datadog-integration changes into the following files:

  • charts/consul/templates/server-config-configmap.yaml
    • Reincorporated enable_debug into server.json (updates server-statefulset.yaml config-checksum)
    • Reapplied all datadog and agent metric-related entries into the telemetry-config.json
  • charts/consul/test/unit/server-statefulset.bats
    • Updated config-configmap tests to reflect enable_debug update to server.json config
      • "server/StatefulSet: adds config-checksum annotation when extraConfig is blank"
      • "server/StatefulSet: adds config-checksum annotation when extraConfig is provided"
      • "server/StatefulSet: adds config-checksum annotation when extraConfig is updated"

@@ -17,7 +17,7 @@ metadata:
release: {{ .Release.Name }}
component: server
spec:
maxUnavailable: {{ template "consul.pdb.maxUnavailable" . }}
maxUnavailable: {{ template "consul.server.pdb.maxUnavailable" . }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This is also from #3000 and should be dropped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected as recommended by reverting back to release/1.3.x branch version of affected files.

$ git checkout 'release/1.3.x' -- charts/consul/templates/server-disruptionbudget.yaml charts/consul/test/unit/server-disruptionbudget.bats charts/consul/template/_helpers.tpl

Applied datadog-integration changes back into _helpers.tpl

Re-ran entirety of bats tests using Makefile - make bats-tests (all passed)

*/}}
{{- define "consul.pdb.maxUnavailable" -}}
{{- define "consul.server.pdb.maxUnavailable" -}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: A bunch of changes from #3000 in here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected as recommended by reverting back to release/1.3.x branch version of affected files.

$ git checkout 'release/1.3.x' -- charts/consul/templates/server-disruptionbudget.yaml charts/consul/test/unit/server-disruptionbudget.bats charts/consul/template/_helpers.tpl

Applied datadog-integration changes back into _helpers.tpl

Re-ran entirety of bats tests using Makefile - make bats-tests (all passed)

@@ -348,7 +348,7 @@ load _helpers
[[ "$output" =~ "When the value global.experiments.resourceAPIs is set, global.peering.enabled is currently unsupported." ]]
}

@test "connectInject/Deployment: fails if resource-apis is set and admin partitions are enabled" {
@test "connectInject/Deployment: fails if resource-apis is set, v2tenancy is unset, and admin partitions are enabled" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Looks like extra stuff picked up. git checkout 'release/1.3.x' helpers.bats will allow you to reset the file to the branch it is from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected as recommended by reverting back to release/1.3.x branch version of affected files.

$ git checkout 'release/1.3.x' -- charts/consul/templates/server-disruptionbudget.yaml charts/consul/test/unit/server-disruptionbudget.bats charts/consul/template/_helpers.tpl

Applied datadog-integration changes back into _helpers.tpl

Re-ran entirety of bats tests using Makefile - make bats-tests (all passed)

@@ -97,7 +97,7 @@ load _helpers
--set 'server.replicas=6' \
. | tee /dev/stderr |
yq '.spec.maxUnavailable' | tee /dev/stderr)
[ "${actual}" = "2" ]
[ "${actual}" = "1" ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This file too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected as recommended by reverting back to release/1.3.x branch version of affected files.

$ git checkout 'release/1.3.x' -- charts/consul/templates/server-disruptionbudget.yaml charts/consul/test/unit/server-disruptionbudget.bats charts/consul/template/_helpers.tpl

Applied datadog-integration changes back into _helpers.tpl

Re-ran entirety of bats tests using Makefile - make bats-tests (all passed)

… re-apply datadog-integration branch changes
Copy link
Contributor

@curtbushko curtbushko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@natemollica-nm natemollica-nm merged commit 1283c12 into release/1.3.x Feb 13, 2024
46 checks passed
@natemollica-nm natemollica-nm deleted the backport/natemollica-nm/datadog-integration/manual-cherry-pick branch February 13, 2024 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/no-backport signals that a PR will not contain a backport label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants