Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to add 2 metrics to external-dns. #2338

Merged
merged 1 commit into from
Oct 28, 2021

Conversation

adityask2
Copy link
Contributor

@adityask2 adityask2 commented Sep 30, 2021

Added new metric for external-dns -

  1. external_dns_controller_verified_records - No of DNS A-records that exists both in source and registry - Gauge Metric

Description

We have identified a problem with the metrics published by external-dns

Registry endpoints total - Based on the DNS provider, the count for each DNS name can change. For instance, CoreDNS provider publishes both count of A-Record and TXT-Record for a given dns name, resulting in 2 entries per dns name

external-dns-controller-verified-records - This metric displays the count of DNS records that are verified between source and registry in external dns i.e. check for every DNS entry in external-dns check the corresponding entry exists in registry.

Checklist

  • Unit tests updated
  • End user documentation updated

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 30, 2021
@k8s-ci-robot
Copy link
Contributor

Welcome @adityask2!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 30, 2021
@adityask2
Copy link
Contributor Author

Adding @njuettner for review.

@Raffo
Copy link
Contributor

Raffo commented Oct 6, 2021

Registry endpoints total - Based on the DNS provider, the count for each DNS name can change. For instance, CoreDNS provider publishes both count of A-Record and TXT-Record for a given dns name, resulting in 2 entries per dns name
Registry/Source errors total - this is an incremental counter of the number of errors in reading/registering the DNS names. This metric does not tell if there is an error in external-dns at a given point in time

I think you can monitor the difference and alert on that if you want to have a signal of an error, you don't need a sort of "boolean" to see if there is an error.

external-dns-controller-verified-records - This metric displays the count of DNS records that are verified between source and registry in external dns i.e. check for every DNS entry in external-dns check the corresponding entry exists in registry.

Can you tell us in which situation such a metric would be useful? What value is adding?

@adityask2
Copy link
Contributor Author

adityask2 commented Oct 6, 2021

Registry endpoints total - Based on the DNS provider, the count for each DNS name can change. For instance, CoreDNS provider publishes both count of A-Record and TXT-Record for a given dns name, resulting in 2 entries per dns name
Registry/Source errors total - this is an incremental counter of the number of errors in reading/registering the DNS names. This metric does not tell if there is an error in external-dns at a given point in time

I think you can monitor the difference and alert on that if you want to have a signal of an error, you don't need a sort of "boolean" to see if there is an error.

Consider this scenario - lets say there were 3 errors before and the system recovered. If I start monitoring now I'm going to see the error count as 3 there is no way of knowing if the system is still in error state because of which I'm seeing the count 3 or has the system recovered. At any point in time i would want to know if my system is currently in error state or no. To determine that based on past events doesn't seem right

external-dns-controller-verified-records - This metric displays the count of DNS records that are verified between source and registry in external dns i.e. check for every DNS entry in external-dns check the corresponding entry exists in registry.

Can you tell us in which situation such a metric would be useful? What value is adding?

We need to know the exact number of A records in external-dns that are synced with the DNS provider. We have observed that the number of registry records is implemented differently by each DNS provider plugin. For instance coredns plugin adds both A-records and TXT records into the list of registry entries while infloblox add 3 records.
This results in each record of source having two records in registry in case of coredns and three records in case of infloblox. To avoid this confusion we want to compare the number of A-records in source to number of A-records in registry. we achieve this using the metric

@seanmalloy
Copy link
Member

/cc

@LappleApple
Copy link
Contributor

Heya @Raffo and @njuettner, wondering if you have thoughts on @adityask2's comment above and how we might be able to move this forward. Thanks!

@Raffo
Copy link
Contributor

Raffo commented Oct 25, 2021

Yeah sorry that has slipped through the cracks, we have quite the backlog and none of the time to go through it without any funding.

Anyway, me and @njuettner chatted briefly about this.

Consider this scenario - lets say there were 3 errors before and the system recovered. If I start monitoring now I'm going to see the error count as 3 there is no way of knowing if the system is still in error state because of which I'm seeing the count 3 or has the system recovered. At any point in time i would want to know if my system is currently in error state or no. To determine that based on past events doesn't seem right

I disagree with the need for a "boolean". I think that most metric systems would allow you the extract that information from existing metrics and the "boolean" logic is really an implementation detail. But I might be wrong! Do you have an example of a system in the kubernetes ecosystem where something like this is implemented?

We need to know the exact number of A records in external-dns that are synced with the DNS provider. We have observed that the number of registry records is implemented differently by each DNS provider plugin. For instance coredns plugin adds both A-records and TXT records into the list of registry entries while infloblox add 3 records.
This results in each record of source having two records in registry in case of coredns and three records in case of infloblox. To avoid this confusion we want to compare the number of A-records in source to number of A-records in registry. we achieve this using the metric

this makes total sense to me, thanks for clarifying.

@adityask2
Copy link
Contributor Author

@Raffo Thnx for the review.

Unfortunately, I don't have the details at the present moment on the existing system having a similar implementation with respect to external_dns_source_errors_total. We can take it as part of different PR.

Meanwhile, I will modify the PR to contain the external_dns_controller_verified_records metric.

Copy link
Contributor

@Raffo Raffo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adityask2 I added a few comments. It would also be nice if you could add more tests for when the metrics are showing a problem, but I'm not sure that would be too hard.

controller/controller.go Outdated Show resolved Hide resolved
controller/controller.go Outdated Show resolved Hide resolved
controller/controller.go Outdated Show resolved Hide resolved
controller/controller.go Outdated Show resolved Hide resolved
controller/controller.go Outdated Show resolved Hide resolved
controller/controller.go Outdated Show resolved Hide resolved
1. external_dns_controller_verified_records - No of DNS A-records that exists both in source and registry - Gauge Metric
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 27, 2021
@adityask2
Copy link
Contributor Author

@adityask2 I added a few comments. It would also be nice if you could add more tests for when the metrics are showing a problem, but I'm not sure that would be too hard.

@Raffo Thnx for the quick review. Very much appreciated. :)
I have addressed all the comments. I have added few extra test cases, covering the scenarios where the registry and endpoint records do not match.

@Raffo
Copy link
Contributor

Raffo commented Oct 28, 2021

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 28, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adityask2, Raffo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 28, 2021
@k8s-ci-robot k8s-ci-robot merged commit 4c66715 into kubernetes-sigs:master Oct 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants