Use Telemetry Controller Under Test #307

Open
vlerenc opened this issue Sep 5, 2020 · 2 comments
Labels
area/monitoring: Monitoring (including availability monitoring and alerting) related
area/quality: Output qualification (tests, checks, scans, automation in general, etc.) related
kind/enhancement: Enhancement, improvement, extension
lifecycle/stale: Nobody worked on this for 6 months (will further age)
topology/shoot: Affects Shoot clusters

Comments

@vlerenc
Member

vlerenc commented Sep 5, 2020

What would you like to be added:
@dkistner has implemented a "telemetry controller" that keeps track of control plane availability. It would make sense to have it observe the state of clusters under reconciliation/maintenance/test and report this metric, so that we can alert on poor shoot cluster control plane availability and eventually break the release/transport if KPIs are not met. Or should this be part of the specific Gardener tests instead?
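
A minimal sketch of such a KPI gate, assuming availability is measured as the fraction of successful control plane probes in an observation window. `ProbeSample`, `availability`, `release_gate`, and the `0.999` default target are hypothetical names and values for illustration only; they are not the telemetry controller's actual API or the real SLO.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ProbeSample:
    """Hypothetical record of one control plane health probe."""
    timestamp: float  # seconds since epoch
    available: bool   # True if the control plane answered the probe


def availability(samples: Iterable[ProbeSample]) -> float:
    """Return the fraction of probes that succeeded.

    Raises ValueError on an empty window so that a missing
    measurement can never pass the gate silently.
    """
    window: List[ProbeSample] = list(samples)
    if not window:
        raise ValueError("no probe samples in observation window")
    succeeded = sum(1 for s in window if s.available)
    return succeeded / len(window)


def release_gate(samples: Iterable[ProbeSample], slo_target: float = 0.999) -> bool:
    """Return True if measured availability meets the (assumed) target."""
    return availability(samples) >= slo_target


if __name__ == "__main__":
    # Ten probes, one of which failed.
    window = [ProbeSample(float(i), i != 3) for i in range(10)]
    print(f"availability: {availability(window):.3f}")
    print("release may proceed:", release_gate(window, slo_target=0.95))
```

A pipeline step could call `release_gate` on the samples collected during reconciliation and fail the release/transport when it returns `False`.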

Why is this needed:
We sometimes miss issues here and lack recurring test results for this most important metric (it is the only metric relevant to our SLO).

@vlerenc vlerenc added area/monitoring Monitoring (including availability monitoring and alerting) related area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/enhancement Enhancement, improvement, extension topology/shoot Affects Shoot clusters component/tm Test machinery (tooling and processes) labels Sep 5, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Nov 5, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Sep 22, 2021
@dguendisch
Member

@schrodit do you recall what the issue was with the telemetry controller when you tried it a long time ago?

@schrodit

schrodit commented Nov 2, 2021

If I remember correctly, we tested it on dev for some time, also with metrics persisted in Elasticsearch.
But there were two issues:

  1. The metric was not really useful for a single test or a bunch of tests. This is because we only create a cluster, test some stuff, and then delete the cluster, which did not give useful metrics. We only had one useful test, where the k8s version upgrade is tested, but even there no one really looked at the metrics.
  2. The more useful metric would come from running it during a complete Gardener update. But this was not possible with the current Concourse + testmachinery implementation, as we would need to start the testmachinery before the actual deployment starts and end it after all shoots (or most of them) are reconciled.

@vlerenc vlerenc removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) component/tm Test machinery (tooling and processes) labels Apr 20, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 28, 2023