
Improve k8s dashboard #30913

Merged
merged 3 commits into from
Mar 22, 2022

Conversation

ChrsMark
Member

@ChrsMark ChrsMark commented Mar 18, 2022

What does this PR do?

This PR leverages elastic/kibana#126015 in order to solve elastic/integrations#2159

Why is it important?

At the moment the k8s Dashboard can show inconsistent values in the "Desired", "Available" and "Unavailable" Pods views. The reason is described at elastic/integrations#2159. With this change we aim to improve the situation so that the views show correct values by using the proper aggregations.
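For reference, the fix switches the affected TSVB panels to the "entire time range" mode added in elastic/kibana#126015. A minimal sketch of the relevant panel params is below; the exact field names are illustrative (based on the Metricbeat Kubernetes module), not copied from the dashboard JSON:

```json
{
  "params": {
    "time_range_mode": "entire_time_range",
    "series": [{
      "split_mode": "terms",
      "terms_field": "kubernetes.deployment.name",
      "metrics": [
        { "type": "max", "field": "kubernetes.deployment.replicas.desired" },
        { "type": "series_agg", "function": "sum" }
      ]
    }]
  }
}
```

This takes one value per deployment over the whole dashboard time range and sums them via the series agg, instead of aggregating raw documents across deployments.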

Related issues

Screenshots

Changing the time range of the dashboard used to give inconsistent values in the views. Attached below are the now-consistent views across different time ranges:
Screenshot 2022-03-18 at 12 13 17 PM
Screenshot 2022-03-18 at 12 13 13 PM
Screenshot 2022-03-18 at 12 13 07 PM

Screenshot 2022-03-18 at 12 09 41 PM

@ChrsMark ChrsMark added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Mar 18, 2022
@ChrsMark ChrsMark self-assigned this Mar 18, 2022
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 18, 2022
@mergify
Contributor

mergify bot commented Mar 18, 2022

This pull request does not have a backport label. Could you fix it @ChrsMark? 🙏
To fix this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 7.\d branch. \d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Mar 18, 2022
@ChrsMark ChrsMark added the backport-v8.2.0 Automated backport with mergify label Mar 18, 2022
@mergify mergify bot removed the backport-skip Skip notification from the automated backport with mergify label Mar 18, 2022
@elasticmachine
Collaborator

elasticmachine commented Mar 18, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-03-22T14:02:28.594+0000

  • Duration: 55 min 44 sec

Test stats 🧪

Test Results

  • Failed: 0
  • Passed: 3519
  • Skipped: 883
  • Total: 4402

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Signed-off-by: chrismark <chrismarkou92@gmail.com>
@ChrsMark ChrsMark force-pushed the fix_k8s_dashboard branch from 180045c to fa1dab6 Compare March 21, 2022 12:20
Signed-off-by: chrismark <chrismarkou92@gmail.com>
@@ -12,6 +12,7 @@
"params": {
"axis_formatter": "number",
"axis_position": "left",
"drop_last_bucket": 1,
Contributor

why is it needed?

@@ -40,47 +47,68 @@
"id": "2fe9d3b0-30d5-11e7-8df8-6d3604a72912",
"index_pattern": "metricbeat-*",
"interval": "auto",
"isModelInvalid": false,
Contributor

is this field part of Panel Option? could you add a screenshot for the page where this is defined?

Member Author

I cannot find it in the view tbh, @flash1293 do you have an answer handy for this?

@ChrsMark
Member Author

@tetianakravchenko it seems that the drop_last_bucket value is locked when choosing Entire time range:

Screenshot 2022-03-22 at 12 13 31 PM

Contributor

@tetianakravchenko tetianakravchenko left a comment


LGTM, just out of curiosity I would like to know where "isModelInvalid": false is defined in the UI

Signed-off-by: chrismark <chrismarkou92@gmail.com>
@flash1293

@ChrsMark @tetianakravchenko @kvch I thought a bit more about this and I'm not 100% sure anymore whether users would expect the "entire time range" mode for this one. AFAIK in most beats dashboards the metric numbers are meant to show the "current state" of the system, but by using entire time range with the series agg and multi field group by, it will take into account "stale" information as well.

This is the case I'm a bit worried about:

  • User has multiple deployments in multiple namespaces running, for each of them a new document is indexed every n seconds
  • They look at the dashboard, timerange configured to 15 minutes
  • Metric on the dashboard is consistent, no matter the time range (because it's only ever taking into account the last value per time series)
  • User deletes an unhealthy deployment which caused a lot of "desired pods"
  • User refreshes the dashboard after a minute or so
  • They expect the "desired pods" metric to drop (because the deployment was deleted after all)
  • The number stays the same, it only drops after 15 minutes because even though no new documents for the deleted deployment were indexed, there was still a "last value" within the last 15 minutes which was rolled up into the metric
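The first scenario can be reproduced with a small simulation. This is only a sketch of the rollup logic (last value per series, summed over the configured window), not the actual Elasticsearch query; document shape and field names are made up for illustration:

```python
from datetime import datetime, timedelta

def desired_pods(docs, range_start, range_end):
    """Sum the last 'desired' value per deployment whose documents fall
    inside [range_start, range_end] -- mimicking TSVB's 'entire time
    range' mode with a last-value metric and a series sum agg."""
    last = {}
    for ts, deployment, desired in sorted(docs):
        if range_start <= ts <= range_end:
            last[deployment] = desired  # later docs overwrite earlier ones
    return sum(last.values())

now = datetime(2022, 3, 22, 12, 0)
docs = [
    # healthy deployment reports every minute
    *[(now - timedelta(minutes=m), "web", 3) for m in range(15)],
    # unhealthy deployment was deleted 5 minutes ago: no docs since then
    *[(now - timedelta(minutes=m), "broken", 50) for m in range(5, 15)],
]

# 15-minute window: the deleted deployment still has a "last value" in range
print(desired_pods(docs, now - timedelta(minutes=15), now))  # 53
# 3-minute window: only the live deployment contributes
print(desired_pods(docs, now - timedelta(minutes=3), now))   # 3
```

The stale series keeps inflating the metric until its last document ages out of the time range, which is exactly the surprise described above.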

Another weird case:

  • They look at the dashboard, timerange configured to 15 minutes
  • They delete a deployment, come back to the dashboard after 30 minutes
  • Everything shows consistent, because the deleted deployment doesn't have a "last value" in the last 15 minutes
  • They increase the time range to two hours to check something
  • "Desired pods" metric goes up
  • User is confused - why are there more "desired pods" if I'm looking at more data in the past???

These cases could be improved by using "last value mode" with "drop last bucket" and an interval which is larger than the polling interval (let's say 2 minutes). By doing this, only the deployments which got documents indexed in the last 2 minutes will be rolled up into the final metric instead of all of them for the entire time range, resulting in a much faster reflection of K8s state changes at the expense of a more expensive query and some hidden magic (the 2 minutes are not visible to the user in any way while the entire time range is known).
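Under that alternative, the relevant TSVB params would look roughly like this (a sketch showing only the settings discussed here, alongside whatever else the panel already defines):

```json
{
  "params": {
    "time_range_mode": "last_value",
    "interval": "2m",
    "drop_last_bucket": 1
  }
}
```

With last value mode only the final bucket feeds the metric, the 2-minute interval keeps that bucket larger than the default polling period, and drop_last_bucket avoids reading a partially filled trailing bucket.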

I'm not sure about this, in the end it comes down to what the user expects this number is representing - the state of the system by the end of the configured time range or the state of the system "during" the whole time range. Your decision, just wanted to make this transparent.

@ChrsMark
Member Author

ChrsMark commented Mar 24, 2022

Thanks for your feedback @flash1293, I think that the goal of these visualisations is to show the "current" state of the cluster no matter what the time range is. So I would say that using "last value mode" with "drop last bucket" is a better option for this. I remember that during our chat you mentioned this option having some drawbacks, but I do not remember exactly what they were. If we agree that "last value mode" with "drop last bucket" is better for showing the current cluster status regardless of the time range, then I can move on and open a fixup PR.

I notice that you mention setting the interval to 2 mins, for example, so as to have something bigger than the polling interval, but what would the result be if a user set a bigger polling period?

@ChrsMark
Member Author

I made some checks again and it seems to work correctly with "last value mode" with "drop last bucket". So I'm +1 on changing this and documenting the detail about the interval, so that users are able to update it in case they have bigger collection periods.

@flash1293

flash1293 commented Mar 24, 2022

The downsides are:

  • More expensive data fetching (hard to tell how bad it is, y'all are more fit to have an opinion on this than me)

  • A magic constant ("2 minutes" or something like this) defining how long it takes to remove series that stopped reporting from the pool

@ChrsMark
Member Author

ChrsMark commented Mar 24, 2022

@flash1293 if the collection period is 5 minutes, does a user also need to increase the interval, or is that not a problem anymore?

@flash1293

@ChrsMark If the interval is smaller than the collection period, it is possible they won't see any data if they are unlucky.

@ChrsMark
Member Author

@ChrsMark If the interval is smaller than the collection period, it is possible they won't see any data if they are unlucky.

Cool, I will have it documented then -> #30986

kush-elastic pushed a commit to kush-elastic/beats that referenced this pull request May 2, 2022
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
Labels
backport-v8.2.0 Automated backport with mergify Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants