
Improve k8s dashboard #30913

Merged
merged 3 commits into from
Mar 22, 2022

Conversation

ChrsMark
Member

@ChrsMark ChrsMark commented Mar 18, 2022

What does this PR do?

This PR leverages elastic/kibana#126015 in order to solve elastic/integrations#2159

Why is it important?

At the moment the k8s Dashboard can show inconsistent values in the "Desired", "Available" and "Unavailable" Pods views. The reason is described at elastic/integrations#2159. With this change we aim to improve the situation so that the views show correct values by using the proper aggregations.
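For reference, the fix switches the affected TSVB panels to the "entire time range" mode added in elastic/kibana#126015. A minimal sketch of the relevant panel params is below; the exact field names are illustrative (based on the Metricbeat Kubernetes module), not copied from the dashboard JSON:

```json
{
  "params": {
    "time_range_mode": "entire_time_range",
    "series": [{
      "split_mode": "terms",
      "terms_field": "kubernetes.deployment.name",
      "metrics": [
        { "type": "max", "field": "kubernetes.deployment.replicas.desired" },
        { "type": "series_agg", "function": "sum" }
      ]
    }]
  }
}
```

This takes one value per deployment over the whole dashboard time range and sums them via the series agg, instead of aggregating raw documents across deployments.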

Related issues

Screenshots

Changing the time range of the dashboard used to give inconsistent values in the views. Attached below are the now-consistent views across different time ranges:
Screenshot 2022-03-18 at 12 13 17 PM
Screenshot 2022-03-18 at 12 13 13 PM
Screenshot 2022-03-18 at 12 13 07 PM

Screenshot 2022-03-18 at 12 09 41 PM

@ChrsMark ChrsMark added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Mar 18, 2022
@ChrsMark ChrsMark self-assigned this Mar 18, 2022
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 18, 2022
@mergify
Contributor

mergify bot commented Mar 18, 2022

This pull request does not have a backport label. Could you fix it @ChrsMark? 🙏
To fix this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 7.\d branch. \d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Mar 18, 2022
@ChrsMark ChrsMark added the backport-v8.2.0 Automated backport with mergify label Mar 18, 2022
@mergify mergify bot removed the backport-skip Skip notification from the automated backport with mergify label Mar 18, 2022
@elasticmachine
Collaborator

elasticmachine commented Mar 18, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-03-22T14:02:28.594+0000

  • Duration: 55 min 44 sec

Test stats 🧪

Test Results

  • Failed: 0
  • Passed: 3519
  • Skipped: 883
  • Total: 4402

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Signed-off-by: chrismark <chrismarkou92@gmail.com>
@ChrsMark ChrsMark force-pushed the fix_k8s_dashboard branch from 180045c to fa1dab6 Compare March 21, 2022 12:20
Signed-off-by: chrismark <chrismarkou92@gmail.com>
@@ -12,6 +12,7 @@
"params": {
"axis_formatter": "number",
"axis_position": "left",
"drop_last_bucket": 1,
Contributor

why is it needed?

@@ -40,47 +47,68 @@
"id": "2fe9d3b0-30d5-11e7-8df8-6d3604a72912",
"index_pattern": "metricbeat-*",
"interval": "auto",
"isModelInvalid": false,
Contributor

is this field part of Panel Option? could you add a screenshot for the page where this is defined?

Member Author

I cannot find it in the view tbh, @flash1293 do you have an answer handy for this?

@ChrsMark
Member Author

@tetianakravchenko it seems that the drop_last_bucket value is locked when choosing Entire time range:

Screenshot 2022-03-22 at 12 13 31 PM

Contributor

@tetianakravchenko tetianakravchenko left a comment


LGTM, just out of curiosity I would like to know where "isModelInvalid": false is defined in the UI

Signed-off-by: chrismark <chrismarkou92@gmail.com>
@flash1293

@ChrsMark @tetianakravchenko @kvch I thought a bit more about this and I'm not 100% sure anymore whether users would expect the "entire time range" mode for this one. AFAIK in most beats dashboards the metric numbers are meant to show the "current state" of the system, but by using entire time range with the series agg and multi field group by, it will take into account "stale" information as well.

This is the case I'm a bit worried about:

  • User has multiple deployments in multiple namespaces running, for each of them a new document is indexed every n seconds
  • They look at the dashboard, timerange configured to 15 minutes
  • Metric on the dashboard is consistent, no matter the time range (because it's only ever taking into account the last value per time series)
  • User deletes an unhealthy deployment which caused a lot of "desired pods"
  • User refreshes the dashboard after a minute or so
  • They expect the "desired pods" metric to drop (because the deployment was deleted after all)
  • The number stays the same, it only drops after 15 minutes because even though no new documents for the deleted deployment were indexed, there was still a "last value" within the last 15 minutes which was rolled up into the metric
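The first scenario can be reproduced with a small simulation. This is only a sketch of the rollup logic (last value per series, summed over the configured window), not the actual Elasticsearch query; document shape and field names are made up for illustration:

```python
from datetime import datetime, timedelta

def desired_pods(docs, range_start, range_end):
    """Sum the last 'desired' value per deployment whose documents fall
    inside [range_start, range_end] -- mimicking TSVB's 'entire time
    range' mode with a last-value metric and a series sum agg."""
    last = {}
    for ts, deployment, desired in sorted(docs):
        if range_start <= ts <= range_end:
            last[deployment] = desired  # later docs overwrite earlier ones
    return sum(last.values())

now = datetime(2022, 3, 22, 12, 0)
docs = [
    # healthy deployment reports every minute
    *[(now - timedelta(minutes=m), "web", 3) for m in range(15)],
    # unhealthy deployment was deleted 5 minutes ago: no docs since then
    *[(now - timedelta(minutes=m), "broken", 50) for m in range(5, 15)],
]

# 15-minute window: the deleted deployment still has a "last value" in range
print(desired_pods(docs, now - timedelta(minutes=15), now))  # 53
# 3-minute window: only the live deployment contributes
print(desired_pods(docs, now - timedelta(minutes=3), now))   # 3
```

The stale series keeps inflating the metric until its last document ages out of the time range, which is exactly the surprise described above.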

Another weird case:

  • They look at the dashboard, timerange configured to 15 minutes
  • They delete a deployment, come back to the dashboard after 30 minutes
  • Everything shows consistent, because the deleted deployment doesn't have a "last value" in the last 15 minutes
  • They increase the time range to two hours to check something
  • "Desired pods" metric goes up
  • User is confused - why are there more "desired pods" if I'm looking at more data in the past???

These cases could be improved by using "last value mode" with "drop last bucket" and an interval which is larger than the polling interval (let's say 2 minutes). By doing this, only the deployments which got documents indexed in the last 2 minutes will be rolled up into the final metric instead of all of them for the entire time range, resulting in a much faster reflection of K8s state changes at the expense of a more expensive query and some hidden magic (the 2 minutes are not visible to the user in any way while the entire time range is known).
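Under that alternative, the relevant TSVB params would look roughly like this (a sketch showing only the settings discussed here, alongside whatever else the panel already defines):

```json
{
  "params": {
    "time_range_mode": "last_value",
    "interval": "2m",
    "drop_last_bucket": 1
  }
}
```

With last value mode only the final bucket feeds the metric, the 2-minute interval keeps that bucket larger than the default polling period, and drop_last_bucket avoids reading a partially filled trailing bucket.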

I'm not sure about this, in the end it comes down to what the user expects this number is representing - the state of the system by the end of the configured time range or the state of the system "during" the whole time range. Your decision, just wanted to make this transparent.

@ChrsMark
Member Author

ChrsMark commented Mar 24, 2022

Thanks for your feedback @flash1293, I think that the goal of these visualisations is to show the "current" state of the cluster no matter what the time range is. So I would say that using "last value mode" with "drop last bucket" is a better option for this. I remember that during our chat you mentioned this option having some drawbacks, but I do not remember exactly what they were. If we agree that "last value mode" with "drop last bucket" is better for showing the current cluster status regardless of the time range, then I can move on and open a fixup PR.

I notice that you mention setting the interval to 2 mins, for example, so as to have something bigger than the polling interval, but what would the result be if a user set a bigger polling period?

@ChrsMark
Member Author

I made some checks again and it seems to work correctly with "last value mode" with "drop last bucket". So I'm +1 on changing this and documenting the detail about the interval, so that users are able to update it in case they have bigger collection periods.

@flash1293

flash1293 commented Mar 24, 2022

The downsides are:

  • More expensive data fetching (hard to tell how bad it is, y'all are more fit to have an opinion on this than me)

  • A magic constant ("2 minutes" or something like this) defining how long it takes to remove series that stopped reporting from the pool

@ChrsMark
Member Author

ChrsMark commented Mar 24, 2022

@flash1293 if the collection period is 5 minutes, does a user also need to increase the interval, or is that not a problem anymore?

@flash1293

@ChrsMark If the interval is smaller than the collection period, it is possible they won't see any data if they are unlucky.

@ChrsMark
Member Author

@ChrsMark If the interval is smaller than the collection period, it is possible they won't see any data if they are unlucky.

Cool, I will have it documented then -> #30986

kush-elastic pushed a commit to kush-elastic/beats that referenced this pull request May 2, 2022
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
Labels
backport-v8.2.0 Automated backport with mergify Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants