Enterprise Search Stack Monitoring #114303

Merged
merged 66 commits into from
Dec 7, 2021

Conversation

kovyrin
Contributor

@kovyrin kovyrin commented Oct 7, 2021

Summary

This PR adds support for Enterprise Search to the Stack Monitoring plugin. It relies on the new metricbeat module we're shipping in 7.16 (already merged; there is also a PR to improve the metricsets), which will be integrated into the Enterprise Search solution by default (running as a sidecar, controlled via the solution config).

The code in this PR is based primarily on the patterns and style of the APM, Beats and Logstash modules, and we tried to keep the changes extremely contained to avoid conflicts with the de-angularization work that is ongoing within the plugin. The overview page has been built with React (hence the react flag being enabled within the PR; we'll remove it before merging) to align with the new direction for the monitoring plugin.

Our team is planning to support and keep developing this code going forward, and we're ready to make whatever changes are necessary to align it with the status quo followed by other parts of Stack Monitoring. If any help is needed with testing the changes, please let us know.

Event Structure

One thing of note in this PR is that Enterprise Search monitoring events do not have a cluster_uuid field at the root level, unlike all other events used by Stack Monitoring. The events are generated by metricbeat, and the elasticsearch metricbeat module has already added cluster_uuid to the global schema as an alias for its own field, so we cannot use the same approach. We also did not want to add the field to the global schema, since it is not compatible with ECS. Instead, we changed the Stack Monitoring logic for fetching time series so that a flag can be passed to skip the implicit cluster_uuid filter applied to all queries. You can see the changes in get_metrics.ts and get_series.ts.
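
To illustrate the idea of an opt-out flag for the implicit cluster_uuid filter, here is a minimal sketch. It is not the code from get_metrics.ts or get_series.ts: the function name `buildSeriesFilters`, the option name `skipClusterUuidFilter`, the filter types, and the module-level field name `enterprisesearch.cluster_uuid` are all assumptions made up for this example.

```typescript
// Hypothetical, simplified filter shapes (not Kibana's real types).
interface TermFilter {
  term: Record<string, string>;
}
interface RangeFilter {
  range: Record<string, { gte: number; lte: number; format: string }>;
}
type Filter = TermFilter | RangeFilter;

// Hypothetical options object; `skipClusterUuidFilter` stands in for the
// flag described above.
interface SeriesFilterOptions {
  clusterUuid: string;
  start: number; // epoch milliseconds
  end: number; // epoch milliseconds
  extraFilters?: Filter[];
  skipClusterUuidFilter?: boolean;
}

// Builds the filter clauses for a time series query. The root-level
// cluster_uuid term is added by default and left out only when the caller
// explicitly asks for that.
function buildSeriesFilters(opts: SeriesFilterOptions): Filter[] {
  const filters: Filter[] = [
    {
      range: {
        '@timestamp': { gte: opts.start, lte: opts.end, format: 'epoch_millis' },
      },
    },
    ...(opts.extraFilters ?? []),
  ];

  if (!opts.skipClusterUuidFilter) {
    filters.unshift({ term: { cluster_uuid: opts.clusterUuid } });
  }

  return filters;
}

// Enterprise Search style call: skip the root-level filter and scope the
// query with a module-level field instead (field name is an assumption).
const entSearchFilters = buildSeriesFilters({
  clusterUuid: 'my-cluster',
  start: 0,
  end: 60_000,
  skipClusterUuidFilter: true,
  extraFilters: [{ term: { 'enterprisesearch.cluster_uuid': 'my-cluster' } }],
});
console.log(JSON.stringify(entSearchFilters, null, 2));
```

Keeping the filter on by default means existing callers for other products behave as before, and only the Enterprise Search code path has to opt out.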

Feature Progress

  • Enterprise Search is present in the Stack Monitoring UI
    • Displays the metrics we want to ship for the initial release
    • Links to specific details pages
  • Basic metrics and stats fetching infrastructure is set up
  • React-based page created for the overview page:
    • Overview pane:
      • Total instances
      • Product usage metrics
    • Low-Level metrics pane:
      • HTTP metrics graphs
      • Memory usage graphs
      • Java threads graphs
    • Storage Metrics pane:
      • App Search Engines graph
      • Workplace Search Content Sources graph

Screenshots

Main page

[Screenshot: Stack Monitoring main page]

Enterprise Search Overview

[Screenshot: Enterprise Search overview page]

Contributor

@phillipb phillipb left a comment

Looks pretty good! Small tweaks.

@matschaffer matschaffer removed their request for review November 29, 2021 04:54
@matschaffer
Contributor

Guessing if we address @phillipb's concerns here we can merge this.

@elastic elastic deleted a comment from kibanamachine Dec 1, 2021
@JasonStoltz
Member

@elasticmachine merge upstream

Contributor

@phillipb phillipb left a comment

LGTM!

@phillipb
Contributor

phillipb commented Dec 7, 2021

@elasticmachine merge upstream

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules lead to a faster build time

id before after diff
monitoring 439 445 +6

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
monitoring 436.6KB 445.5KB +8.9KB

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
monitoring 23.6KB 23.6KB +58.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @kovyrin

@kovyrin kovyrin merged commit 7929123 into main Dec 7, 2021
@kovyrin kovyrin deleted the kovyrin/ent-search-monitoring branch December 7, 2021 15:11
@kibanamachine
Contributor

The following labels were identified as gaps in your version labels and will be added automatically:

  • v8.1.0

If any of these should not be on your pull request, please manually remove them.

kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Dec 7, 2021
* Added enterprise search panel, corrected queries

* Update the index pattern for Enterprise Search

* Typescript error ignore

* Our timestamp fields are called @timestamp (per ECS)

* Adjust Enterprise Search index patterns with the rest of monitoring plugin patterns (including CCS, etc)

* Initial implementation of the Enterprise Search overview panel (health only)

* Add a basic stub for enterprise search response fields

* Cleanup aggs configs

* Bring back a file deleted by mistake

* Started working on the overview page

* Correctly use heap_max as the total heap

* Ent search breadcrumbs

* Simple overview

* Allow the cluster_uuid filter to be skipped while fetching metrics

* Cleanup

* Switch to module-level uuid field and use both types of events

* Add stats-based product usage metrics + apply filter paths to reduce traffic

* Change the name of the ent search overview class

* Move the standalone cluster hack into the internal function

* Change the overview page to show product usage metrics + introduce enterprise search stats in addition to metrics (they are fetched differently and allow us to reuse the stats code we have for the main page panel)

* Cluster UUID is at the module level now

* Simplify ent search pages structure, only have one overview page

* Fix ent search icon

* Add total instances

* Product usage metric graphs

* Simplify metrics loading in the overview page since we load all metrics anyways

* Add more enterprise search overview metrics

* Avoid duplicate labels

* linting

* Revert "Simplify metrics loading in the overview page since we load all metrics anyways"

This reverts commit 4bd67ab.

* Switch to multiple timeseries per graph

* Reorder graphs and metrics for better experience

* Typescript fixes

* i18n fixes

* Added a couple more JVM metrics

* Completely covered JVM metrics

* Convert Enterprise Search component to Typescript

* Switch config setting back

* Remove the nodes link since it raises more questions than it solves

* Update jest snapshots with the new metrics

* Remove console statement

* Properly handle cases when aggregations return no data for Enterprise Search

* Add a functional test for the Enterprise search cluster list panel

* Add a functional test for Enterprise Search overview page

* Update multicluster API response fixture with the new enterprise search response key

* Default uptime value is 0

* update overview fixture

* More fixture updates

* Remove fixmes

* Fix imports

* Properly export type

* Maybe fix the type checking error

* PR Feedback

* TS fixes

Co-authored-by: cdelgado <carlos.delgado@elastic.co>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Jason Stoltzfus <jason.stoltzfus@elastic.co>
@kibanamachine
Contributor

💚 Backport successful

Status Branch Result
8.0

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Dec 7, 2021
Co-authored-by: Oleksiy Kovyrin <oleksiy@kovyrin.net>
Co-authored-by: cdelgado <carlos.delgado@elastic.co>
Co-authored-by: Jason Stoltzfus <jason.stoltzfus@elastic.co>
TinLe pushed a commit to TinLe/kibana that referenced this pull request Dec 22, 2021
Labels
auto-backport, Feature:Stack Monitoring, release_note:enhancement, release_note:feature, v8.0.0, v8.1.0