RFC: Health Report API #16056
Comments
Really interesting and complete proposal. Have only one concern, the severity ordering:
As a user I would expect that severity follows the natural order, so 1 is low and 10 is high.
I too was surprised by this, but this is as-defined and implemented in Elasticsearch's corresponding endpoint, and corresponds to the severity in support systems where a "Sev 1" is of utmost importance. The goal here is to follow their prior art wherever possible.
This is a thoughtful proposal. I have two questions regarding the schema.
It really shows the amount of effort put into this; it's much appreciated.
b) A concern I have is that the wider and deeper the indicator tree grows, the more sensitive the top-level status becomes to small perturbations. A suggestion is that we could allow the incubation of indicators: these would be visible in the health report but not yet bubble up their contribution to the parent indicator. A way to implement this would be to mark its impact to the
c) Not even behind a
This is awesome, and I want to prioritize it as soon as possible. For Phase 1, do you have a high-level sense of the amount of development time required? I think memory pressure is more similar to the ES indicator level than resources, and it will give us more options for properties of memory pressure. I understand the desire to have pipeline-level information, but a more appropriate indicator at the LS health API level might be pipeline_flow with individual pipelines as properties and pipeline issues as properties.
I was a little split on this, but ended up breaking
This is directly pulled from the Elasticsearch health report's schema, and the difference is still a little vague to me. In my mental model, a
I've added one, immediately below the schema.
I see the concern, and think we can handle this in two ways:
I hope that this is not needed, but I believe the implementation can defer it until it is needed. One way would be to have the
We can have many probes that contribute to degrading the status of the
The Elasticsearch indicators map to either whole cluster-wide systems (
In shaping the top-level indicators to be
This is awesome work. I can see a lot of utility here, some potential uses for our dashboards, and maybe even autoscaling in ECK.
Absolutely. I believe that since
Yes. Elasticsearch's comparable API uses
Worth noting, Elasticsearch's version of this is also documented as a way of avoiding calculating the
This dovetails into some of @jsvd's concerns, and I think we have a couple of options on the table that we can decide on if-and-when we get closer to seeing how it plays out.
I think that the probes themselves are an implementation detail, but that they should be able to produce a diagnosis that leads us back to the specifics. The Elasticsearch API's
Absolutely. The existing metrics API can be extremely verbose, and I am wary of continuing to add general noise to it without a plan to also deprecate some of the noise that is already there. I imagine these numbers becoming available as part of the
As: a person responsible for ensuring Logstash pipelines are running without issue
I want: a health report similar to the one provided by Elasticsearch
So that: I can easily identify issues with the health of a Logstash process and its pipelines
Phase 1: API & Initial Indicators
This is a plan to deliver a `GET /_health_report` endpoint that is heavily inspired by the shape of Elasticsearch's endpoint of the same name, adhering to its prior art wherever possible.

In Elasticsearch, the health report endpoint presents a flat collection of `indicators` and a `status` that reflects the status of the least-healthy indicator in the collection. Each indicator has its own `status` and optional `symptom`, and may optionally provide information about the `impacts` of a less-than-healthy status, one or more `diagnosis` with information about the cause, actions-to-remediate, and links to help, and/or details relevant to that particular indicator.

Because many aspects of Logstash processing and the actionable insights required by operational administrators are pipeline-centric, we will introduce a top-level indicator called `pipelines` that will have a mapping of sub-`indicators`, one for each pipeline. Like the top-level `#/status`, the value of `#/indicators/pipelines/status` will bubble up the least-healthy status of its component indicators.

The Logstash agent will maintain a pipeline-indicator for each pipeline that the agent knows about, and each pipeline-indicator will have one or more probes that are capable of marking the indicator as unhealthy with a `diagnosis` and optional `impacts`.

Proposed Schema:
Click to expand Schema
Click to expand Example
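To make the shape concrete, here is a rough sketch of a report for a node where one pipeline's persisted queue is growing. It only illustrates the structure described above (a top-level `status`, an `indicators` map, a `pipelines` indicator with per-pipeline sub-indicators, and `diagnosis`/`impacts` entries); the specific field names inside `diagnosis` and `impacts`, the pipeline names, and the severity value are illustrative assumptions rather than part of the proposal, and the collapsed schema above is authoritative.

```json
{
  "status": "yellow",
  "indicators": {
    "resources": {
      "status": "green"
    },
    "pipelines": {
      "status": "yellow",
      "symptom": "1 of 2 pipelines is degraded",
      "indicators": {
        "main": {
          "status": "yellow",
          "symptom": "the persisted queue has had net growth over the last 15 minutes",
          "diagnosis": [
            {
              "cause": "persistent back-pressure from one or more outputs",
              "action": "check the health of downstream services and the output plugin configuration",
              "help_url": "https://..."
            }
          ],
          "impacts": [
            {
              "severity": 2,
              "description": "events are delayed before reaching their destinations"
            }
          ]
        },
        "ingest": {
          "status": "green"
        }
      }
    }
  }
}
```

Like the other monitoring APIs, it could be queried with something like `curl -XGET 'localhost:9600/_health_report?pretty'` (assuming the default API port).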
Internally:
- the `status` of any API response that includes it reflects the same value as one running the `GET /_health_report`, including `GET /_node` and `GET /_node_stats`.

In the first stage we will introduce the `GET /_health_report` endpoint itself with the following indicators and probes:

- `#/indicators/resources` resources:
  - `memory_pressure`: `last_1_minute` window
- `#/indicators/pipelines/indicators/<pipeline-id>`: pipelines:
  - `up`:

Phase 2: Additional pipeline probes
In subsequent stages we will introduce additional probes to the pipeline indicators to allow them to diagnose the pipeline's behavior from its flow state. Each probe will feed off of the flow metrics for the pipeline, and will present pipeline-specific settings in the `pipeline.health.probe.<probe-name>` namespace for configuration.

For example, a probe `queue_persisted_growth_events` that inspects the `queue_persisted_growth_events` flow metric would have default settings like:
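As a sketch of what such defaults might look like in `logstash.yml` (apart from the `pipeline.health.probe.<probe-name>` namespace and the `enabled` flag discussed under the split options below, every setting name, window, and threshold here is a hypothetical placeholder):

```yaml
# Hypothetical defaults for a flow-based PQ growth probe; only the
# pipeline.health.probe.* namespace and the `enabled` flag come from this
# proposal, the window names and escalation levels are illustrative.
pipeline.health.probe.queue_persisted_growth_events.enabled: true
# flag the pipeline as degraded when the PQ shows net growth over this window
pipeline.health.probe.queue_persisted_growth_events.degraded.window: last_5_minutes
# escalate to critical when net growth persists over this longer window
pipeline.health.probe.queue_persisted_growth_events.critical.window: last_15_minutes
```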
Or a `worker_utilization` probe that inspects the `worker_utilization` flow metric to report issues if the workers are fully-utilized:
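Again only as a sketch (setting names and thresholds are assumptions, not part of the proposal):

```yaml
# Hypothetical defaults for a worker-utilization probe; values are illustrative.
pipeline.health.probe.worker_utilization.enabled: true
# consider the workers "fully utilized" at or above this percentage
pipeline.health.probe.worker_utilization.threshold_percent: 99.9
# degraded when fully utilized over the shorter window,
# critical when it persists over the longer window
pipeline.health.probe.worker_utilization.degraded.window: last_5_minutes
pipeline.health.probe.worker_utilization.critical.window: last_15_minutes
```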
Split options:
If making these probes configurable adds substantial delay, then we can ship them hard-coded with only the `enabled` option, and split the configurability off into a separate effort.

Phase 3: Observing recovery in critical probes
With flow metrics, it is possible to differentiate active-critical situations from ones in active recovery. For example, a PQ having net-growth over the last 15 minutes may be a critical situation, but if we observe that we also have net-shrink over the last 5 minutes the situation isn't as dire, so it (a) shouldn't push the indicator into the red and (b) is capable of producing different diagnostic output.
At a future point we can add the concept of `recovery` to the flow-based probe prototype. When a probe tests positive for `critical`, we could also test its `recovery` to present an appropriate result.
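As a sketch of how this could surface in a probe's settings, mirroring the 15-minute/5-minute example above (neither the setting names nor the windows are part of the proposal yet):

```yaml
# Hypothetical: pair the critical test with a recovery test. Net PQ growth
# over the last 15 minutes trips the critical condition, but net shrink over
# the last 5 minutes reports the pipeline as recovering instead of pushing
# the indicator fully into the red.
pipeline.health.probe.queue_persisted_growth_events.critical.window: last_15_minutes
pipeline.health.probe.queue_persisted_growth_events.recovery.window: last_5_minutes
```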