[Response Ops][Alerting] Research best practices for bootstrapping alerts as data indices #141146

ymao1 · 2022-09-20T17:35:16Z

The alerting framework plans to start persisting alerts-as-data at a framework level using a single .alerts-default index. This index will need to be created on startup with an ILM policy, component templates and aliases. We will also need the ability to update the index mappings with each version. We are currently doing similar things with the event log index and the rule registry alert indices, both of which have run into various issues, large and small. We should consolidate some of our learnings from installing those indices into best practices in order to apply them to the new .alerts-default index.


elasticmachine · 2022-09-20T17:35:18Z

Pinging @elastic/response-ops (Team:ResponseOps)

mikecote · 2022-09-22T16:13:37Z

Linking with #111152.

ymao1 · 2022-10-12T13:37:27Z

Resources that need to be installed for framework alerts-as-data (FAAD??)

ILM policy
Component templates
Index template
Concrete write index

Summary of existing resource installation behavior

ILM Policy

Both the Event Log and Rule Registry plugins use a pretty standard ILM policy with a hot phase that rolls over at 50GB or 30 days. Event Log policy dictates a delete after 90 days and Rule Registry has no deletion phase. For framework alerts-as-data, we will likely want to keep the data around indefinitely as well. We should ensure we add _meta.managed:true to the policy to notify users that this is a managed policy and shouldn't be tinkered with.
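To make the difference between the two existing policies concrete, here is a minimal sketch. The `IlmPolicy` type and `buildIlmPolicy` helper are hypothetical names for illustration only, and the rollover condition names are copied from the suggested policy later in this comment, so the real Event Log and Rule Registry policies may spell them differently.

```typescript
// Hypothetical type and helper for illustration; not the plugins' actual code.
interface IlmPolicy {
  _meta: { managed: boolean };
  phases: {
    hot: { actions: { rollover: { max_age: string; max_primary_shard_size: string } } };
    delete?: { min_age: string; actions: { delete: Record<string, never> } };
  };
}

// Builds a hot-phase policy; a delete phase is added only when deleteAfter is given.
function buildIlmPolicy(deleteAfter?: string): IlmPolicy {
  const policy: IlmPolicy = {
    _meta: { managed: true }, // flags the policy as managed
    phases: {
      hot: { actions: { rollover: { max_age: '30d', max_primary_shard_size: '50gb' } } },
    },
  };
  if (deleteAfter !== undefined) {
    policy.phases.delete = { min_age: deleteAfter, actions: { delete: {} } };
  }
  return policy;
}

// Event Log style: delete after 90 days. Rule Registry style: keep indefinitely.
const eventLogStylePolicy = buildIlmPolicy('90d');
const ruleRegistryStylePolicy = buildIlmPolicy();
```

For framework alerts-as-data, the second form (no delete phase) matches the "keep the data around indefinitely" preference above.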

Component templates

Event log does not use component templates. Rule registry installs 2 common component templates at plugin setup (ECS fields and the "technical field map") and then installs solution specific component templates as needed (on write). Framework alerts as data should use component templates and it makes sense to use two, one for ECS (that is auto-generated from the latest ECS fieldset) and one for the default alerts-as-data schema.

Index templates

Event Log
Event log creates a new index template with every Kibana version upgrade (version included in template name). This allows for index mapping updates between versions but there is no check to ensure that field mappings remain compatible between versions. If there were ever mapping clashes between versions, queries could be written to take the version into account.

Rule Registry
Rule registry installs solution-specific index templates as needed (on write) that include solution + namespace in the template name. Security creates an index template per namespace while Observability only creates a single template. The current Kibana version is written to the index template metadata. Only additive mapping updates are allowed to the index templates. To verify that a template update is compatible, the template is simulated first. If there are errors during the simulation (possibly due to non-additive changes to the mapping), an error is thrown and rule registry writes are disabled for that index.
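As a rough illustration of what "additive only" means, here is a local check over flattened field maps. This is not what the rule registry does (it relies on template simulation in Elasticsearch); `FieldMap` and `findNonAdditiveChanges` are hypothetical, and treating a removed field as non-additive is an assumption of this sketch.

```typescript
// Hypothetical flattened mapping: field path -> field type.
type FieldMap = Record<string, string>;

// Returns the reasons a change is NOT additive; an empty array means fields were only added.
function findNonAdditiveChanges(current: FieldMap, next: FieldMap): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(current)) {
    if (!(field in next)) {
      problems.push(`${field}: removed`);
    } else if (next[field] !== current[field]) {
      problems.push(`${field}: type changed from ${current[field]} to ${next[field]}`);
    }
  }
  return problems;
}

// Field names below are illustrative only.
const v1: FieldMap = { '@timestamp': 'date', 'kibana.alert.status': 'keyword' };
const additive = findNonAdditiveChanges(v1, { ...v1, 'kibana.alert.flapping': 'boolean' });
const breaking = findNonAdditiveChanges(v1, { ...v1, 'kibana.alert.status': 'text' });
```

Here `additive` is empty, while `breaking` reports the changed type of `kibana.alert.status`.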

Concrete write indices

Event Log
Event log creates a concrete write index for the versioned index template (version included in the index name).

Rule Registry
Rule registry creates a concrete write index as needed (on write) that includes solution + namespace in the index name. When index mappings are updated between versions, the mappings for all concrete indices (as retrieved via alias) are also updated, provided there are no errors during template simulation. If there are errors during template simulation (as can happen when partial snapshot indices exist), the mapping is not updated for the concrete index.

Suggested Framework Alerts-as-Data resource installation behavior

ILM Policy

ILM policy should follow the rule registry with a hot phase and no delete phase. Ensure that _meta.managed:true is set, to surface the managed policy warning to users and dissuade them from modifying the policy.

{
  _meta: {
    managed: true,
  },
  phases: {
    hot: {
      actions: {
        rollover: {
          max_age: '30d',
          max_primary_shard_size: '50gb',
        },
      },
    },
  },
}

Component templates

Framework alerts as data should only need two component templates: an ECS component template and an alerts-as-data component template. The ECS component template should be auto-generated from ECS with a script, similar to what the event log does, except we need to pull over all ECS fields. When creating an index template composed of these two component templates, we should list the ECS component template last. Component templates are merged in the order listed, with later ones overriding earlier ones, so this guarantees we use the official ECS mappings even if we define a field with the same name in the alerts-as-data component template.
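The ordering argument can be sketched with a small merge over flattened field maps. The template names, the `ComponentTemplate` shape and the `composeFields` helper are all hypothetical; real composition happens inside Elasticsearch and works on full mapping objects, not flat maps.

```typescript
// Hypothetical component template: a name plus a flattened field -> type map.
interface ComponentTemplate {
  name: string;
  fields: Record<string, string>;
}

// Mimics composition order: templates are applied in sequence, later entries win on conflict.
function composeFields(templates: ComponentTemplate[]): Record<string, string> {
  let merged: Record<string, string> = {};
  for (const template of templates) {
    merged = { ...merged, ...template.fields };
  }
  return merged;
}

const alertsTemplate: ComponentTemplate = {
  name: 'alerts-default-component-template', // hypothetical name
  // 'event.kind' is deliberately mis-typed here to show the clash being resolved
  fields: { 'kibana.alert.status': 'keyword', 'event.kind': 'text' },
};
const ecsTemplate: ComponentTemplate = {
  name: 'alerts-ecs-component-template', // hypothetical name
  fields: { '@timestamp': 'date', 'event.kind': 'keyword' },
};

// ECS goes last so its definitions take precedence.
const composedOf = [alertsTemplate, ecsTemplate];
const effectiveFields = composeFields(composedOf);
```

With this order `effectiveFields['event.kind']` resolves to the ECS type; reversing `composedOf` would let the alerts-as-data definition win instead.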

Index template and concrete write index

We should ensure these settings are in the index template:

hidden: true
auto_expand_replicas: '0-1'
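A sketch of an index template body carrying these settings might look like the following. The `buildIndexTemplate` helper and every name passed to it are placeholders, and the lifecycle settings are included only to show where the ILM policy would be attached.

```typescript
// Hypothetical helper; alias, policy and component template names are placeholders.
function buildIndexTemplate(alias: string, ilmPolicyName: string, componentTemplates: string[]) {
  return {
    index_patterns: [`${alias}-*`],
    composed_of: componentTemplates,
    template: {
      settings: {
        'index.hidden': true, // keep the index out of wildcard searches by default
        'index.auto_expand_replicas': '0-1', // 0 replicas on one node, 1 otherwise
        'index.lifecycle.name': ilmPolicyName,
        'index.lifecycle.rollover_alias': alias,
      },
    },
  };
}

const indexTemplateBody = buildIndexTemplate('.alerts-default', 'alerts-default-ilm-policy', [
  'alerts-default-component-template',
  'alerts-ecs-component-template',
]);
```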

Our index strategy will depend on whether we want to allow non-additive schema changes between versions:

Only additive changes
If we decide that only additive changes will be allowed to the schema, we can re-use the rule registry strategy where mappings in the index template are updated and the mappings of every backing index are updated on schema change. Because we would be installing our resources at plugin setup vs on-demand, it should be easier to catch errors (mapping conflicts, disallowed mapping changes) during development. We would not need to roll over the backing indices manually since we are directly updating the concrete index mappings, so the ILM policy would handle rolling over at 30 days or 50 GB.
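The additive strategy boils down to "update the template, then update every backing index". A minimal sketch, assuming a hypothetical `MappingClient` wrapper (a real implementation would sit on top of the Elasticsearch client, whose calls are not reproduced here):

```typescript
// Hypothetical minimal client interface; not the Elasticsearch client API.
interface MappingClient {
  putIndexTemplate(name: string, fields: Record<string, string>): Promise<void>;
  getIndicesForAlias(alias: string): Promise<string[]>;
  putMapping(index: string, fields: Record<string, string>): Promise<void>;
}

// Updates the template first, then each backing index.
// Returns the indices whose mapping update failed so the caller can log or retry them.
async function applyAdditiveMappings(
  client: MappingClient,
  templateName: string,
  alias: string,
  fields: Record<string, string>
): Promise<string[]> {
  await client.putIndexTemplate(templateName, fields);
  const failed: string[] = [];
  for (const index of await client.getIndicesForAlias(alias)) {
    try {
      await client.putMapping(index, fields);
    } catch {
      failed.push(index); // keep going so one bad index does not block the rest
    }
  }
  return failed;
}
```

Collecting failures instead of throwing on the first one is a design choice of this sketch; whether a failed backing index should disable writes is a separate decision.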

Non-additive changes allowed
Because we're reducing the complexity of the index template and index initialization by allowing just a single alerts-as-data index, and because the framework will be in charge of the schema (vs the decentralized schema generation of the rule registry), we should have more freedom to allow non-additive schema changes to the index mapping between version upgrades. Although we are not using data streams because we want mutable documents, we can follow the general data stream guidance when it comes to mapping updates, which is: "If you need to change the mapping of an existing field, create a new data stream and reindex your data into it." I think we can use this guidance to make the framework alerts-as-data paradigm more similar to the Event Log, where we create a new index template and concrete write index per version. This would allow us to make non-additive schema changes between versions, with the ability to target queries to specific versions if necessary while still querying across all alerts-as-data indices.

This strategy does have the downside of creating more indices, one for each version, which, as discussed elsewhere, puts more pressure on the heap when 1000+ fields are mapped. We have also seen with the event log this issue where the ILM policy continues to roll over empty indices, although that seems to have been addressed recently by this PR. We should consider this additional impact to heap size when deciding whether to allow non-additive schema changes.
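For the per-version option, resource names could be derived roughly as follows. The `versionedResources` helper and the name formats (suffixes, separators) are assumptions for illustration, not the Event Log's actual naming code.

```typescript
// Hypothetical naming helper; prefix and suffix formats are assumptions.
interface VersionedResources {
  indexTemplateName: string;
  alias: string;
  initialWriteIndex: string;
  queryPattern: string;
}

function versionedResources(baseName: string, kibanaVersion: string): VersionedResources {
  const alias = `${baseName}-${kibanaVersion}`;
  return {
    indexTemplateName: `${alias}-template`,
    alias, // version-specific alias, usable to target one version
    initialWriteIndex: `${alias}-000001`,
    queryPattern: `${baseName}-*`, // matches the indices of every version
  };
}

const v850 = versionedResources('.alerts-default', '8.5.0');
const v860 = versionedResources('.alerts-default', '8.6.0');
```

Both versions share the same `queryPattern`, which is what allows querying across all versions while each keeps its own template and write index.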

Retry

Various SDHs have shown issues that crop up when the expected resources are not installed due to external (usually ES-related) errors. The rule registry currently doesn't contain much retry logic, but we've recently added retry logic to event log initialization that we should look into reusing. We can also look at the retry logic used by the Fleet plugin, which specifically looks for transient ES errors.
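A retry wrapper along these lines could look like the sketch below. It is not the event log or Fleet implementation: the list of status codes treated as transient, the `statusCode` property on errors, and the backoff parameters are all assumptions.

```typescript
// Assumed set of retryable HTTP statuses; the real list should come from the code being reused.
const TRANSIENT_STATUS_CODES = [408, 429, 502, 503, 504];

// Assumes errors expose a numeric `statusCode`; anything else is treated as non-transient.
function isTransientError(err: unknown): boolean {
  const status = (err as { statusCode?: number } | null | undefined)?.statusCode;
  return status !== undefined && TRANSIENT_STATUS_CODES.indexOf(status) !== -1;
}

// Retries `operation` on transient errors with exponential backoff; rethrows everything else.
async function retryTransient<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => {
      setTimeout(resolve, ms);
    })
): Promise<T> {
  let attempt = 1;
  while (true) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxAttempts || !isTransientError(err)) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1)); // delay doubles after each failed attempt
      attempt++;
    }
  }
}
```

Each installation step (ILM policy, component templates, index template, write index) could then be wrapped individually, for example `retryTransient(() => installIlmPolicy())`, where `installIlmPolicy` is a hypothetical installer function.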

mikecote · 2022-10-14T12:19:26Z

Great research! There's a lot learned that we'll be able to re-use for the FAAD indices!

> If we decide that only additive changes will be allowed to the schema, we can re-use the rule registry strategy where mappings in the index template are updated and the mappings of every backing index are updated on schema change.

Yeah, I could see this approach being beneficial for the scenarios where we mutate alerts-as-data documents. For example, if we add new workflow timestamps (#141464), the update operation could fail if a document gets updated and the corresponding index doesn't have the latest mappings (assuming we're in strict mode).
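The failure mode can be illustrated with a simplified stand-in for strict dynamic mapping, where any field unknown to the index causes the write to be rejected. The `unmappedFields` helper and the workflow timestamp field name are hypothetical.

```typescript
// Simplified stand-in for strict dynamic mapping: list the document fields the index does not map.
// A non-empty result corresponds to a rejected write.
function unmappedFields(mappedFields: string[], doc: Record<string, unknown>): string[] {
  return Object.keys(doc).filter((field) => mappedFields.indexOf(field) === -1);
}

const oldIndexFields = ['@timestamp', 'kibana.alert.status'];
// Hypothetical new field added in a later version
const newIndexFields = [...oldIndexFields, 'kibana.alert.workflow_status_updated_at'];

const partialUpdate = { 'kibana.alert.workflow_status_updated_at': '2022-10-14T12:19:26Z' };

// An older backing index that never received the new mapping rejects the update...
const rejectedByOldIndex = unmappedFields(oldIndexFields, partialUpdate);
// ...while an index with up-to-date mappings accepts it.
const rejectedByNewIndex = unmappedFields(newIndexFields, partialUpdate);
```

This is why updating the mappings of every backing index matters when documents in older indices can still be mutated.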

ymao1 · 2022-10-17T16:38:01Z

Closing as the research is complete; it can be referenced when we implement index bootstrapping for framework alerts as data.

ymao1 added the Feature:Alerting, Team:ResponseOps, and research labels Sep 20, 2022

ymao1 added this to AppEx: ResponseOps - Execution & Connectors Sep 20, 2022

ymao1 moved this to Awaiting Triage in AppEx: ResponseOps - Execution & Connectors Sep 20, 2022

mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Sep 20, 2022

mikecote mentioned this issue Sep 22, 2022

[Rule Registry] Default index settings and ILM policy for all indices #111152

Closed

ymao1 self-assigned this Oct 4, 2022

ymao1 moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Oct 4, 2022

ymao1 moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Oct 12, 2022

ymao1 closed this as completed Oct 17, 2022

Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Oct 17, 2022

ymao1 mentioned this issue Nov 14, 2022

[Response Ops][Alerting] Install framework alerts-as-data resources on startup #145100

Closed
