[Response Ops][Alerting] Research best practices for bootstrapping alerts as data indices #141146

Closed
ymao1 opened this issue Sep 20, 2022 · 5 comments
Assignees
Labels
Feature:Alerting research Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@ymao1
Contributor

ymao1 commented Sep 20, 2022

The alerting framework plans to start persisting alerts-as-data at a framework level using a single .alerts-default index. This index will need to be created on startup with an ILM policy, component templates and aliases. We will also need the ability to update the index mappings with each version. We are currently doing similar things with the event log index and the rule registry alert indices, both of which have run into various issues, large and small. We should consolidate some of our learnings from installing those indices into best practices in order to apply them to the new .alerts-default index.

Related issues and PRs:

@ymao1 ymao1 added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) research labels Sep 20, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote
Contributor

Linking with #111152.

@ymao1 ymao1 self-assigned this Oct 4, 2022
@ymao1 ymao1 moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Oct 4, 2022
@ymao1
Contributor Author

ymao1 commented Oct 12, 2022

Resources that need to be installed for framework alerts-as-data (FAAD??)

  • ILM policy
  • Component templates
  • Index template
  • Concrete write index

Summary of existing resource installation behavior

ILM Policy

Both the Event Log and Rule Registry plugins use a fairly standard ILM policy with a hot phase that rolls over at 50GB or 30 days. The Event Log policy deletes data after 90 days, while the Rule Registry policy has no delete phase. For framework alerts-as-data, we will likely want to keep the data around indefinitely as well. We should add _meta.managed:true to the policy to signal to users that it is a managed policy and shouldn't be modified.

Component templates

Event log does not use component templates. Rule registry installs two common component templates at plugin setup (ECS fields and the "technical field map") and then installs solution-specific component templates as needed (on write). Framework alerts-as-data should use component templates, and two make sense: one for ECS (auto-generated from the latest ECS fieldset) and one for the default alerts-as-data schema.

Index templates

Event Log
Event log creates a new index template with every Kibana version upgrade (version included in the template name). This allows for index mapping updates between versions, but there is no check to ensure that field mappings remain compatible between versions. If mapping clashes ever arose between versions, queries could be written to take the version into account.

Rule Registry
Rule registry installs solution-specific index templates as needed (on write) that include solution + namespace in the template name. Security creates an index template per namespace while Observability only creates a single template. The current Kibana version is written to the index template metadata. Only additive mapping updates are allowed to the index templates. To verify that a template update is compatible, the template is simulated first. If there are errors during the simulation (possibly due to non-additive changes to the mapping), an error is thrown and rule registry writes are disabled for that index.

Concrete write indices

Event Log
Event log creates a concrete write index for the versioned index template (version included in the index name).

Rule Registry
Rule registry creates a concrete write index as needed (on write) that includes solution + namespace in the index name. When index mappings are updated between versions, the mappings for all concrete indices (as retrieved via alias) are also updated, provided there are no errors during template simulation. If there are errors during template simulation (as can happen when partial snapshot indices exist), the mapping is not updated for the concrete index.
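
A rough sketch of that per-index update flow, for reference. The `IndexClient` interface and its method names are hypothetical stand-ins for the relevant Elasticsearch calls (resolve alias, simulate, put mapping); this is not the rule registry's actual code.

```typescript
type Mappings = Record<string, unknown>;

// Hypothetical client interface; each method stands in for an ES request.
interface IndexClient {
  getIndicesForAlias(alias: string): Promise<string[]>;
  simulateIndexMappings(index: string): Promise<Mappings>;
  putMappings(index: string, mappings: Mappings): Promise<void>;
}

interface UpdateResult {
  updated: string[];
  skipped: string[];
}

// Update every concrete index behind the alias, skipping any index whose
// simulation fails (e.g. a partial snapshot index) instead of aborting.
async function updateAliasedIndexMappings(
  client: IndexClient,
  alias: string
): Promise<UpdateResult> {
  const result: UpdateResult = { updated: [], skipped: [] };
  const indices = await client.getIndicesForAlias(alias);
  for (const index of indices) {
    let mappings: Mappings;
    try {
      mappings = await client.simulateIndexMappings(index);
    } catch {
      // Simulation failed: leave this index's mappings untouched.
      result.skipped.push(index);
      continue;
    }
    await client.putMappings(index, mappings);
    result.updated.push(index);
  }
  return result;
}
```

Returning the skipped indices lets the caller log or surface them rather than failing silently.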

Suggested Framework Alerts-as-Data resource installation behavior

ILM Policy

ILM policy should follow the rule registry with a hot phase and no delete phase. Ensure that _meta.managed:true is set to surface the managed policy warning to users and dissuade them from modifying the policy.

{
  _meta: {
    managed: true,
  },
  phases: {
    hot: {
      actions: {
        rollover: {
          max_age: '30d',
          max_primary_shard_size: '50gb',
        },
      },
    },
  },
}

Component templates

Framework alerts-as-data should only need two component templates: an ECS component template and an alerts-as-data component template. The ECS component template should be auto-generated from ECS with a script, similar to what the event log does, except we need to pull over all ECS fields. When creating an index template composed of these two component templates, we should list the ECS component template last so that the official ECS mappings take precedence, just in case we define a field in the alerts-as-data component template with the same name.
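
To illustrate the ordering point: Elasticsearch merges component templates in the order they appear in composed_of, with the last one taking precedence. Below is a simplified model of that merge using flat field maps. The template names and the clashing field are made up for illustration, and real mappings are nested rather than flat.

```typescript
// Field name -> mapping definition; a flat simplification of real ES mappings.
type FieldMappings = Record<string, { type: string }>;

interface ComponentTemplate {
  name: string;
  properties: FieldMappings;
}

// Merge in composed_of order: later templates overwrite earlier ones on a clash.
function resolveMappings(composedOf: ComponentTemplate[]): FieldMappings {
  const resolved: FieldMappings = {};
  for (const template of composedOf) {
    Object.assign(resolved, template.properties);
  }
  return resolved;
}

// Hypothetical template names and fields, for illustration only.
const alertsTemplate: ComponentTemplate = {
  name: 'alerts-default-component-template',
  properties: {
    'kibana.alert.status': { type: 'keyword' },
    // Clashes with ECS: declared with the wrong type on purpose.
    'event.kind': { type: 'text' },
  },
};

const ecsTemplate: ComponentTemplate = {
  name: 'alerts-ecs-component-template',
  properties: {
    'event.kind': { type: 'keyword' },
    '@timestamp': { type: 'date' },
  },
};

// ECS listed last, so its definition of event.kind is the one that is kept.
const resolved = resolveMappings([alertsTemplate, ecsTemplate]);
```

With the order reversed, the accidental `text` definition from the alerts-as-data template would override the ECS one.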

Index template and concrete write index

We should ensure these settings are in the index template:

hidden: true
auto_expand_replicas: '0-1'
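
As a sketch, an index template body carrying these settings might look like the following. All names passed in (index pattern, component template names, ILM policy name, alias) are placeholders, and the `priority`, `dynamic: false` and `rollover_alias` choices are assumptions rather than decisions made in this issue.

```typescript
interface IndexTemplateBody {
  index_patterns: string[];
  composed_of: string[];
  priority: number;
  template: {
    settings: {
      hidden: boolean;
      auto_expand_replicas: string;
      'index.lifecycle': { name: string; rollover_alias: string };
    };
    mappings: { dynamic: boolean };
  };
  _meta: { managed: boolean };
}

interface TemplateOptions {
  indexPattern: string;
  alertsComponentTemplate: string;
  ecsComponentTemplate: string;
  ilmPolicyName: string;
  rolloverAlias: string;
}

function buildIndexTemplate(opts: TemplateOptions): IndexTemplateBody {
  return {
    index_patterns: [opts.indexPattern],
    // ECS goes last so its mappings take precedence on a field name clash.
    composed_of: [opts.alertsComponentTemplate, opts.ecsComponentTemplate],
    priority: 1, // assumption: any value that beats less specific templates
    template: {
      settings: {
        hidden: true,
        auto_expand_replicas: '0-1',
        'index.lifecycle': {
          name: opts.ilmPolicyName,
          rollover_alias: opts.rolloverAlias,
        },
      },
      mappings: { dynamic: false }, // assumption: unmapped fields are not indexed
    },
    _meta: { managed: true },
  };
}
```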

Our index strategy will depend on whether we want to allow non-additive schema changes between versions:

Only additive changes
If we decide that only additive changes will be allowed to the schema, we can re-use the rule registry strategy where mappings in the index template are updated and the mappings of every backing index are updated on schema change. Because we would be installing our resources at plugin setup vs on-demand, it should be easier to catch errors (mapping conflicts, not-allowed mapping changes) during development. We would not need to roll over the backing indices manually since we are directly updating the concrete index mappings, so the ILM policy would handle rolling over at 30 days or 50 GB.
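
One way to catch not-allowed changes during development would be a local check along these lines. This is a sketch under simplifying assumptions (flat field maps, type changes and removals are the only violations considered) and is separate from the template simulation the rule registry relies on; the function and type names are hypothetical.

```typescript
type FieldMappings = Record<string, { type: string }>;

interface MappingConflict {
  field: string;
  from: string;
  to: string | null; // null means the field was removed from the schema
}

// Report every field in the current schema that the next schema changes or drops.
// New fields in `next` are additive and therefore never reported.
function findNonAdditiveChanges(
  current: FieldMappings,
  next: FieldMappings
): MappingConflict[] {
  const conflicts: MappingConflict[] = [];
  for (const field of Object.keys(current)) {
    const existing = current[field];
    if (existing === undefined) continue;
    const updated = next[field];
    if (updated === undefined) {
      conflicts.push({ field, from: existing.type, to: null });
    } else if (updated.type !== existing.type) {
      conflicts.push({ field, from: existing.type, to: updated.type });
    }
  }
  return conflicts;
}
```

A unit test could run this against the previous version's schema and fail the build when the returned list is non-empty.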

Non-additive changes allowed
Because we're reducing the complexity of the index template and index initialization by allowing just a single alerts-as-data index, and because the framework will be in charge of the schema (vs the decentralized schema generation of the rule registry), we should have more freedom to allow non-additive schema changes to the index mapping between version upgrades. Although we are not using data streams because we want mutable documents, we can follow the general data stream guidance when it comes to mapping updates, which is: "If you need to change the mapping of an existing field, create a new data stream and reindex your data into it." I think we can use this guidance to make the framework alerts-as-data paradigm more similar to the Event Log, where we create a new index template and concrete write index per version. This would allow us to make non-additive schema changes between versions, with the ability to target queries to specific versions if necessary while still querying across all alerts-as-data indices.

This strategy does have the downside of creating more indices, one for each version, which, as discussed elsewhere, puts more pressure on the heap when 1000+ fields are mapped. We have also seen with the event log this issue where the ILM policy continues to roll over empty indices, although that seems to have been addressed recently by this PR. We should consider this additional impact on heap size when deciding whether to allow non-additive schema changes.
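
For the per-version approach, resource names could be derived roughly as below. The naming scheme (suffixes, the `-000001` initial index, the wildcard query pattern) is an assumption modeled loosely on how versioned indices are usually named, not a settled convention for `.alerts-default`.

```typescript
interface VersionedResourceNames {
  indexTemplate: string;
  alias: string;
  initialIndex: string;
  queryPattern: string;
}

// Derive per-version resource names from a base name and the Kibana version.
function getVersionedResourceNames(
  baseName: string,
  kibanaVersion: string
): VersionedResourceNames {
  const versioned = `${baseName}-${kibanaVersion}`;
  return {
    indexTemplate: `${versioned}-template`,
    // Write alias for this version; ILM rolls over the indices behind it.
    alias: versioned,
    initialIndex: `${versioned}-000001`,
    // Matches the indices of every version, for cross-version queries.
    queryPattern: `${baseName}-*`,
  };
}

const names = getVersionedResourceNames('.alerts-default', '8.6.0');
```

Queries that must avoid a mapping clash can target `names.alias` for a single version, while everything else uses `names.queryPattern`.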

Retry

We have seen, via various SDHs, issues that crop up when the expected resources are not installed due to external (usually ES-related) errors. The rule registry currently doesn't contain much retry logic, but we've recently added retry logic to event log initialization that we should look into reusing. We can also look at the retry logic used by the Fleet plugin, which specifically looks for transient ES errors.
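
A minimal sketch of retrying only on transient errors, with exponential backoff. The error names and status codes treated as transient here are assumptions, not copied from the event log or Fleet implementations, and the function names are hypothetical.

```typescript
interface MaybeEsError {
  name?: string;
  statusCode?: number;
}

// Assumed set of transient failures; adjust to match the real client errors.
const TRANSIENT_ERROR_NAMES = ['ConnectionError', 'NoLivingConnectionsError', 'TimeoutError'];
const TRANSIENT_STATUS_CODES = [408, 429, 503, 504];

function isTransientError(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false;
  const { name, statusCode } = error as MaybeEsError;
  const nameMatches = name !== undefined && TRANSIENT_ERROR_NAMES.indexOf(name) !== -1;
  const statusMatches =
    statusCode !== undefined && TRANSIENT_STATUS_CODES.indexOf(statusCode) !== -1;
  return nameMatches || statusMatches;
}

// Exponential backoff: base, 2x base, 4x base, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

// Retry the operation on transient errors; rethrow anything else immediately.
async function retryTransientErrors<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = defaultSleep
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const isLastAttempt = attempt >= maxAttempts - 1;
      if (isLastAttempt || !isTransientError(error)) throw error;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Each installation step (ILM policy, component templates, index template, write index) could be wrapped in `retryTransientErrors` so that a mapping conflict fails fast while a brief ES outage does not.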

@ymao1 ymao1 moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Oct 12, 2022
@mikecote
Contributor

mikecote commented Oct 14, 2022

Great research! There's a lot learned that we'll be able to re-use for the FAAD indices!

If we decide that only additive changes will be allowed to the schema, we can re-use the rule registry strategy where mappings in the index template are updated and the mappings of every backing index are updated on schema change.

Yeah, I could see this approach being beneficial for the scenarios where we mutate alerts-as-data documents. For example, if we add new workflow timestamps (#141464), the update operation could fail if a document gets updated and the corresponding index doesn't have the latest mappings (assuming we're in strict mode).

@ymao1
Contributor Author

ymao1 commented Oct 17, 2022

Closing as research is complete and can be referenced when we implement index bootstrapping for framework alerts-as-data.


3 participants