[Detection Engine] Adds Alert Suppression to ML Rules #181926

Merged
rylnd merged 111 commits into elastic:main from ml_rule_alert_suppression
Jul 2, 2024

Conversation

rylnd
Contributor

@rylnd rylnd commented Apr 26, 2024

Summary

This PR introduces Alert Suppression for ML Detection Rules. This feature is behaviorally similar to alert suppression for other Detection Engine rule types, and nearly identical to the analogous feature for EQL rules.

There are some additional UI behaviors introduced here as well, mainly intended to cover the shortcomings discovered in #183100. Those behaviors are:

  1. Populating the suppression field list with fields from the anomaly index(es).
  2. Disabling the suppression UI if no selected ML jobs are running (because we cannot populate the list of fields on which they'll be suppressing).
  3. Warning the user if some selected ML jobs are not running (because the list of suppression fields may be incomplete).
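
The three behaviors above can be read as a single decision over the status of the selected ML jobs. The following is a minimal sketch of that decision, not the code in this PR: the `MlJobStatus` and `SuppressionUiState` types and the `getSuppressionUiState` function are hypothetical names introduced here for illustration.

```typescript
// Hypothetical shapes; the real Kibana types and hooks are not shown in this description.
interface MlJobStatus {
  id: string;
  isRunning: boolean;
}

type SuppressionUiState =
  | { kind: 'enabled' }
  | { kind: 'enabled_with_warning'; stoppedJobIds: string[] }
  | { kind: 'disabled' };

// Decide how the suppression UI should behave for the selected jobs.
function getSuppressionUiState(selectedJobs: MlJobStatus[]): SuppressionUiState {
  const running = selectedJobs.filter((job) => job.isRunning);
  const stopped = selectedJobs.filter((job) => !job.isRunning);

  // No running jobs: the list of suppression fields cannot be populated.
  if (running.length === 0) {
    return { kind: 'disabled' };
  }
  // Some jobs stopped: the list of suppression fields may be incomplete.
  if (stopped.length > 0) {
    return { kind: 'enabled_with_warning', stoppedJobIds: stopped.map((job) => job.id) };
  }
  return { kind: 'enabled' };
}
```

In this sketch an empty job selection is treated the same as "no jobs running"; whether the real form does so is an assumption.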

See screenshots below for more info.

Intermediate Serverless Deployment

As per the "intermediate deployment" requirements for serverless, while the schema (and declared alert SO mappings) will be extended to allow this functionality, the user-facing features are currently hidden behind a feature flag. Once this is merged and released, we can issue a "final" deployment in which the feature flag is enabled, and the feature effectively released.
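
As a rough illustration of the split described above (schema accepts the new parameters, UI stays hidden until the flag is on), here is a minimal sketch. The flag name `mlSuppressionUiEnabled`, the trimmed-down `MlRuleParams` shape, and both functions are assumptions made for this example and are not taken from the PR.

```typescript
// Hypothetical flag and parameter shapes, for illustration only.
interface ExperimentalFeatures {
  mlSuppressionUiEnabled: boolean;
}

interface MlRuleParams {
  anomaly_threshold: number;
  alert_suppression?: { group_by: string[] };
}

// Schema-level check: suppression params are accepted whether or not the flag is on.
function isValidMlRuleParams(params: MlRuleParams): boolean {
  return params.alert_suppression === undefined || params.alert_suppression.group_by.length > 0;
}

// UI-level check: the suppression controls are only rendered when the flag is on.
function shouldShowSuppressionUi(features: ExperimentalFeatures): boolean {
  return features.mlSuppressionUiEnabled;
}
```

The point of the design is that a rule saved with suppression parameters remains valid across the intermediate and final deployments; only its visibility in the form changes.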

Screenshots

  • Overview of new UI fields
    Screenshot 2024-05-16 at 3 22 02 PM
  • Example of Anomaly fields in suppression combobox
    Screenshot 2024-06-06 at 5 14 17 PM
  • Suppression disabled due to no jobs running
    Screenshot 2024-06-17 at 11 23 39 PM
  • Warning due to not all jobs running
    Screenshot 2024-06-17 at 11 26 16 PM

Steps to Review

  1. Review the Test Plan for an overview of behavior
  2. Review Integration tests for an overview of implementation and edge cases
  3. Review Cypress tests for an overview of UX changes
  4. Testing on Demo Instance (elastic/changeme)
    1. This instance has the relevant feature flag enabled and contains some sample auditbeat data, as well as the anomalies archive data for exercising an ML rule against "real" anomalies
    2. There are a few example rules in the default space:
      1. A simple query rule against auditbeat data
      2. An ML rule with per-execution suppression on both by_field_name and by_field_value (which ends up not actually suppressing anything)
      3. An ML rule with per-execution suppression on by_field_name (which suppresses all anomalies into a single alert)
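
To see why the two example ML rules behave so differently, it helps to think of suppression as grouping anomalies by the values of the chosen fields, with one alert per distinct group. The sketch below is a simplified model under that assumption, not the Detection Engine implementation; the `Anomaly` shape, the `countAlertsAfterSuppression` helper, and the sample values are all hypothetical.

```typescript
// Simplified anomaly shape; real anomaly documents carry many more fields.
interface Anomaly {
  by_field_name: string;
  by_field_value: string;
}

// Count how many alerts remain when anomalies sharing the same values
// for every suppression field are collapsed into a single alert.
function countAlertsAfterSuppression(
  anomalies: Anomaly[],
  suppressBy: Array<keyof Anomaly>
): number {
  const seenKeys: string[] = [];
  for (const anomaly of anomalies) {
    const key = JSON.stringify(suppressBy.map((field) => anomaly[field]));
    if (seenKeys.indexOf(key) === -1) {
      seenKeys.push(key);
    }
  }
  return seenKeys.length;
}

// Hypothetical sample data: same by_field_name, distinct by_field_value.
const sampleAnomalies: Anomaly[] = [
  { by_field_name: 'process.name', by_field_value: 'store' },
  { by_field_name: 'process.name', by_field_value: 'sshd' },
  { by_field_name: 'process.name', by_field_value: 'cron' },
];

// Every (name, value) pair is unique, so nothing is suppressed.
const alertsForBothFields = countAlertsAfterSuppression(sampleAnomalies, ['by_field_name', 'by_field_value']);
// All anomalies share one by_field_name, so they collapse into one alert.
const alertsForNameOnly = countAlertsAfterSuppression(sampleAnomalies, ['by_field_name']);
```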

Related Issues

Checklist

  • Functional changes are hidden behind a feature flag. If not hidden, the PR explains why these changes are being implemented in a long-living feature branch.
  • Functional changes are covered with a test plan and automated tests.
  • Stability of new and changed tests is verified using the Flaky Test Runner in both ESS and Serverless. By default, use 200 runs for ESS and 200 runs for Serverless.
  • Comprehensive manual testing is done by two engineers: the PR author and one of the PR reviewers. Changes are tested in both ESS and Serverless.
  • Mapping changes are accompanied by a technical design document. It can be a GitHub issue or an RFC explaining the changes. The design document is shared with and approved by the appropriate teams and individual stakeholders.
  • (OPTIONAL) OpenAPI specs changes include detailed descriptions and examples of usage and are ready to be released on https://docs.elastic.co/api-reference. NOTE: This is optional because at the moment we don't yet have any OpenAPI specs that would be fully "documented" and "GA-ready" for publishing on https://docs.elastic.co/api-reference.
  • Functional changes are communicated to the Docs team. A ticket is opened in https://github.com/elastic/security-docs using the Internal documentation request (Elastic employees) template. The following information is included: feature flags used, target ESS version, planned timing for ESS and Serverless releases.

rylnd added 5 commits April 25, 2024 21:52
This is mostly based on the current test plan. It's not wired up yet,
nor are there any actual implementations.
These now have type errors, since ML rules don't yet accept suppression
fields. We have our next task!
`node scripts/openapi/generate`
We're now asserting that suppression fields are present on the generated
alerts, which they're not, because we haven't implemented them yet.
That's the next step!
@rylnd rylnd added Feature:ML Rule Security Solution Machine Learning rule type Feature:Alert Suppression Security Solution Alert Suppression feature Team:Detection Engine Security Solution Detection Engine Area 8.15 candidate labels Apr 26, 2024
@rylnd rylnd self-assigned this Apr 26, 2024
rylnd added 13 commits May 1, 2024 17:29
* Adds a call to getIsSuppressionActive in our rule executor, and necessary
  dependencies
* Adds suppression fields to ML rule schema
* Adds feature flag for ML suppression
I noticed that it doesn't look like we're including a lot of timing info
in the ML executor; adding this to validate that, and document what we
_are_ recording.
This will light up the paths that we need to implement. Next!
This adds all the parameters necessary to invoke this method (if
relevant) in the ML rule executor. Given the relative simplicity of the
ML rule type, I'm guessing that many of these values are
irrelevant/unused in this case, but I haven't yet investigated that.

Next step is to exercise this implementation against the FTR tests, and
see if the behavior is what we expect. Once that's done, we can try to
pare down what we need/use.

I also added some TODOs in the course of this work to check some
potential bugs I noticed.
Tests were failing as rules were being created without suppression
params. Fixed!
We've got suppression fields making it into ML alerts for the first
time!

Now, to test the various suppression conditions.
I realized that most of these tests were using es_archiver to insert
anomalies into an index, but our tests were only ever using a single one
of those anomalies. In order to ensure these tests are independent of
the data in that archive, I've created and leveraged a helper to delete
all the persisted anomalies, and then use existing tooling to manually
insert the anomalies needed for our tests.

All of the current tests are green; there are just a few more
permutations that still need to be implemented.
This tests all of the interesting permutations of alert suppression for
ML rules, both with per-execution and interval suppression durations. I
added a few TODOs noting unexpected (to me) behavior; we'll see what
others think.
The behavior demonstrated in this test is in fact expected, as the
suppression duration window applies to the alert creation time, not the
original anomaly time.
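
The last commit message above notes that the suppression duration window is anchored to the alert's creation time rather than to the original anomaly's timestamp. A minimal sketch of that distinction follows; the `SuppressionWindow` type, both helper functions, the inclusive boundaries, and the sample times are assumptions for illustration and do not come from the PR.

```typescript
interface SuppressionWindow {
  start: Date;
  end: Date;
}

// The window starts when the first alert is created, not when the anomaly occurred.
function getSuppressionWindow(alertCreatedAt: Date, durationMs: number): SuppressionWindow {
  return {
    start: alertCreatedAt,
    end: new Date(alertCreatedAt.getTime() + durationMs),
  };
}

// Inclusive bounds are an assumption of this sketch.
function fallsInWindow(time: Date, win: SuppressionWindow): boolean {
  return time.getTime() >= win.start.getTime() && time.getTime() <= win.end.getTime();
}

const THIRTY_MINUTES_MS = 30 * 60 * 1000;
const firstAlertCreatedAt = new Date('2024-05-01T12:00:00Z');
const suppressionWindow = getSuppressionWindow(firstAlertCreatedAt, THIRTY_MINUTES_MS);

// A later rule execution picks up an anomaly that itself happened hours earlier.
const oldAnomalyTimestamp = new Date('2024-05-01T08:00:00Z');
const laterAlertCreatedAt = new Date('2024-05-01T12:20:00Z');

// Compared by alert creation time, the later alert falls inside the window.
const suppressedByCreationTime = fallsInWindow(laterAlertCreatedAt, suppressionWindow);
// Compared by anomaly time, it would fall outside the window.
const suppressedByAnomalyTime = fallsInWindow(oldAnomalyTimestamp, suppressionWindow);
```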
@rylnd
Contributor Author

rylnd commented May 8, 2024

/ci

rylnd added 3 commits May 8, 2024 14:34
Most other rule types have both a "fill" task and a "fillAndContinue"
task; this adds that pattern for ML rules on the Define step.
These are failing because I haven't yet enabled the suppression UI for
ML rules. Once that's done, we can start validating these tests.
rylnd added 5 commits July 1, 2024 13:13
Since some of these fields won't be mapped in the alerts index, we can't
always do the dynamic filter generation based on the suppression fields.
Until we have direction on how to handle that, we can at least display
the current alert by _id, and allow the analyst to expand the timeline
from there.
This mainly just composes some existing hooks that were previously
pieced together in the form itself (step_define_rule) into a new hook,
which is agnostic of the form itself.
Not sure where this came from, whether it was a bad conflict or just
some weird autocompletion that I missed.
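
The first of these commits describes a fallback: when the suppression fields are not all mapped in the alerts index, show the current alert by `_id` instead of generating a filter from the suppression fields. A minimal sketch of that choice is below. The `SuppressionTerm` and `AlertFilter` types and the `buildInvestigationFilter` function are hypothetical names, and the rule that every field must be mapped is an assumption of this sketch.

```typescript
interface SuppressionTerm {
  field: string;
  value: string;
}

type AlertFilter =
  | { kind: 'by_suppression_terms'; terms: SuppressionTerm[] }
  | { kind: 'by_id'; id: string };

// Use the suppression terms only when every field is mapped; otherwise fall back to _id.
function buildInvestigationFilter(
  alertId: string,
  terms: SuppressionTerm[],
  mappedFields: string[]
): AlertFilter {
  const allMapped =
    terms.length > 0 && terms.every((term) => mappedFields.indexOf(term.field) !== -1);
  return allMapped
    ? { kind: 'by_suppression_terms', terms }
    : { kind: 'by_id', id: alertId };
}
```

With the `_id` fallback the analyst still lands on the alert itself and can expand the timeline from there, as the commit message describes.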
Member

@pmuellr pmuellr left a comment


ResponseOps changes LGTM. Only changes there were in our schema-change test, indicating new parameters in the rules, which could cause BWC / ZDT issues in rollbacks. Conversation in thread #181926 (comment) sounds like this will be handled appropriately.

@kibana-ci
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 5582 | 5585 | +3 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 15.6MB | 15.6MB | +3.3KB |

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 83.7KB | 83.8KB | +68.0B |

Unknown metric groups

References to deprecated APIs

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 571 | 574 | +3 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @rylnd

Contributor

@michaelolo24 michaelolo24 left a comment


Investigations code owner changes. Nice work!

Contributor

@vitaliidm vitaliidm left a comment


Great work, @rylnd

I tested suppression on rule interval and during rule execution and did not find issues.

Consider adding FTR tests as per #181926 (comment), since we have had issues in the past with enrichment not working with suppression.

Have you also had a chance to check whether the tests are flaky?
I see it's checked in the description, but the links lead to the PR itself:

Stability of new and changed tests is verified using the Flaky Test Runner in both ESS and Serverless. By default, use 200 runs for ESS and 200 runs for Serverless.
ESS - Cypress x 200
Serverless - Cypress x 200
ESS - API x 200
Serverless - API x 200

@vitaliidm
Contributor

@rylnd

I also think that we should disable duration and missing fields checkboxes, when suppression fields controller is disabled.
Otherwise, the user can still edit them and save the rule with a new configuration.

Screen.Recording.2024-07-02.at.16.38.16.mov

As they rely on a feature flag to function.
@rylnd
Contributor Author

rylnd commented Jul 2, 2024

I also think that we should disable duration and missing fields checkboxes, when suppression fields controller is disabled.

@vitaliidm I'm looking into this, but we need to revisit all of those suppression conditions in the rule form. I'm going to merge this PR as is; we can determine whether there are any bugs with non-ML rules during testing tomorrow, and address the form fixes holistically as a follow-up.

@rylnd rylnd merged commit 2aa94a2 into elastic:main Jul 2, 2024
38 checks passed
@rylnd rylnd deleted the ml_rule_alert_suppression branch July 2, 2024 19:33
@kibanamachine kibanamachine added v8.15.0 backport:skip This commit does not require backporting labels Jul 2, 2024
@kibanamachine
Contributor

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6447

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 161/200 tests passed.

see run history

@kibanamachine
Contributor

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6448

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/serverless.config.ts: 156/200 tests passed.

see run history

@kibanamachine
Contributor

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6449

[❌] Security Solution Detection Engine - Cypress: 173/200 tests passed.

see run history

@kibanamachine
Contributor

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6450

[❌] [Serverless] Security Solution Detection Engine - Cypress: 168/200 tests passed.

see run history

rylnd added a commit to rylnd/kibana that referenced this pull request Jul 12, 2024
This was requested during review of elastic#181926, and I'm circling back to
that now.
rylnd added a commit that referenced this pull request Jul 23, 2024
## Summary

This PR is a followup to #181926. It includes the following changes:

- Refactoring some Rule Form logic with `useMemo`
  - Requested [in this discussion](#181926 (comment))
  - Addressed in a5fcf4d
- Adds FTR tests validating ML Suppression supports alert enrichment
  - Requested [during previous review](#181926 (comment))
  - Addressed in d5aa551
- Disables ML Suppression fields as a group
  - Requested in [this comment](#181926 (comment))
  - Addressed by 983945b


### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [x] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jul 23, 2024
(cherry picked from commit e2150de)
rylnd added a commit to rylnd/kibana that referenced this pull request Jul 25, 2024
Labels
8.15 candidate backport:skip This commit does not require backporting Feature:Alert Suppression Security Solution Alert Suppression feature Feature:ML Rule Security Solution Machine Learning rule type release_note:enhancement Team:Detection Engine Security Solution Detection Engine Area v8.15.0