[Security Solutions] Updates usage collector telemetry to use PIT (Point in Time) and restructuring of folders #124912
x-pack/plugins/security_solution/server/usage/detections/rules/get_metrics.ts
RE: #93770
I'm not sure about this PR @FrankHassanabad. Instead of doing one upfront call, you are making significantly more network calls to Elasticsearch and potentially slowing down the entire usage collector for Kibana, and possibly timing it out on our end, resulting in no telemetry. As telemetry processing is time sensitive, I'm not sure the size: 10_000 advice really applies to this, but rather to sporadic and ad-hoc requests to ES in Kibana APIs.
Also, are you sure the PIT API applies to aggregations? The documentation only discusses queries: https://www.elastic.co/guide/en/elasticsearch/reference/current/point-in-time-api.html
AFAIK, Snapshot telemetry is collected via the /api/telemetry/v2/clusters/_stats endpoint, which times out after one minute. Really our usage collector needs to finish processing within one minute for us to get anything. You are keeping the connection live in some cases up to 5m.
I will continue to test it, but I think we should briefly meet to discuss these changes. We need to do some profiling on a production-like cluster. I think the folder reorg is fine.
@afharo is a very strong ally of the security business from a stack perspective. Alejandro, can you run your eyes over this PR, please?
Edit: Looking at other people's telemetry code I see the pattern using From this PR: So I think I'm on the right track and have adjusted the default from 100 to 1k in the constants and updated the description. Let me know if there's something else for me to do with this. The PIT recommendations are from here and this is the ticket we have to solve based off it:
So, as far as I know, I am following what they're saying whether I'm using SO PIT or ES PIT. For their medium-term goals, Kibana core is saying on that ticket that they are going to start throwing in development mode:
which is why I added PIT to the queries.
Edit: Yes, asking on slack I got back an affirmative that this is ok 👍
Edit: See above, I changed it to 1k like the others are in the code base.
Edit: See above, I changed it to 1k like the others in the code base. Hopefully we are good with this and timing.
The PIT does not equal the connection. The PIT has a timeout of 5m, meaning that if it is not explicitly released within 5m, it will expire regardless. Unless an exception occurs, those handles will be closed by us before 5m, as soon as the querying is completed. We can reduce it to 1m if we only have 1 minute to query.
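To make the PIT lifetime point concrete, here is a minimal sketch of paging with a PIT and `search_after`. The `PitClient` interface and the `fetchAllWithPit` helper are hypothetical, simplified shapes for illustration only; they are not the real Elasticsearch client types or the helper in this PR. The sketch also assumes a `try/finally`, so the PIT is released even when a page fails, which may differ from what the PR does.

```typescript
// Hypothetical minimal client shape, loosely modeled on the PIT + search_after
// flow. These are NOT the real @elastic/elasticsearch types.
interface Hit<T> {
  _source: T;
  sort: unknown[];
}

interface PitClient<T> {
  openPointInTime(params: { index: string; keep_alive: string }): Promise<{ id: string }>;
  search(params: {
    pit: { id: string; keep_alive: string };
    size: number;
    search_after?: unknown[];
  }): Promise<{ pit_id?: string; hits: Hit<T>[] }>;
  closePointInTime(params: { id: string }): Promise<void>;
}

// Pages through every hit, pageSize documents at a time. keepAlive is how long
// the PIT survives without being used or released; it is not a connection
// timeout, and each search call is its own short request.
async function fetchAllWithPit<T>(
  client: PitClient<T>,
  index: string,
  pageSize = 1000,
  keepAlive = '5m'
): Promise<T[]> {
  const results: T[] = [];
  const { id } = await client.openPointInTime({ index, keep_alive: keepAlive });
  let pitId = id;
  try {
    let searchAfter: unknown[] | undefined;
    while (true) {
      const page = await client.search({
        pit: { id: pitId, keep_alive: keepAlive },
        size: pageSize,
        search_after: searchAfter,
      });
      // The server may hand back a refreshed PIT id; keep using the latest one.
      pitId = page.pit_id ?? pitId;
      if (page.hits.length === 0) break;
      for (const hit of page.hits) results.push(hit._source);
      searchAfter = page.hits[page.hits.length - 1].sort;
      // A short page means there is nothing left to fetch.
      if (page.hits.length < pageSize) break;
    }
  } finally {
    // Release the PIT explicitly instead of waiting for keepAlive to expire.
    await client.closePointInTime({ id: pitId });
  }
  return results;
}
```

With a one-minute budget, a caller could pass `'1m'` as `keepAlive`; the number of round trips is roughly the document count divided by `pageSize`, which is the trade-off being discussed here.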
const moduleJobs = modules.flatMap((module) => module.jobs);
const jobs = await mlClient.jobServiceProvider(fakeRequest, savedObjectsClient).jobsSummary();

jobsUsage = jobs.filter(isSecurityJob).reduce((usage, job) => {
One interesting thing I noticed last week is that this doesn't represent custom security jobs accurately. This is because updateMlJobUsage checks whether a job has been tagged with siem or security, but it is pretty unlikely that a user is going to create a custom job and label it with our labels. We probably don't want to filter with this condition for custom ML jobs, and we could simplify updateMlJobUsage, but you don't need to worry about this in this PR. Just making you aware.
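As a rough illustration of why untagged custom jobs drop out of the counts, here is a small sketch of a tag-based filter. The `JobSummary` shape, the `isTaggedSecurityJob` name, and the job ids are all made up for this example; only the two labels, siem and security, come from the comment above.

```typescript
// Hypothetical, simplified job shape; the real ML job summary has more fields.
interface JobSummary {
  id: string;
  groups: string[];
}

// The two labels named in the review comment.
const SECURITY_GROUPS = ['siem', 'security'];

// A job counts as a security job only if it carries one of the labels.
const isTaggedSecurityJob = (job: JobSummary): boolean =>
  job.groups.some((group) => SECURITY_GROUPS.includes(group));

// Illustrative data: one job tagged with our label, one custom job with no tags.
const exampleJobs: JobSummary[] = [
  { id: 'prebuilt_example_job', groups: ['security', 'auditbeat'] },
  { id: 'custom_example_job', groups: [] },
];

// Only the tagged job survives the filter; the custom job is never counted.
const countedJobIds = exampleJobs.filter(isTaggedSecurityJob).map((job) => job.id);
console.log(countedJobIds);
```

Counting custom jobs would need some other signal than these labels, which is outside the scope of this PR.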
Ok, I am not going to change any of this part at this point for this PR.
Pete's notes
When I start Kibana from your PR I get the following error message (before any rules / ml jobs loaded in):
Manual Actions (Detection Rules)
https://gist.github.com/pjhampton/17b0f353a0dae0da20f8a35ff50117d3
Manual Actions (Machine Learning)
https://gist.github.com/pjhampton/0c4a2cb17ae914c54b239f1f49e23bfb
Overall I think this PR is good and works as expected except for a couple of caveats. I really like the new folder structure; I think that is the biggest win of this PR. Given that you have fixed the alert counts for detection rules in this PR, I think we should ship it in 8.1, assuming you can get it merged this week.
Thanks for testing it out... For this error you're seeing below:
$ gh pr checkout 124912
$ yarn start --no-base-path
...
> [2022-02-14T11:21:47.701+00:00][ERROR][plugins.securitySolution] Encountered error in telemetry of message: No known job with id 'security', error: Error: No known job with id 'security'. Telemetry for "ml_jobs" will be skipped.
$ curl -XGET 'http://elastic:changeme@localhost:5601/api/stats?extended=true&legacy=true'
> [2022-02-14T11:26:25.300+00:00][ERROR][plugins.securitySolution] Encountered error in telemetry of message: No known job with id 'security', error: Error: No known job with id 'security'. Telemetry for "ml_jobs" will be skipped.
I think that has existed for a while, but I don't know if that is expected from the ML API or not. It does appear to be coming from their API. I don't know if there is another way we should change the query, but my changes aren't causing this as something "new" happening. I will change the
/**
 * Same as "RuleSearchResult" just a partial is applied. Useful for unit tests since these types aren't exact.
 */
export type PartialRuleSearchResult = Omit<
Dead type. Already removed in a follow up commit
Looks fine.
Please reduce the logging level from info to debug in this PR before merge. I mistyped in one of my previous comments.
💛 Build succeeded, but was flaky
Friendly reminder: Looks like this PR hasn’t been backported yet.
v8.2.0 does not need backporting today... adding the label
Summary
Changes the usage collector telemetry within security solutions to use PIT (Point in Time) and a few other bug fixes and restructuring.
caseComments and the Sanitized Alerts and the ML job types using Partial and other TypeScript tricks.
Checklist
Delete any items that are not applicable to this PR.