-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Uptime] Use scripted metric for snapshot calculation #58247
Conversation
Pinging @elastic/uptime (Team:uptime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a few questions and suggestions for cleaning, naming, commenting, but the base code looks good to me. I also still need to finish a functional review.
return state; | ||
`, | ||
reduce_script: ` | ||
// Use a treemap since it's later traversable in sorted order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's later traversable in sorted order
I'm not familiar with the TreeMap
class, am I understanding correctly that it is self-balancing? Meaning as keys are inserted, it handles the sort based on the comparison function you provide to merge
below?
I.e. if I have a map with keys 1, 4, 5
and I insert 3
, then traverse the entrySet
, it will iterate like 1 3 4 5
?
If that's correct, it might be good to expand this comment a little, since we are writing Java in a TypeScript file; it's reasonable that someone viewing this code might not be able to understand it easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, it will maintain the keys in order. Merge doesn't have anything to do with the sorting, I've added a comment below that explains that. Merge just updates the value if we have a more recent check from the same location.
The order of the treemap uses the built-in compareTo implementation of java's String class.
|
||
// Parse the length delimited id/location strings described in the map section | ||
int colonIndex = idLoc.indexOf(":"); | ||
int idEnd = Integer.parseInt(idLoc.substring(0, colonIndex), 16) + colonIndex + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 16 the radix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, since we hex encode the numbers for density
...cy/plugins/uptime/server/lib/adapters/monitor_states/elasticsearch_monitor_states_adapter.ts
Outdated
Show resolved
Hide resolved
String loc = idLoc.substring(idEnd, idLoc.length()); | ||
String status = timeStatus.substring(timeStatus.length() - 1); | ||
|
||
locTotals.compute(loc, (k,v) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment heading this block would be helpful be useful to a javascript developer 😅.
My understanding is we are updating the value for key loc
, and the output of the provided function determines the new value. If the value was null
, we create a new HashMap, then we increment appropriate values based on the documents we iterate over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct. I'll add a comment
...cy/plugins/uptime/server/lib/adapters/monitor_states/elasticsearch_monitor_states_adapter.ts
Outdated
Show resolved
Hide resolved
counts[leastCommonStatus] = await slowStatusCount(context, leastCommonStatus); | ||
counts[mostCommonStatus] = counts.total - counts[leastCommonStatus]; | ||
} | ||
const counts = await statusCount(context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it'd be better to name this function getStatusCount
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if get
has any particular meaning at least in my head, unless there's something to juxtapose it against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair
}; | ||
}; | ||
|
||
const slowStatusCount = async (context: QueryContext, status: string): Promise<number> => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now rather than having a fast/slow count, we're able to just have one counter (slower, but still fast, and always accurate), right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, WFG
💚 Build SucceededHistory
To update your PR or re-run it, just comment with: |
Fixes elastic#58079 This is an improved version of elastic#58078 Note, this is a bugfix targeting 7.6.1 . I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later. This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful. We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added. I attempted to keep memory usage relatively slow by using simple maps of strings.
Fixes #58079 This is an improved version of #58078 Note, this is a bugfix targeting 7.6.1 . I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later. This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful. We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added. I attempted to keep memory usage relatively slow by using simple maps of strings.
…elastic#58389) Fixes elastic#58079 This is an improved version of elastic#58078 Note, this is a bugfix targeting 7.6.1 . I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later. This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful. We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added. I attempted to keep memory usage relatively slow by using simple maps of strings.
…re/files-and-filetree * 'master' of github.com:elastic/kibana: (174 commits) [SIEM] Fix unnecessary re-renders on the Overview page (elastic#56587) Don't mutate error message (elastic#58452) Fix service map popover transaction duration (elastic#58422) [ML] Adding filebeat config to file dataviz (elastic#58152) [Uptime] Improve refresh handling when generating test data (elastic#58285) [Logs / Metrics UI] Remove path prefix from ViewSourceConfigur… (elastic#58238) [ML] Functional tests - adjust classification model memory (elastic#58445) [ML] Use event.timezone instead of beat.timezone in file upload (elastic#58447) [Logs UI] Unskip and stabilitize log column configuration tests (elastic#58392) [Telemetry] Separate the license retrieval from the stats in the usage collectors (elastic#57332) hide welcome screen for cloud (elastic#58371) Move src/legacy/ui/public/notify/app_redirect to kibana_legacy (elastic#58127) [ML] Functional tests - stabilize typing during df analytics creation (elastic#58227) fix short url in spaces (elastic#58313) [SIEM] Upgrades cypress to version 4.0.2 (elastic#58400) [Index management] Move to new platform "plugins" folder (elastic#58109) [kbn/optimizer] disable parallelization in terser plugin (elastic#58396) [Uptime] Delete useless try...catch blocks (elastic#58263) [Uptime] Use scripted metric for snapshot calculation (elastic#58247) (elastic#58389) [APM] Stabilize agent configuration API (elastic#57767) ... # Conflicts: # src/plugins/console/public/application/containers/editor/legacy/console_editor/editor.tsx
…elastic#58389) (elastic#58415) Fixes elastic#58079 This is an improved version of elastic#58078 Note, this is a bugfix targeting 7.6.1 . I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later. This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful. We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added. I attempted to keep memory usage relatively slow by using simple maps of strings. Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Summary
Fixes #58079
This is an improved version of #58078
Note, this is a bugfix targeting 7.6.1 . I've decided to open this PR directly against 7.6 in the interest of time. We can forward-port this to 7.x / master later.
This patch improves the handling of timespans with snapshot counts. This feature originally worked, but suffered a regression when we increased the default timespan in the query context to 5m. This means that without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not really that useful.
We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added.
I attempted to keep memory usage relatively slow by using simple maps of strings.
Checklist
Delete any items that are not applicable to this PR.
For maintainers