[Uptime] Use scripted metric for snapshot calculation #58247

Merged: andrewvc merged 7 commits into elastic:7.6 from the scripted-metric-count branch on Feb 24, 2020

Conversation

@andrewvc andrewvc commented Feb 21, 2020

Summary

Fixes #58079

This is an improved version of #58078

Note: this is a bugfix targeting 7.6.1. I've opened this PR directly against 7.6 in the interest of time. We can forward-port it to 7.x / master later.

This patch fixes how snapshot counts handle timespans. The feature originally worked, but regressed when we increased the default timespan in the query context to 5m. Without this patch, the counts you get are the maximum total number of monitors that were down over the past 5m, which is not very useful.

We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms. This should scale as shards are added.

I attempted to keep memory usage relatively low by using simple maps of strings.
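For readers unfamiliar with the map/combine/reduce flow of a scripted metric, here is a minimal TypeScript sketch of the counting idea, not the Painless script from this PR. The `SummaryDoc` shape, the function names, and the rule that a monitor counts as down when its latest check from any location is down are all assumptions made for illustration.

```typescript
// Hypothetical document shape, for illustration only.
interface SummaryDoc {
  monitorId: string;
  location: string;
  timestamp: number; // epoch milliseconds
  status: "up" | "down";
}

type LatestChecks = Map<string, SummaryDoc>;

// Keep only the newest check per monitor/location pair.
function keepLatest(state: LatestChecks, doc: SummaryDoc): void {
  const key = JSON.stringify([doc.monitorId, doc.location]);
  const seen = state.get(key);
  if (seen === undefined || doc.timestamp > seen.timestamp) {
    state.set(key, doc);
  }
}

// Per-shard phase: fold that shard's documents into one small state object.
function shardState(docs: SummaryDoc[]): LatestChecks {
  const state: LatestChecks = new Map();
  docs.forEach((doc) => keepLatest(state, doc));
  return state;
}

// Reduce phase: merge all shard states, then count each monitor once.
// Assumption: a monitor counts as down if its latest check from any location is down.
function snapshotCounts(states: LatestChecks[]): { up: number; down: number; total: number } {
  const merged: LatestChecks = new Map();
  states.forEach((state) => state.forEach((doc) => keepLatest(merged, doc)));

  const downByMonitor = new Map<string, boolean>();
  merged.forEach((doc) => {
    const alreadyDown = downByMonitor.get(doc.monitorId) === true;
    downByMonitor.set(doc.monitorId, alreadyDown || doc.status === "down");
  });

  let down = 0;
  downByMonitor.forEach((isDown) => {
    if (isDown) down += 1;
  });
  return { up: downByMonitor.size - down, down, total: downByMonitor.size };
}
```

Because only the newest check per monitor/location survives the merge, a monitor that was briefly down earlier in the window but has since recovered is counted as up.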

@andrewvc added the bug, [zube]: In Review, Team:Uptime - DEPRECATED, and v7.6.1 labels Feb 21, 2020
@andrewvc andrewvc self-assigned this Feb 21, 2020
@elasticmachine

Pinging @elastic/uptime (Team:uptime)


@justinkambic justinkambic left a comment

I had a few questions and suggestions for cleaning, naming, commenting, but the base code looks good to me. I also still need to finish a functional review.

return state;
`,
reduce_script: `
// Use a treemap since it's later traversable in sorted order
justinkambic (Contributor):

"it's later traversable in sorted order"

I'm not familiar with the TreeMap class, am I understanding correctly that it is self-balancing? Meaning as keys are inserted, it handles the sort based on the comparison function you provide to merge below?

I.e. if I have a map with keys 1, 4, 5 and I insert 3, then traverse the entrySet, it will iterate like 1 3 4 5?

If that's correct, it might be good to expand this comment a little, since we are writing Java in a TypeScript file; it's reasonable that someone viewing this code might not be able to understand it easily.

andrewvc (Contributor Author):

Exactly, it will maintain the keys in order. Merge doesn't have anything to do with the sorting, I've added a comment below that explains that. Merge just updates the value if we have a more recent check from the same location.

The order of the TreeMap uses the built-in compareTo implementation of Java's String class.
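To make the two behaviours discussed here concrete for TypeScript readers, the sketch below emulates them with a plain `Map`. It is an illustration only, not the PR's Painless code: `Check`, `mergeLatest`, and `sortedKeys` are hypothetical names. A JavaScript `Map` iterates in insertion order, so the sort is done explicitly; the default `Array.prototype.sort` compares strings by UTF-16 code units, which gives the same ordering as Java's `String.compareTo`.

```typescript
// Hypothetical value type: one check result with its timestamp.
interface Check {
  timestamp: number;
  status: string;
}

// Insert-or-update that keeps whichever check is more recent. This loosely
// mirrors Java's Map.merge with a "pick the newer value" remapping function.
// It has no effect on key ordering.
function mergeLatest(map: Map<string, Check>, key: string, incoming: Check): void {
  const existing = map.get(key);
  if (existing === undefined || incoming.timestamp > existing.timestamp) {
    map.set(key, incoming);
  }
}

// Emulate TreeMap traversal order by sorting the keys on demand.
function sortedKeys(map: Map<string, Check>): string[] {
  return Array.from(map.keys()).sort();
}

const checks = new Map<string, Check>();
["1", "4", "5"].forEach((key) => mergeLatest(checks, key, { timestamp: 1, status: "up" }));
mergeLatest(checks, "3", { timestamp: 2, status: "down" }); // inserted last
mergeLatest(checks, "3", { timestamp: 0, status: "up" }); // older check, ignored

const order = sortedKeys(checks); // ["1", "3", "4", "5"]
```

A real `TreeMap` keeps its keys ordered on every insert rather than sorting at read time, but the traversal order a reader observes is the same.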


// Parse the length delimited id/location strings described in the map section
int colonIndex = idLoc.indexOf(":");
int idEnd = Integer.parseInt(idLoc.substring(0, colonIndex), 16) + colonIndex + 1;
justinkambic (Contributor):

Is 16 the radix?

andrewvc (Contributor Author):

Exactly, since we hex-encode the numbers for density.
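The length-delimited format can be sketched in TypeScript as follows. The decoding logic follows the reviewed snippet (hex length, then a colon, then the id, then the location); `encodeIdLoc` and both function names are assumptions added so the round trip can be shown, and the actual map script may build the string differently.

```typescript
// Hypothetical encoder: "<id length in hex>:<id><location>".
function encodeIdLoc(id: string, loc: string): string {
  return id.length.toString(16) + ":" + id + loc;
}

// Decoder following the reviewed snippet: read the hex length before the first
// colon, then slice the id and treat everything after it as the location.
function decodeIdLoc(idLoc: string): { id: string; loc: string } {
  const colonIndex = idLoc.indexOf(":");
  const idLength = parseInt(idLoc.substring(0, colonIndex), 16); // 16 is the radix
  const idStart = colonIndex + 1;
  const idEnd = idStart + idLength;
  return {
    id: idLoc.substring(idStart, idEnd),
    loc: idLoc.substring(idEnd),
  };
}

const encoded = encodeIdLoc("my-monitor", "us-east"); // "a:my-monitorus-east"
const decoded = decodeIdLoc(encoded); // { id: "my-monitor", loc: "us-east" }
```

One benefit of a length prefix over a plain separator is that ids containing a colon still decode correctly, since only the first colon is used to find the length.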

String loc = idLoc.substring(idEnd, idLoc.length());
String status = timeStatus.substring(timeStatus.length() - 1);

locTotals.compute(loc, (k,v) -> {
justinkambic (Contributor):

A comment heading this block would be useful to a JavaScript developer 😅.

My understanding is we are updating the value for key loc, and the output of the provided function determines the new value. If the value was null, we create a new HashMap, then we increment appropriate values based on the documents we iterate over.

andrewvc (Contributor Author):

Yes, that's correct. I'll add a comment
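For JavaScript and TypeScript readers, the `compute` pattern described above can be approximated as below. This is a hedged sketch rather than the PR's code: `StatusTotals`, `compute`, and `tally` are hypothetical names, and this simplified `compute` always stores the result, whereas Java's `Map.compute` removes the entry when the function returns null.

```typescript
// Hypothetical per-location totals.
interface StatusTotals {
  up: number;
  down: number;
}

// Rough analogue of Java's Map.compute: the callback receives the key and the
// current value (undefined if absent), and its return value becomes the new value.
function compute<K, V>(
  map: Map<K, V>,
  key: K,
  remap: (key: K, current: V | undefined) => V
): V {
  const next = remap(key, map.get(key));
  map.set(key, next);
  return next;
}

// Create the totals for a location on first sight, then increment the matching counter.
function tally(locTotals: Map<string, StatusTotals>, loc: string, status: "up" | "down"): void {
  compute(locTotals, loc, (_loc, current) => {
    const totals = current ?? { up: 0, down: 0 };
    totals[status] += 1;
    return totals;
  });
}

const locTotals = new Map<string, StatusTotals>();
tally(locTotals, "us-east", "up");
tally(locTotals, "us-east", "down");
tally(locTotals, "eu-west", "up");
```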

counts[leastCommonStatus] = await slowStatusCount(context, leastCommonStatus);
counts[mostCommonStatus] = counts.total - counts[leastCommonStatus];
}
const counts = await statusCount(context);
justinkambic (Contributor):

Do you think it'd be better to name this function getStatusCount?

andrewvc (Contributor Author):

I'm not sure "get" carries any particular meaning, at least in my head, unless there's something to juxtapose it against.

justinkambic (Contributor):

That's fair

};
};

const slowStatusCount = async (context: QueryContext, status: string): Promise<number> => {
justinkambic (Contributor):

So now rather than having a fast/slow count, we're able to just have one counter (slower, but still fast, and always accurate), right?

andrewvc (Contributor Author):

Exactly

@andrewvc

@elasticmachine merge upstream


@justinkambic justinkambic left a comment

LGTM, WFG

@kibanamachine

💚 Build Succeeded

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@andrewvc andrewvc merged commit c11e866 into elastic:7.6 Feb 24, 2020
@andrewvc andrewvc deleted the scripted-metric-count branch February 24, 2020 17:45
andrewvc added a commit to andrewvc/kibana that referenced this pull request Feb 24, 2020
andrewvc added a commit that referenced this pull request Feb 24, 2020
andrewvc added a commit to andrewvc/kibana that referenced this pull request Feb 24, 2020
…elastic#58389)

jloleysens added a commit to jloleysens/kibana that referenced this pull request Feb 25, 2020
…re/files-and-filetree

* 'master' of github.com:elastic/kibana: (174 commits)
  [SIEM] Fix unnecessary re-renders on the Overview page (elastic#56587)
  Don't mutate error message (elastic#58452)
  Fix service map popover transaction duration (elastic#58422)
  [ML] Adding filebeat config to file dataviz (elastic#58152)
  [Uptime] Improve refresh handling when generating test data (elastic#58285)
  [Logs / Metrics UI] Remove path prefix from ViewSourceConfigur… (elastic#58238)
  [ML] Functional tests - adjust classification model memory (elastic#58445)
  [ML] Use event.timezone instead of beat.timezone in file upload (elastic#58447)
  [Logs UI] Unskip and stabilitize log column configuration tests (elastic#58392)
  [Telemetry] Separate the license retrieval from the stats in the usage collectors (elastic#57332)
  hide welcome screen for cloud (elastic#58371)
  Move src/legacy/ui/public/notify/app_redirect to kibana_legacy (elastic#58127)
  [ML] Functional tests - stabilize typing during df analytics creation (elastic#58227)
  fix short url in spaces (elastic#58313)
  [SIEM] Upgrades cypress to version 4.0.2 (elastic#58400)
  [Index management] Move to new platform "plugins" folder (elastic#58109)
  [kbn/optimizer] disable parallelization in terser plugin (elastic#58396)
  [Uptime] Delete useless try...catch blocks (elastic#58263)
  [Uptime] Use scripted metric for snapshot calculation (elastic#58247) (elastic#58389)
  [APM] Stabilize agent configuration API (elastic#57767)
  ...

# Conflicts:
#	src/plugins/console/public/application/containers/editor/legacy/console_editor/editor.tsx
elasticmachine added a commit to dhurley14/kibana that referenced this pull request Feb 25, 2020
…elastic#58389) (elastic#58415)


Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Labels
bug, release_note:fix, Team:Uptime - DEPRECATED, v7.6.1