[Alerting] Improving health status check #93282

Merged: ymao1 merged 6 commits into elastic:master on Mar 4, 2021

@ymao1 ymao1 commented Mar 2, 2021

Resolves #93062

Summary

Made several changes to the alerting health check:

  1. Added a share() to the combineLatest operator that combines the core status with the alerting health status. I was seeing duplicate observable streams being created from getHealthStatusStream (88!), each firing at a 5-minute interval. Maybe this many concurrent get requests to the task manager saved object were contributing to the 503 socket hangup errors?

  2. Moved catchError from the top level interval observable to within the switchMap. When catchError was at the top level, it would handle the error and complete the stream, which means once the alerting status became unavailable, it would stop polling for updated status and remain in an error state.

  3. Added a retryWhen operator which retries getting the status a few times before propagating the error status.
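
To make changes 2 and 3 concrete, here is a minimal sketch in plain TypeScript. It is a synchronous model of the behavior, not the RxJS code from this PR: a loop stands in for the interval, and a try/catch stands in for catchError. All names in it (scriptedLookup, lookupWithRetry, pollWithOuterHandler, pollWithInnerHandler) are hypothetical, and the real code waits between retries rather than retrying immediately.

```typescript
type Status = 'available' | 'unavailable';

// Hypothetical stand-in for one health lookup; it throws to simulate a failed request.
type Lookup = () => Status;

// Build a lookup that replays a scripted sequence: true means "throw on this call".
function scriptedLookup(shouldThrow: boolean[]): Lookup {
  let call = 0;
  return () => {
    const fail = call < shouldThrow.length && shouldThrow[call];
    call++;
    if (fail) {
      throw new Error('simulated 503');
    }
    return 'available';
  };
}

// Models retryWhen plus an inner catchError: retry a few times, then report
// 'unavailable' as a value instead of letting the error escape.
function lookupWithRetry(lookup: Lookup, maxRetries: number): Status {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return lookup();
    } catch (e) {
      // Swallow and try again; the real code would wait between attempts.
    }
  }
  return 'unavailable';
}

// Models catchError at the top level: the first error ends polling for good.
function pollWithOuterHandler(ticks: number, lookup: Lookup): Status[] {
  const emitted: Status[] = [];
  try {
    for (let i = 0; i < ticks; i++) {
      emitted.push(lookup());
    }
  } catch (e) {
    emitted.push('unavailable'); // last value this poller ever emits
  }
  return emitted;
}

// Models catchError inside switchMap: each tick handles its own failure,
// so the loop keeps polling afterwards.
function pollWithInnerHandler(ticks: number, lookup: Lookup, maxRetries: number): Status[] {
  const emitted: Status[] = [];
  for (let i = 0; i < ticks; i++) {
    emitted.push(lookupWithRetry(lookup, maxRetries));
  }
  return emitted;
}

const script = [false, true, true, true, false, false];
const outer = pollWithOuterHandler(4, scriptedLookup(script));
const inner = pollWithInnerHandler(4, scriptedLookup(script), 1);
console.log(outer.join(',')); // available,unavailable
console.log(inner.join(',')); // available,unavailable,available,available
```

With the same failure script, the outer-handler poller stops after its first error and stays on unavailable, while the inner-handler poller reports unavailable for one tick and then recovers.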

@ymao1 ymao1 self-assigned this Mar 2, 2021
@ymao1 ymao1 added the Feature:Alerting, Team:ResponseOps, v7.13.0, v8.0.0, and release_note:skip labels Mar 2, 2021
@ymao1 ymao1 marked this pull request as ready for review March 2, 2021 20:20
@ymao1 ymao1 requested a review from a team as a code owner March 2, 2021 20:20
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@ymao1
Contributor Author

ymao1 commented Mar 2, 2021

@elasticmachine merge upstream

Contributor

@YulNaumenko YulNaumenko left a comment

LGTM! Awesome improvements!

for (let i = 0; i < MAX_RETRY_ATTEMPTS + 1; i++) {
  await tick();
  jest.advanceTimersByTime(retryDelay);
}

Could we add an assertion that mockTaskManager.get was actually called MAX_RETRY_ATTEMPTS times?

Otherwise, in theory, this test won't catch anything if the stream never emits any values, as all expects are inside the subscription handler.
(this is from experience... I've missed a bug before because I made the exact same mistake) :)
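
The pitfall described above can be sketched in plain TypeScript without Jest or RxJS. This is an illustration under assumptions, not the PR's test code: subscribeTo is a hypothetical stand-in for subscribing to a stream, and the emission count plays the role that an assertion such as expect(mockTaskManager.get).toHaveBeenCalledTimes(...) would play in the real test.

```typescript
type Handler = (status: string) => void;

// Hypothetical stand-in for subscribing to a stream: the handler runs once per emitted value.
function subscribeTo(values: string[], handler: Handler): void {
  values.forEach(handler);
}

// Assertions live only inside the handler, so a stream that never emits passes silently.
function checkInsideHandlerOnly(values: string[]): void {
  subscribeTo(values, (status) => {
    if (status !== 'unavailable') {
      throw new Error(`unexpected status: ${status}`);
    }
  });
}

// Same check, plus a count that is asserted after the subscription.
function checkWithEmissionCount(values: string[], expected: number): void {
  let seen = 0;
  subscribeTo(values, (status) => {
    seen++;
    if (status !== 'unavailable') {
      throw new Error(`unexpected status: ${status}`);
    }
  });
  if (seen !== expected) {
    throw new Error(`expected ${expected} emissions, saw ${seen}`);
  }
}

// Tiny helper that reports whether a check completed without throwing.
function passes(check: () => void): boolean {
  try {
    check();
    return true;
  } catch (e) {
    return false;
  }
}

const vacuousPass = passes(() => checkInsideHandlerOnly([]));
const caught = !passes(() => checkWithEmissionCount([], 1));
const healthy = passes(() => checkWithEmissionCount(['unavailable'], 1));
console.log(vacuousPass, caught, healthy); // true true true
```

The first check passes even though nothing was emitted. The second fails on the empty stream and passes once a value arrives, which is the safety net the extra assertion is meant to provide.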

Comment on lines 71 to 77
interval(pollInterval)
  .pipe(
    switchMap(() =>
      getHealthServiceStatusWithRetryAndErrorHandling(mockTaskManager, retryDelay)
    )
  )
  .subscribe();

Is it worth using getHealthStatusStream directly here with the interval being an argument (we can use default value in the implementation)?

That way the unit tests ensure the composition is behaving as expected.
Just in case someone changes the switchMap in getHealthStatusStream in the future to something that behaves differently... 🤔

Contributor

@gmmorris gmmorris left a comment

LGTM from a logical perspective.
I haven't been able to test this locally because I'm not sure how to cause the failure case.

Any advice on whether that can be tested? 🤔

@ymao1
Contributor Author

ymao1 commented Mar 3, 2021

Any advice on whether that can be tested? 🤔

I did something like this in getLatestTaskState:

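// Call counter; renamed so it no longer clashes with the function name.
let callCount = 0;
// true = throw on that call; the rest of the sequence is elided.
const shouldThrowError = [false, false, true, true, true, false, false /* ... */];
async function getLatestTaskState(taskManager: TaskManagerStartContract) {
  // Advance the counter before throwing so a failing call does not pin the index.
  const shouldThrow = callCount < shouldThrowError.length && shouldThrowError[callCount];
  callCount++;
  if (shouldThrow) {
    throw new Error();
  }

  try {
    const result = await taskManager.get(HEALTH_TASK_ID);
    // ... remainder of the existing function elided
  }
}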

Also shortened the HEALTH_STATUS_INTERVAL and RETRY_DELAY so it wouldn't take so long to run.

I changed the sequence in shouldThrowError to test out successful retries and maxing out retry attempts and made sure that even after a maxed out retry that returned unavailable, the interval continued polling.

@pmuellr
Member

pmuellr commented Mar 3, 2021

I was seeing duplicate observable streams being created from getHealthStatusStream (88!), each firing at a 5 minute interval. Maybe it's possible that this many concurrent get requests to the task manager saved object was contributing to the 503 socket hangup errors?

Aha!

I was really curious how we were seeing 50x responses, since I thought these were auto-retried and such. It seems like it would make sense that if we had that many requests running, at least one of them could end up failing on each retry, and finally give up and return the 50x as the final reply.
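
As a rough illustration of why sharing the stream cuts the request count, here is a simplified model in plain TypeScript. It is not RxJS: ColdPoller and shared are hypothetical names, the shared helper ignores unsubscription and completion (which the real share() operator handles), and treating the 88 duplicate streams as 88 subscribers to one cold source is an assumption made for the sake of the example.

```typescript
type Observer<T> = (value: T) => void;

interface Source<T> {
  subscribe(observer: Observer<T>): void;
}

// Hypothetical cold source: every subscribe() starts an independent poller,
// so each tick issues one request per subscriber.
class ColdPoller implements Source<string> {
  requests = 0;
  private pollers: Array<() => void> = [];

  subscribe(observer: Observer<string>): void {
    this.pollers.push(() => {
      this.requests++; // one simulated task manager lookup per poller per tick
      observer('available');
    });
  }

  // Stand-in for the polling interval firing once.
  tick(): void {
    this.pollers.forEach((poll) => poll());
  }
}

// Simplified model of sharing: subscribe to the source once and fan each
// value out to every downstream observer.
function shared<T>(source: Source<T>): Source<T> {
  const observers: Observer<T>[] = [];
  let connected = false;
  return {
    subscribe(observer: Observer<T>): void {
      observers.push(observer);
      if (!connected) {
        connected = true;
        source.subscribe((value) => observers.forEach((o) => o(value)));
      }
    },
  };
}

const consumers = 88;

// Without sharing: 88 subscribers means 88 pollers.
const unshared = new ColdPoller();
for (let i = 0; i < consumers; i++) {
  unshared.subscribe(() => {});
}
unshared.tick();

// With sharing: 88 subscribers sit behind a single poller.
const upstream = new ColdPoller();
const sharedStream = shared(upstream);
let delivered = 0;
for (let i = 0; i < consumers; i++) {
  sharedStream.subscribe(() => {
    delivered++;
  });
}
upstream.tick();

console.log(unshared.requests, upstream.requests, delivered); // 88 1 88
```

In this model one tick costs 88 lookups without sharing and a single lookup with it, while every consumer still receives the status.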

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@ymao1 ymao1 added the auto-backport label Mar 4, 2021
@ymao1 ymao1 merged commit cad2653 into elastic:master Mar 4, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Mar 4, 2021
* wip

* Moving catchError so observable stream does not complete. Adding retry on failure

* Using retryWhen. Updating unit tests

* PR fixes

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@kibanamachine
Contributor

💚 Backport successful

7.x / #93639

Successful backport PRs will be merged automatically after passing CI.

gmmorris added a commit to gmmorris/kibana that referenced this pull request Mar 4, 2021
* master: (107 commits)
  [Logs UI] Fix log stream data fetching (elastic#93201)
  [App Search] Added relevance tuning search preview (elastic#93054)
  [Canvas] Fix reports embeddables (elastic#93482)
  [ILM] Added new functional test in ILM for creating a new policy (elastic#92936)
  Remove direct dependency on statehood package (elastic#93592)
  [Maps] Track tile loading status (elastic#91585)
  [Discover][Doc] Improve main documentation (elastic#91854)
  [Upgrade Assistant] Disable UA and add prompt (elastic#92834)
  [Snapshot Restore] Remove cloud validation for slm policy (elastic#93609)
  [Maps] Support GeometryCollections in GeoJson upload (elastic#93507)
  [XY Charts] fix partial histogram endzones annotations (elastic#93091)
  [Core] Simplify context typings (elastic#93579)
  [Alerting] Improving health status check (elastic#93282)
  [Discover] Restore context documentation (elastic#90784)
  [core-docs] Edits core-intro section for the new docs system (elastic#93540)
  add missed codeowners (elastic#89714)
  fetch node labels via script execution (elastic#93225)
  [Security Solution] Adds getMockTheme function (elastic#92840)
  Sort dependencies in package.json correctly (elastic#93590)
  [Bug] missing timepicker:quickRanges migration (elastic#93409)
  ...
kibanamachine added a commit that referenced this pull request Mar 4, 2021
* wip

* Moving catchError so observable stream does not complete. Adding retry on failure

* Using retryWhen. Updating unit tests

* PR fixes

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: ymao1 <ying.mao@elastic.co>
@ymao1 ymao1 deleted the alerting/health-check branch March 25, 2021 14:42
Labels
auto-backport, Feature:Alerting, release_note:skip, Team:ResponseOps, v7.13.0, v8.0.0
Development

Successfully merging this pull request may close these issues.

[alerting] sticky "red" status when getting a 50x error calculating alerting health status
6 participants