-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Alerting] Improving health status check #93282
Conversation
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Awesome improvements!
for (let i = 0; i < MAX_RETRY_ATTEMPTS + 1; i++) { | ||
await tick(); | ||
jest.advanceTimersByTime(retryDelay); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add an assertion that mockTaskManager.get
was actually called MAX_RETRY_ATTEMPTS
times?
Otherwise, in theory, this test. won't catch anything if the stream never emits any values as all expect
s are inside the subscription handler.
(this is from experience... I've missed a bug before because I made the exact same mistake) :)
interval(pollInterval) | ||
.pipe( | ||
switchMap(() => | ||
getHealthServiceStatusWithRetryAndErrorHandling(mockTaskManager, retryDelay) | ||
) | ||
) | ||
.subscribe(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth using getHealthStatusStream
directly here with the interval being an argument (we can use default value in the implementation)?
That way the unit tests ensure the composition is behaving as expected.
Just incase someone changes the switchMap
in getHealthStatusStream
in the future to something that behaved differently... 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from a logical perspective.
I haven't been able to test this locally because I'm not sure how to cause the failure case.
Any advice on whether that can be tested? 🤔
I did something like this in
Also shortened the I changed the sequence in |
Aha! I was really curious how we were seeing 50x responses, since I thought these were auto-retried and such. It seems like it would make sense that if we had that many requests running, at least one of them could end up failing on each retry, and finally give up and return the 50x as the final reply. |
💚 Build SucceededMetrics [docs]
History
To update your PR or re-run it, just comment with: cc @ymao1 |
* wip * Moving catchError so observable stream does not complete. Adding retry on failure * Using retryWhen. Updating unit tests * PR fixes Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
* master: (107 commits) [Logs UI] Fix log stream data fetching (elastic#93201) [App Search] Added relevance tuning search preview (elastic#93054) [Canvas] Fix reports embeddables (elastic#93482) [ILM] Added new functional test in ILM for creating a new policy (elastic#92936) Remove direct dependency on statehood package (elastic#93592) [Maps] Track tile loading status (elastic#91585) [Discover][Doc] Improve main documentation (elastic#91854) [Upgrade Assistant] Disable UA and add prompt (elastic#92834) [Snapshot Restore] Remove cloud validation for slm policy (elastic#93609) [Maps] Support GeometryCollections in GeoJson upload (elastic#93507) [XY Charts] fix partial histogram endzones annotations (elastic#93091) [Core] Simplify context typings (elastic#93579) [Alerting] Improving health status check (elastic#93282) [Discover] Restore context documentation (elastic#90784) [core-docs] Edits core-intro section for the new docs system (elastic#93540) add missed codeowners (elastic#89714) fetch node labels via script execution (elastic#93225) [Security Solution] Adds getMockTheme function (elastic#92840) Sort dependencies in package.json correctly (elastic#93590) [Bug] missing timepicker:quickRanges migration (elastic#93409) ...
* wip * Moving catchError so observable stream does not complete. Adding retry on failure * Using retryWhen. Updating unit tests * PR fixes Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: ymao1 <ying.mao@elastic.co>
Resolves #93062
Summary
Made several changes to the alerting health check:
Added a
share()
to thecombineLatest
operator that combines the the core status with the alerting health status. I was seeing duplicate observable streams being created fromgetHealthStatusStream
(88!), each firing at a 5 minute interval. Maybe it's possible that this many concurrentget
requests to the task manager saved object was contributing to the503 socket hangup
errors?Moved
catchError
from the top levelinterval
observable to within the switchMap. WhencatchError
was at the top level, it would handle the error and complete the stream, which means once the alerting status becameunavailable
, it would stop polling for updated status and remain in an error state.Added a
retryWhen
operator which retries getting the status a few times before propagating the error status.Checklist
Delete any items that are not applicable to this PR.