
fix(*): avoid creating multiple timers to run the same active check #156

Merged: 14 commits into release/1.6.x, May 14, 2024

Conversation

windmgc (Member) commented May 9, 2024

This is an alternative to #155.

The PR adds a shared dict key (a lock) indicating that an active check job (timer) has been created; the key is removed when the active check job finishes. This avoids multiple active jobs being created/running for the same active checker.

Alleviates FTI-5847
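
A minimal sketch of the locking pattern described above (identifiers here are illustrative, not the library's actual ones): `ngx.shared.DICT:add()` only succeeds when the key is absent, so at most one check job can be in flight per target.

```lua
-- Sketch of the shared-dict lock pattern (illustrative names, not the
-- library's actual identifiers). add() fails with "exists" when the key
-- is already set, so only one timer is created per target.
local shm = ngx.shared.healthcheck_locks  -- assumes this dict is declared in nginx.conf

local function schedule_active_check(target_key, check_fn)
  local lock_key = "active_check:" .. target_key
  local ok, err = shm:add(lock_key, true)
  if not ok then
    if err == "exists" then
      return true  -- a check for this target is already scheduled/running; skip
    end
    return nil, err
  end

  return ngx.timer.at(0, function(premature)
    if premature then
      shm:delete(lock_key)
      return
    end
    -- run the check inside pcall so the lock is released even on error
    local pok, perr = pcall(check_fn)
    if not pok then
      ngx.log(ngx.ERR, "active check failed: ", perr)
    end
    shm:delete(lock_key)  -- job finished; the next interval may schedule again
  end)
end
```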

windmgc changed the title from "fix(*): avoid creating multiple pending timers to run active check" to "fix(*): avoid creating multiple timers to run the same active check" on May 10, 2024
windmgc (Member, Author) commented May 10, 2024

The badssl.com website seems to have high latency, which causes the tests to fail intermittently. I added a longer timeout and raised the overall timeout of the test.
In the future we could consider replacing this with a local SSL server to eliminate the flakiness.

windmgc (Member, Author) commented May 10, 2024

Tested with Kong Gateway 3.4.3.5 with limited running timer concurrency (32); the fix succeeds in limiting the pending healthcheck jobs to a reasonable number.

windmgc requested review from locao and oowl on May 10, 2024 08:03
oowl previously approved these changes on May 10, 2024

oowl (Member) left a comment:

LGTM

windmgc (Member, Author) commented May 10, 2024

@oowl @locao I added another test that checks the number of running timers when active healthchecks are enabled. Without the fix, the running timer count reaches ~10 for a single target.
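
For reference, a rough sketch of the kind of observation such a test might make (not the exact test in this PR; the timer-count APIs are standard OpenResty ones, while the healthchecker options shown are assumptions):

```lua
-- Rough sketch: point a checker at an unreachable target and watch timer
-- counts (config values are illustrative, not the test's actual ones).
local healthcheck = require "resty.healthcheck"

local checker = assert(healthcheck.new({
  name = "testing",
  shm_name = "test_shm",  -- assumes this shared dict exists in nginx.conf
  checks = {
    active = {
      healthy   = { interval = 0.1 },
      unhealthy = { interval = 0.1 },
    },
  },
}))
assert(checker:add_target("127.0.0.1", 2112, nil, true))  -- nothing listens here

ngx.sleep(2)  -- let many check intervals elapse

-- Without the fix, running/pending timers accumulate (~10 for one target);
-- with the fix, the counts stay flat.
ngx.say("running timers: ", ngx.timer.running_count())
ngx.say("pending timers: ", ngx.timer.pending_count())
```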

Could you please help review it again? Thanks!

locao (Contributor) commented May 13, 2024

I have some concerns, but this change seems needed either way. Please note that I am writing this mostly from memory (which is not very good 😄).

Does this change fix the issue in the long run? I mean, if we have enough problematic targets, the available timers would be exhausted either way, wouldn't they?

Also, did you check where the execution is waiting forever/for a long time? I believe it's in the run_single_check function, right? I think we should find out why it's never timing out. If we still risk exhausting the timers with enough unhealthy targets, maybe we should add some kind of watchdog there to stop the check if it takes too long to finish.

What do you think?

windmgc (Member, Author) commented May 14, 2024

@locao Thank you for your review during busy hours!

Does this change fix the issue in the long run? I mean, if we have enough problematic targets, the available timers would be exhausted either way, wouldn't they?

No. The number of "pending" timers stays on the same order as the number of targets configured in the healthchecker; that is, the pending timer count grows with the number of targets, not multiplied by the interval. Before the change, if an active healthcheck could not finish within the interval, we would still create new jobs when the next interval came, which resulted in indefinite growth of pending active check jobs. After the change, it no longer grows indefinitely.

Also, did you check where the execution is waiting forever/for a long time? I believe it's in the run_single_check function, right? I think we should find out why it's never timing out. If we still risk exhausting the timers with enough unhealthy targets, maybe we should add some kind of watchdog there to stop the check if it takes too long to finish.

The execution of a single active healthcheck works as expected: it times out as configured. The real problem is that the time to reach a timeout (or a failure/success, whatever the result is) may be longer than the interval. We "produce" active check jobs (by creating timers) every interval, so the "consumption" speed of the already-created jobs cannot keep up, which results in indefinite growth and exhaustion of the timers.
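
To put hypothetical numbers on that mismatch:

```lua
-- Hypothetical numbers illustrating the producer/consumer mismatch:
local interval = 1    -- a new check job is produced every second
local duration = 10   -- an unhealthy target takes 10s to reach its timeout

local produce_rate = 1 / interval           -- 1.0 job/s
local consume_rate = 1 / duration           -- 0.1 job/s
local growth = produce_rate - consume_rate  -- 0.9 pending jobs/s, per target

-- After one minute, ~54 jobs are pending for a single target without the
-- lock; with the lock, at most one job exists per target at any time.
```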

The final result of this "timer exhaustion" behaves differently depending on whether we use the native timer or timer-ng. With native timers we may hit "max running timers are not enough" errors; with timer-ng, if the concurrency limit is not high enough, a huge number of pending jobs accumulates inside timer-ng's data structures and memory usage balloons.

locao (Contributor) commented May 14, 2024

@windmgc thanks for the detailed reply!

The pending timer count grows with the number of targets, not multiplied by the interval.

So if we have more problematic targets than available timers, we would exhaust them either way. TBH, the scenario I mentioned is far less likely to happen than what we had before your fix, and maybe not even fixable without a big overhaul of the library.

This is a big improvement, well done!

locao (Contributor) left a comment:

Nice work!

locao merged commit fca7ad2 into release/1.6.x on May 14, 2024
7 checks passed
windmgc added a commit that referenced this pull request May 16, 2024
…157)

* fix(*): avoid creating multiple timers to run the same active check (#156)

* docs(changelog): add changelog