Avoid Request Locks With Task Lag #2244

Merged: 9 commits merged into master on Dec 1, 2021
Conversation

@WH77 (Contributor) commented Nov 16, 2021

This PR prevents request locks from being acquired by lower-priority, possibly long-running components while the request has late pending tasks, so that higher-priority components like the offer scheduler can acquire them. It does not preempt lower-priority components, so lag for a request may still build up while they're holding the lock.

List of components using the request lock:

  • SingularityMesosStatusUpdateHandler
  • SingularityStartup
  • DeployResource
  • SingularityAutoScaleSpreadAllPoller
  • SingularityCleaner (lb/requests/tasks)
  • SingularityCrashLoopChecker
  • SingularityExpiringUserActionPoller
  • SingularityScheduler
  • SingularitySchedulerPoller
  • SingularityUpstreamChecker
  • SingularityDeployHistoryPersister
  • SingularityRequestHistoryPersister
  • SingularityMesosOfferScheduler

Somewhat concerned about this change introducing difficult bugs, because previously lock acquisition would wait for the lock forever instead of short-circuiting. Not sure if it would be better to throw an error instead of logging and returning.
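For reference, a minimal sketch of the gating behavior described above. All names here (RequestLockGuard, tryRunWithRequestLock, lateTasksByRequestId) are illustrative assumptions, not the actual Singularity classes touched by this PR:

```java
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: hypothetical names, not the classes changed by this PR.
public class RequestLockGuard {
  private static final Logger LOG = LoggerFactory.getLogger(RequestLockGuard.class);

  private final Map<String, Long> lateTasksByRequestId; // requestId -> late-task marker, refreshed elsewhere

  public RequestLockGuard(Map<String, Long> lateTasksByRequestId) {
    this.lateTasksByRequestId = lateTasksByRequestId;
  }

  /**
   * Low-priority callers skip the lock (log + return false) when the request has late
   * pending tasks, leaving it free for high-priority components like the offer scheduler.
   */
  public boolean tryRunWithRequestLock(String requestId, boolean highPriority, ReentrantLock lock, Runnable action) {
    if (!highPriority && lateTasksByRequestId.containsKey(requestId)) {
      LOG.debug("Skipping low-priority lock acquisition for {} due to task lag", requestId);
      return false; // short-circuit instead of waiting on the lock forever
    }
    lock.lock();
    try {
      action.run();
      return true;
    } finally {
      lock.unlock();
    }
  }
}
```

Returning false rather than throwing matches the "log and return" option mentioned above; callers then have to tolerate an occasionally skipped iteration.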

cc - @ssalinas, @pschoenfelder

@ssalinas (Member) left a comment

I'm with you on being worried about these pollers getting stuck for too long. What if each poller also kept track of how long it had been since it last ran, and, if it's been too long, grabbed the high-priority lock instead? "Too long" here could be in the 10m+ range for things like persisters, maybe less for some of the others.
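A rough sketch of that escalation idea (hypothetical names and thresholds; nothing here is from the actual diff):

```java
// Hypothetical sketch of "escalate to the high-priority lock after being starved too long".
public class PollerPriorityTracker {
  private final long maxLowPriorityWaitMillis; // e.g. 10 minutes for persisters, less for other pollers
  private volatile long lastSuccessfulRunMillis = System.currentTimeMillis();

  public PollerPriorityTracker(long maxLowPriorityWaitMillis) {
    this.maxLowPriorityWaitMillis = maxLowPriorityWaitMillis;
  }

  /** Normally low priority, but escalate if the poller hasn't managed to run recently. */
  public boolean shouldUseHighPriorityLock() {
    return System.currentTimeMillis() - lastSuccessfulRunMillis > maxLowPriorityWaitMillis;
  }

  /** Call after each successful poll so the starvation clock resets. */
  public void markSuccessfulRun() {
    lastSuccessfulRunMillis = System.currentTimeMillis();
  }
}
```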


private void updateLateTasksByRequestId() {
  long now = System.currentTimeMillis();
  if (now - lastUpdate > 1000L * configuration.getRequestCacheTtl()) {
@ssalinas (Member):

Do we need to care about request cache ttl at all here? Shouldn't this be running on the leader only?

@WH77 (Contributor, Author):

This was mostly me not wanting to add another configuration option, and thinking that request cache ttl was an appropriate value for this.

Comment on lines 30 to 31:
    updateLateTasksByRequestId();
    return lateTasksByRequestId.containsKey(requestId);
@ssalinas (Member):

I'm a bit worried about this being called in a super tight loop. I know it's mostly going to no-op on the if statement. But does it make sense to be updating this on a thread that is trying to grab a lock vs. just doing it in the background? Then we can be sure every check is just a map.containsKey and never has to worry about heavier calculations on an active path.

@WH77 (Contributor, Author):

My original thinking was that a scheduled thread would be overkill for this, but you're right that this is going to be a hot method and that I should add a background thread specifically for this.
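A sketch of what that background refresh could look like (hypothetical class; it assumes the late-task data can be pulled through some supplier, which is not necessarily how the PR loads it):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: keep the late-task map warm on a scheduled thread so the lock
// acquisition path only ever does a ConcurrentHashMap#containsKey lookup.
public class LateTaskCache {
  private final Map<String, Long> lateTasksByRequestId = new ConcurrentHashMap<>();
  private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

  public LateTaskCache(Supplier<Map<String, Long>> lateTaskLoader, long refreshSeconds) {
    executor.scheduleAtFixedRate(
      () -> {
        Map<String, Long> latest = lateTaskLoader.get();
        lateTasksByRequestId.keySet().retainAll(latest.keySet()); // drop requests that caught up
        lateTasksByRequestId.putAll(latest);
      },
      0,
      refreshSeconds,
      TimeUnit.SECONDS
    );
  }

  /** Hot path: no heavier calculation, just a map lookup. */
  public boolean hasLateTasks(String requestId) {
    return lateTasksByRequestId.containsKey(requestId);
  }
}
```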

@WH77 (Contributor, Author) commented Nov 19, 2021

@ssalinas - I opted to allow low-priority lock requests to go through if the request lock isn't currently locked, to avoid pollers getting stuck due to task lag that is unrelated to request lock usage.

Do you think that I should still force them to become high priority after enough delay, or is this change enough?
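A sketch of that check (hypothetical helper; LateTaskCache is the illustrative cache from the sketch above, and the per-request lock is assumed here to be a ReentrantLock):

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a low-priority caller only defers to task lag when the request lock
// is actually contended; an uncontended lock is taken normally even if the request is lagging.
public class LowPriorityLockGate {
  public static boolean shouldSkip(String requestId, ReentrantLock lock, LateTaskCache lateTasks) {
    boolean currentlyContended = lock.isLocked() || lock.hasQueuedThreads();
    return currentlyContended && lateTasks.hasLateTasks(requestId);
  }
}
```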

@pschoenfelder (Contributor):

Looks good. I think your suggestion about allowing low priority to run when initially unlocked makes sense.

I may have missed some convo on this last week — has anything been discussed about Steve's suggestion to have each poller determine priority based on time since last run? Not necessary, just curious. Maybe a good future improvement?

@ssalinas (Member) left a comment:

🚢

@WH77 merged commit 8e4a548 into master on Dec 1, 2021
@WH77 deleted the task-lag-guardrail branch on December 1, 2021 19:19
@ssalinas added this to the 1.5.0 milestone on May 4, 2022