Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race on timer in util.queue. #1285

Merged
merged 1 commit into from
Jul 9, 2024

Conversation

michelle192837
Copy link
Collaborator

We're hitting an issue occasionally where Tabulator will stall out, which seems to be caused by the queue getting stuck during 'sleep()', likely because the timer channel has already been pulled from in the second select case, and the wait on receiving a value from the timer channel after timer.Stop() fails will block indefinitely.

Instead, ensure that timer.Stop() is definitely called, but don't bother to drain or close the channel. (Note that timer.Stop() doesn't stop the channel, but it should be cleaned up due to being out of scope once sleep() completes).

We're hitting an issue occasionally where Tabulator will stall out,
which seems to be caused by the queue getting stuck during 'sleep()',
likely because the timer channel has already been pulled from in the
second select case, and the wait on receiving a value from the timer
channel after timer.Stop() fails will block indefinitely.

Instead, ensure that timer.Stop() is definitely called, but don't bother
to drain or close the channel. (Note that timer.Stop() doesn't stop the
channel, but it should be cleaned up due to being out of scope once
sleep() completes).
start := time.Now()
select {
case <-q.signal:
if !sleep.Stop() {
<-sleep.C
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Based on the documentation for Timer.Stop this will block forever if the timer already fired.

It returns ... false if the timer has already expired or been stopped.

This seems like a pretty rare race since the select statement also watches the timer, but if q.sleep() is called when there is already a signal pending this would happen reliably.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the logs are not logging every occurence of this since they stick to INFO level, but I suspect given the frequency of updates that sleep and rouse are both happening quite frequently in production. (I'm guessing steady increase in traffic has made this more likely over time, so we're seeing it more often recently).

Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cjwagner, michelle192837

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 1192f6a into GoogleCloudPlatform:main Jul 9, 2024
5 checks passed
@michelle192837 michelle192837 deleted the queuefix branch July 9, 2024 18:34
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants