Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear pending tasks in the worker when the context is canceled to avoid deadlocks in StopAndWait when tasks are queued for the worker. #62

Merged
merged 6 commits into from
Jun 17, 2024

Conversation

CorentinClabaut
Copy link
Collaborator

closes #61

…id deadlocks in StopAndWait when tasks are queued for the worker.
@alitto
Copy link
Owner

alitto commented Jun 8, 2024

Hey @CorentinClabaut, thanks for submitting this PR 🙌
I noticed the github actions workflow was specifying an unsopported version of go (1.15) so I pushed a fix to the master branch 3f439a7
Do you mind rebasing this PR so that we can retry the failed pipelines 🙂?

@CorentinClabaut
Copy link
Collaborator Author

Hey @alitto no problem for the PR :)
I've just merged master, let me know if something else needs to be done.

@alitto
Copy link
Owner

alitto commented Jun 10, 2024

It seems a test is failing due to a race condition 🤔 .
Apparently, the race condition appears in this line

  github.com/alitto/pond.(*WorkerPool).stop.func1()
      /home/runner/work/pond/pond/pond.go:358 +0x44

I wonder if that's related to moving the closing of the tasks channel up. Will continue digging when i get a chance

@CorentinClabaut
Copy link
Collaborator Author

I just saw this error as well

=== RUN   TestSubmitWithContextCancelWithIdleTasks
    pond_blackbox_test.go:582: Expected int32(1) but was int32(2)
--- FAIL: TestSubmitWithContextCancelWithIdleTasks (0.00s)

This one seems to be due to

select {
		case <-context.Done():
                   ...
		case task, ok := <-tasks:
                   ...
		}

Here if both channels contain something any of the case could be triggered (https://stackoverflow.com/questions/46200343/force-priority-of-go-select-statement)

I can push a fix for this and see if it fixes the race condition as well.

@CorentinClabaut
Copy link
Collaborator Author

Hey @alitto it should be all good now.
You were right, it was due to moving the closing of the tasks up

@CorentinClabaut
Copy link
Collaborator Author

Hey @alitto

The issue is now:

=== RUN   TestPurgeDuringSubmit
    pond_test.go:62: Expected int(1) but was int(0)
--- FAIL: TestPurgeDuringSubmit (0.00s)

I'm not sure how this issue could be triggered by this PR though.

Could it be that in some situations the purge might reset idleWorkerCount before we do the check in the test?

Copy link
Owner

@alitto alitto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving some comments, looks good overall 👍

worker.go Show resolved Hide resolved
pond.go Show resolved Hide resolved
@alitto
Copy link
Owner

alitto commented Jun 12, 2024

Hey @alitto

The issue is now:

=== RUN   TestPurgeDuringSubmit
    pond_test.go:62: Expected int(1) but was int(0)
--- FAIL: TestPurgeDuringSubmit (0.00s)

I'm not sure how this issue could be triggered by this PR though.

Could it be that in some situations the purge might reset idleWorkerCount before we do the check in the test?

Mhm, I think the idleWorkerCount counter might be slower to update when running in Github actions, I have the feeling i've seen this behavior before.
Adding an extra sleep after submitting the first task should help i think:

// Submit a task to ensure at least 1 worker is started
pool.SubmitAndWait(func() {
    atomic.AddInt32(&doneCount, 1)
})

// Ensure idle worker count is updated
time.Sleep(1 * time.Millisecond)

assertEqual(t, 1, pool.IdleWorkers())

@CorentinClabaut
Copy link
Collaborator Author

CorentinClabaut commented Jun 17, 2024

Hey @alitto
The issue is now:

=== RUN   TestPurgeDuringSubmit
    pond_test.go:62: Expected int(1) but was int(0)
--- FAIL: TestPurgeDuringSubmit (0.00s)

I'm not sure how this issue could be triggered by this PR though.
Could it be that in some situations the purge might reset idleWorkerCount before we do the check in the test?

Mhm, I think the idleWorkerCount counter might be slower to update when running in Github actions, I have the feeling i've seen this behavior before. Adding an extra sleep after submitting the first task should help i think:

// Submit a task to ensure at least 1 worker is started
pool.SubmitAndWait(func() {
    atomic.AddInt32(&doneCount, 1)
})

// Ensure idle worker count is updated
time.Sleep(1 * time.Millisecond)

assertEqual(t, 1, pool.IdleWorkers())

That makes sense, I can see that it's what you have done in TestSubmitToIdle I'll add it here and in another test that seems to need it to make sure this issue doesn't reappear later

Copy link
Owner

@alitto alitto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @CorentinClabaut! I will take care of fixing the codecov action which is missing a token apparently and then also publish a new version of pond with your changes 🙂

@alitto alitto merged commit 8097a00 into alitto:master Jun 17, 2024
17 of 18 checks passed
@CorentinClabaut
Copy link
Collaborator Author

Thank you for the merge @alitto :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock occurs when the pool parent context is canceled with queued tasks in the worker.
2 participants