Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hang in KqueueExecutorScheduler #65

Merged
merged 6 commits into from
Sep 18, 2022

Conversation

LeeTibbert
Copy link
Collaborator

Fix #64
KqueueExecutorScheduler#poll no longer hangs when there is no work to do.
The fix was actually to check for the existence of callbacks, rather than of
new events.

Manually tested on macOS M1, Monterrey 12.6 with test-cond-in-development
for Issue #52. I will be exercising this fix as I continue to work on Issue #52.

Comment on lines 68 to 71
if (!noCallbacks) {
val timeoutSpec =
if (timeoutIsInfinite || timeoutIsZero) null
else {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if there are no callbacks, but there is 0 < timeout < infinity? In that case poll should still sleep for the requested time. This is not necessary for correctness, but it is important for performance (if we don't sleep, we are busy-looping). That is why the test suite currently doesn't catch this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the timely review and tag-team development.

re: 0 < timeout < infinity & busy looping

Isn't prevention of busy looping the responsibility of the caller? I do not know
that scheduler code. What is the usually timeout? Zero, something else.

The hang that I saw was large for a scheduler. I know I let it run two or
three minutes whilst I did code fixes in another window. In a debug
branch I can put some code in to see if it is 0 (which can mean either
return immediately (do not wait), or wait-forever) or a largish number.

I thought that poll() was supposed to minimize wait and return to
the caller, so the caller could schedule something else.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, perhaps the documented contract for poll needs to be clarified.

I thought that poll() was supposed to minimize wait and return to
the caller, so the caller could schedule something else.

When the caller invokes poll(...) with a timeout of x seconds, it means that the caller has no work scheduled for x seconds and thus it is fully delegating to poll(...). If poll returns before timeout (and no new work was scheduled due to an I/O event), then the caller will simply invoke poll(...) again for the remaining time.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That info helps. The poll is being called with INFINITE timeout, 2 events change, and 1 callback.
There was a open & close which came and went before poll() got called. I suspect the math
is not working out and an extra event-change is sneaking in.

I am going to move this to WIP, do more tracing, and read the URL you posted.
I'll post the provoking code in the issue.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I believe your general approach is correct, we just need to adjust the conditional to account for the case when 0 < timeout < infinity in which case it should proceed with the kevent64() call.

@LeeTibbert LeeTibbert changed the title Fix #64: Remove a KqExec poll hang WIP: Fix #64: Remove a KqExec poll hang Sep 16, 2022
@armanbilge
Copy link
Owner

I was able to replicate this in a very simple test. Give it a try, and if it reproduces the issue for you as well, I propose we add it to the test suite.

  test("immediately closing a socket does not hang") {
    IOSocketChannel.open.use_
  }

@LeeTibbert
Copy link
Collaborator Author

Skipping over for the moment, but not ignoring. Please see next reply.

when 0 < timeout < infinity in which case it should proceed with the kevent64() call.

Thank you for the succinct test. I added it to TcpSuite.scala and hand tested
on all four cells of (Linux x macOS) x (JVM x Native). As expected macOS Native
failed with this code and mainline kexec. I then manually ran in the context of this WIP
and all four cells passed. CI confirms that.

I had gotten 90% of the way towards submitting the TcpSuite.scala now in this
WIP to mainline when I realized that it would hang mainline CI beyond the 20 second
test timeout, Duh!

@LeeTibbert
Copy link
Collaborator Author

when 0 < timeout < infinity in which case it should proceed with the kevent64() call.

To synchronize:

Should I proceed with this WIP and add the code which you suggested, still as a WIP?

Should I put this on the backburner, or out to feed the dog because another approach
is better or because we should just plain wait for experience & better understanding?

Please advise. I am invested in overall progress, not any one particular WIP/PR.

Given what you have said, I think I can work around this for the "IPv6 initialization" work
by "Just DON'T do that!".

I will read the "event-loop algorithm" URL, hopefully this weekend.

@armanbilge
Copy link
Owner

This is a legitimate issue, I'd like to fix it and get a patch out. If you don't mind I'll take the steering wheel on this one since it's interacting with the event loop. Do you mind if I push to your branch or shall I open a new PR?

@armanbilge
Copy link
Owner

Btw thanks for manually testing against the matrix, that was really helpful. Always good to have a test in place.

@LeeTibbert
Copy link
Collaborator Author

This is a legitimate issue, I'd like to fix it and get a patch out. If you don't mind I'll take the steering wheel
on this one since it's interacting with the event loop.

Understood & agreed, you have the conn (phew!). Time for the A team.

I will end this recursive edit and return to "IPv6 initialization".

Do you mind if I push to your branch or shall I open a new PR?

Since this is WIP and scratch work, I will stop pushing to it and hand it over to you.
That should make your startup easier.

We probably concur that the final submission probably wants to be a "clean" PR.

@LeeTibbert
Copy link
Collaborator Author

A "grasshopper" question, if you know of the top of your head. Not worth you spending time else-wise.
Thank you.

Where do I find the documentation for ".use" and ".use_". I have been searching the "cats effect 3.0" and "fs2"
documentation with no success. I spent a few hours figuring that the run through the documentation would
teach me things.

You had given me the location for ".async_". I stumbled across it today in plain sight and bold as brass in
the "Cats Effect" 3.3 documentation.

@LeeTibbert
Copy link
Collaborator Author

I added the test for ServerSocket because it causes a listen() call which might/complicate
the paths. This was before I read you friendly "please pass-the-baton" message.

@armanbilge
Copy link
Owner

Where do I find the documentation for ".use" and ".use_". I have been searching the "cats effect 3.0" and "fs2"
documentation with no success. I spent a few hours figuring that the run through the documentation would
teach me things.

They are methods on Resource. They are somewhat equivalent to the Java "try-with-resources" or Scala Using.

https://github.com/typelevel/cats-effect/blob/f12739b6af9173e8ca8ddaa35fb66f48bedde44f/kernel/shared/src/main/scala/cats/effect/kernel/Resource.scala#L197-L207

https://github.com/typelevel/cats-effect/blob/f12739b6af9173e8ca8ddaa35fb66f48bedde44f/kernel/shared/src/main/scala/cats/effect/kernel/Resource.scala#L226-L229


I added the test for ServerSocket because it causes a listen() call which might/complicate
the paths. This was before I read you friendly "please pass-the-baton" message.

No problem, great addition!


We probably concur that the final submission probably wants to be a "clean" PR.

Eh, I don't really care. I think it's good to preserve work-product so we can retrace our steps more easily.

@LeeTibbert
Copy link
Collaborator Author

re: kernel/Resource.scala URLs

Thank you. That is what I needed to get unstuck. Helps a number of pieces
(including evalTap()) fall into place. That file seems very well commented
(and the basis for other doc), so I will have to trace and study it. Homework on
a weekend, grumble (but not as much of a grumble as not knowing what
I am tracing ;-))

@armanbilge armanbilge changed the title WIP: Fix #64: Remove a KqExec poll hang Fix hang in KqueueExecutorScheduler Sep 18, 2022
Copy link
Owner

@armanbilge armanbilge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😏

@armanbilge armanbilge merged commit b461ee1 into armanbilge:main Sep 18, 2022
@LeeTibbert
Copy link
Collaborator Author

Arman, Thank you for sorting this out, and so quickly.

@armanbilge
Copy link
Owner

Thank you for catching this bug and your initial investigation! I'm glad we could get a patch out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KqueueExecutorSchedule#poll should not hang forever if no work to do
2 participants