Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Win32 event loop #11776

Closed
wants to merge 1 commit into from
Closed

Conversation

straight-shoota
Copy link
Member

This patch adds an implementation of the event loop for windows.
It seems to be working in general, but there is some bug which I haven't figured out yet. I've had this branch sitting around for a long time but have not been able to finalize it. I'm posting it here to maybe get a set of fresh eyes on it. Maybe someone else would be able to spot what's going on.

More detailed history and some debug output is available in the branch on my fork: https://github.com/straight-shoota/crystal/tree/feature/win32-event_loop-wip

This is the example debug output from tcp_socket_spec. Execution stops while printing a fiber's inspect. The reason seems to be that the fiber value is invalid. It is boxed in OverlappedOperation#@fiber for passing around to the Win32 API. For some reason, it returns an invalid value.

enqueing #<Fiber:0x273bd827ea0: main>
@runnables # => Deque{}
run_once                                
fiber # => #<Fiber:0x273bd827ea0: main>
timed_out  # => false
sleep_time # => 00:00:00.998693000
@runnables # => Deque{#<Fiber:0x273bd827ea0: main>}
runnable # => #<Fiber:0x273bd827ea0: main>
resuming #<Fiber:0x273bd827ea0: main>
enqueing #<Fiber:0x273bd827ea0: main>
@runnables # => Deque{}
run_once
fiber # => #<Fiber:0x273bd827ea0: main>
timed_out  # => false
sleep_time # => 106751991167300.00:11:44.377945499
@runnables # => Deque{#<Fiber:0x273bd827ea0: main>}
runnable # => #<Fiber:0x273bd827ea0: main>
resuming #<Fiber:0x273bd827ea0: main>
.@runnables # => Deque{}
@runnables # => Deque{}
@runnables # => Deque{}
run_once
@runnables # => Deque{#<Fiber:0x273bd827ea0: main>}
runnable # => #<Fiber:0x273bd827ea0: main>
enqueing #<Fiber:0x273bd827ea0: main>
@runnables # => Deque{}
run_once
timed_out  # => true
sleep_time # => 00:00:00.999389200
@runnables # => Deque{#<Fiber:0x273bd827ea0: main>}
runnable # => #<Fiber:0x273bd827ea0: main>
resuming #<Fiber:0x273bd827ea0: main>
closing, Error connecting:ConnectEx: Überlappendes E/A-Ereignis befindet sich nicht in einem signalisierten Zustand.
.@runnables # => Deque{}
@runnables # => Deque{}
@runnables # => Deque{}
run_once
@runnables # => Deque{#<Fiber:0x273bd827ea0: main>}
runnable # => #<Fiber:0x273bd827ea0: main>
.@runnables # => Deque{}
@runnables # => Deque{}
@runnables # => Deque{}
run_once
@runnables # => Deque{#<Fiber:0x273bd827ea0: main>}
runnable # => #<Fiber:0x273bd827ea0: main>
closing, Error connecting:ConnectEx: Die angeforderte Adresse ist in diesem Kontext ungültig.
.@runnables # => Deque{}
@runnables # => Deque{}
@runnables # => Deque{}
run_once
@runnables # => Deque{#<Fiber:0x273bd827ea0: main>}
runnable # => #<Fiber:0x273bd827ea0: main>
enqueing #<Fiber:0x273bd827ea0: main>
@runnables # => Deque{}
run_once
E@runnables # => Deque{}
@runnables # => Deque{}
@runnables # => Deque{}
run_once
@runnables # => Deque{#<Fiber:0x273bd827ea0: main>}
runnable # => #<Fiber:0x273bd827ea0: main>
enqueing #<Fiber:0x273bd827ea0: main>
@runnables # => Deque{}
run_once
fiber # => #<Fiber:0x60000001c

@straight-shoota straight-shoota added kind:feature topic:stdlib:concurrency platform:windows Windows support based on the MSVC toolchain / Win32 API labels Jan 27, 2022
Copy link
Contributor

@yxhuvud yxhuvud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I've noticed can create invalid fibers is if a fiber is enqueued that has already run to completion. Are you properly cleaning up all pending events?


Crystal::Scheduler.reschedule

Crystal::EventLoop.dequeue(timeout_event)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't run_once already dequeue? If not, seems weird that it dequeues sometimes, but not always.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_once deques when a timeout happens. This operations handles the case when there's no timeout, i.e. the operation completes in time and the fiber resumes. Then we need to dequeue the timeout event.

overlapped_entries = uninitialized LibC::OVERLAPPED_ENTRY[1]

if timeout > UInt64::MAX
timeout = LibC::INFINITE
Copy link
Contributor

@yxhuvud yxhuvud Jan 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not relevant right now probably as this is the single threaded version of crystal (right?), but at some point this would probably would be clamped downwards towards a short duration rather than towards infinity. This because otherwise if a fiber is woken up from waiting on a channel, then if the fiber is assigned to a thread that is in an infinite wait won't process the woken fiber until the completion return. Alternatively there could be some way to way to wake a specific waiter. I've seen a way to break out of the completion wait but only for any arbitrary waiter when i looked up how to do Thread#notify on win32, not for a specific thread. But perhaps there are some better way.

end

# Frees the event
def free : Nil
Crystal::EventLoop.dequeue(self)
end

def delete
free
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here is another dequeue. From where is this called?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's called from Fiber#cancel_timeout which is used for the select implementation.

@HertzDevil
Copy link
Contributor

I could get all the specs passing by allocating IO::Overlapped::OverlappedOperation on the heap instead of the stack:

diff --git a/src/io/overlapped.cr b/src/io/overlapped.cr
index 0d0ea2cc0..da4fb6c83 100644
--- a/src/io/overlapped.cr
+++ b/src/io/overlapped.cr
@@ -81,7 +81,6 @@ module IO::Overlapped
     false
   end
 
-  @[Extern]
   struct OverlappedOperation
     @overlapped : LibC::WSAOVERLAPPED
     @fiber : Void*
@@ -98,12 +97,12 @@ module IO::Overlapped
 
   def create_operation
     overlapped = LibC::WSAOVERLAPPED.new
-    OverlappedOperation.new(overlapped, Fiber.current)
+    Box.new(OverlappedOperation.new(overlapped, Fiber.current))
   end
 
   def get_overlapped_result(socket, operation)
     flags = 0_u32
-    result = LibC.WSAGetOverlappedResult(socket, pointerof(operation).as(LibC::OVERLAPPED*), out bytes, false, pointerof(flags))
+    result = LibC.WSAGetOverlappedResult(socket, pointerof(operation.object.@overlapped), out bytes, false, pointerof(flags))
     if result.zero?
       error = WinError.wsa_value
       yield error
@@ -132,7 +131,7 @@ module IO::Overlapped
   def overlapped_operation(socket, method, timeout, connreset_is_error = true)
     operation = create_operation
 
-    result = yield pointerof(operation).as(LibC::OVERLAPPED*)
+    result = yield pointerof(operation.object.@overlapped)
 
     if result == LibC::SOCKET_ERROR
       error = WinError.wsa_value
@@ -157,9 +156,9 @@ module IO::Overlapped
   def overlapped_connect(socket, method)
     operation = create_operation
 
-    yield pointerof(operation).as(LibC::OVERLAPPED*)
+    yield pointerof(operation.object.@overlapped)
 
-    schedule_overlapped(read_timeout || 1.seconds)
+    schedule_overlapped(read_timeout || 5.seconds)
 
     get_overlapped_result(socket, operation) do |error|
       case error
@@ -177,7 +176,7 @@ module IO::Overlapped
   def overlapped_accept(socket, method)
     operation = create_operation
 
-    yield pointerof(operation).as(LibC::OVERLAPPED*)
+    yield pointerof(operation.object.@overlapped)
 
     unless schedule_overlapped(read_timeout)
       raise IO::TimeoutError.new("accept timed out")

Though I don't know if making IO::Overlapped::OverlappedOperation an extern struct has any significance.

@HertzDevil
Copy link
Contributor

HertzDevil commented Apr 3, 2022

Now if I run the specs with crystal spec spec\std\channel_spec.cr spec\std\socket\tcp_socket_spec.cr --verbose -Ddebug_raise I get this instead:

TCPSocket
  #connect
    using IPv4
      connects to server
      raises when connection is refused
Attempting to raise: 
BUG: Fiber was awaken without channel delivery state set (Exception)

For some reason the stack trace doesn't show here, hence the -Ddebug_raise.

@wonderix
Copy link
Contributor

I just wanted to share some insights that I gained during debugging this PR.

LibC::OVERLAPPED

On the Heap

LibC::OVERLAPPED needs to be allocated on the stack because calls to LibC.GetQueuedCompletionStatusEx might return references to these structures after the current fiber left the scope of these variables (for example after leaving overlapped_operation). Even cancelling the overlapped operation with CancelIoEx does not prevent this.

On the Stack

Even allocating LibC::OVERLAPPED on the stack might be not enough. If the garbage collector runs after the current fiber left overlapped_operation and before LibC.GetQueuedCompletionStatusEx it might happen that the LibC::OVERLAPPED gets deleted but it's still required by LibC.GetQueuedCompletionStatusEx. I don't think that the GC is aware of objects in the kernel space. Perhaps I'm wrong with this.

If the current fiber is leaving overlapped_operation it must also be ensured that it will be not scheduled by this overlapped operation. In the case where a call to LibC.GetQueuedCompletionStatusEx returns a LibC::OVERLAPPED which is already out of scope, yielding of the fiber must be disabled.

Proposal

See the latest commit on my branch win32-event_loop which is based on feature/win32-event_loop-wip with a rebase to master. I introduced a global hash OVERLAPPED_OPERATIONS which stores all fibers associated with an overlapped operation. If a fiber leaves the scope of the overlapped operation, it's removed from this hash. If the overlapped operation was not still processed by wait_queued_completions, the operation is canceled.

wait_queued_completions checks if the fiber is still contained in OVERLAPPED_OPERATIONS. Only in this case, the fiber is yielded.

For multi threaded environments OVERLAPPED_OPERATIONS should be placed in the current thread context.

Dead fibers

After these changes, I had a lot of problems to run crystal spec spec\std\channel_spec.cr spec\std\socket\tcp_socket_spec.cr --verbose. At the end it turned out that dead fibers are still queued in Crystal::EventLoop. After adding a getter for @alive to Fiber and using this getter to filter events in Crystal::EventLoop#run_once I got it to work. I'm quite sure that this is not the correct way. But the @timeout_event in the fiber was not set during deletion. So I took the quick path.

Hopefully my insights will be helpful in moving the event implementation on Windows forward.

@yxhuvud
Copy link
Contributor

yxhuvud commented Jun 19, 2022

@wonderix Good to see someone looking into this. However, the whole point of overlapped_operation is that it doesn't return until the runtime has gotten back the async answer from the event loop. Together with the queue in the event loop it should solve all the reference counting your PR is adding.

So for example after leaving overlapped_operation simply shouldn't happen (unless there is a bug or a very fundamental misunderstanding in how the underlying event loop works). What happens is that it eventually calls reschedule and that will park the fiber until a response from the event loop wakes it up again.

Dead fibers being queued up sounds like the runtime is trying to wake up a fiber that has already been woken up. That absolutely must never happen. My guess are timeouts (or sleeps?) - are perhaps not all fibers removed from the event loop timeout queue when they are activated? If not, then that would mean they would get woken up again once the timeout hits, and that would trigger all described symptoms. Figuring out where this happens can be really challenging though.

@wonderix
Copy link
Contributor

wonderix commented Jun 20, 2022

However, the whole point of overlapped_operation is that it doesn't return until the runtime has gotten back the async answer from the event loop

The normal case looks like this:

But I'm quite sure that overlapped_operation will return in case of a timeout.

That's the point where things get complicated. In this case the LibC::OVERLAPPED structure must live until a call to LibC.GetQueuedCompletionStatusEx returns the structure. Even calling CancelIoEx will change this. This picture also illustrates why LibC::Overlapped and the suspended fiber should not be packaged into a single structure: the required lifetimes are different.

In the mean time, I realized that my implementation is also not save in this point. Perhaps it's necessary to allocate LibC::OVERLAPPED using GC#malloc_atomic and free it with GC#free after LibC.GetQueuedCompletionStatusEx returns it.

The dead fiber problem is probably caused by my rebase on master. I will try to describe this in more detail. The fiber is started in channel_spec.cr with a timeout condition timeout_select_action(0.1.seconds). The fiber runs to it's end as usual (Fiber#delete is called) but it's not removed from Crystal::EventLoop#@@queue.

@wonderix
Copy link
Contributor

wonderix commented Jun 21, 2022

I had a look at the ruby code. They are cancelling requests using CancelIo which is the synchronous variant of CancelIoEx. But it turned out, that it does not work like expected.

After this, I used a different approach, which is closer to the original PR. OverlappedOperation is now a class which is allocated on the heap. If the operation times out and cancellation is required the OverlappedOperation is added to a global list to ensure, that it's not freed by the garbage collector. If the cancellation is finished (the corresponding OVERLAPPED structure is returned from LibC.GetQueuedCompletionStatusEx) the OverlappedOperation is removed from the list.

I'm now quite happy with this approach. The only thing, which needs to be fixed is the fiber.dead? call in Crystal::EventLoop#run_once

The changes are available as last commit on win32-event-loop-2. Unfortunately, I'm not able to run all specs on windows.

> .\bin\crystal build  --verbose spec/win32_std_spec.cr  
Error: while requiring "./std/llvm/aarch64_spec.cr"

The 2 specs mentioned in the comments are working fine crystal spec spec\std\channel_spec.cr spec\std\socket\tcp_socket_spec.cr.

@straight-shoota, how should we proceed? Would you like to cherry-pick my changes? Should I open a PR including your changes?

@straight-shoota
Copy link
Member Author

That's awesome! Great to see this basically resolved. We might still need some enhancements, but now we have a working condition to start from.

At this point I'd prefer to open a fresh PR.

@asterite
Copy link
Member

asterite commented Jun 21, 2022

At this point I'd prefer to open a fresh PR

Just to avoid confusion, it's not clear who should open the PR. (or maybe yes? Does the "I'd" means you are going to do it, straight-shoota?)

@straight-shoota
Copy link
Member Author

Good point 😆
I don't mind either way. But I think it's probably better that I take care of getting it ready to merge.

@wonderix
Copy link
Contributor

That's absolutely OK for me.

@straight-shoota straight-shoota deleted the feature/win32-event_loop branch June 17, 2024 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:feature platform:windows Windows support based on the MSVC toolchain / Win32 API topic:stdlib:concurrency
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants