StackOverflow when multiple Managed Resources are being cleaned up at the same time #10211

Closed
radeusgd opened this issue Jun 7, 2024 · 8 comments · Fixed by #10602


radeusgd commented Jun 7, 2024

Try running the following script:

from Standard.Base import all
import Standard.Base.Runtime.Managed_Resource.Managed_Resource
import Standard.Base.Runtime.Ref.Ref

type My_Resource
    Value counter:Ref

    close self =
        self.counter.modify (x-> x-1)
        Nothing

    allocate counter:Ref =
        counter.modify (+1)
        Managed_Resource.register (My_Resource.Value counter) close_resource

close_resource resource = resource.close

repeat_cleanup_until_done counter =
    go i =
        if counter.get == 0 then Nothing else
            if i % 100 == 0 then
                IO.println "Still "+counter.get.to_text+" resources to clean up..."
            Runtime.gc
            @Tail_Call go i+1
    go 1

main =
    n = 10
    counter = Ref.new 0
    IO.println "Allocating resources..."
    0.up_to n . each _->
        My_Resource.allocate counter
    
    IO.println "Cleaning up..."
    repeat_cleanup_until_done counter
    IO.println "All cleaned up! "+counter.get.to_text

With n = 10 it will happily allocate and then clean up resources:

Allocating resources...
Cleaning up...
All cleaned up! 0

Now, try changing n to 10000:

-    n = 10
+    n = 10000

and running it again.

I'm consistently getting a StackOverflow failure:

Allocating resources...
Cleaning up...
Execution finished with an error: Resource exhausted: Stack overflow

radeusgd commented Jun 7, 2024

As mentioned on Discord, I think the problem is that the finalizer of a resource runs Enso code, which polls safepoints. At such a safepoint, another pending finalizer can be scheduled and started. When there are lots of pending finalizers, each one ends up running inside the previous one, creating a cascade of finalizers stacked on top of one another that grows the stack until it overflows.

Instead, we should ensure that only one finalizer runs at a time. The finalizer's code should probably still poll safepoints (for all the other purposes), but once a finalizer has been entered, no other finalizer should start inside it; it should instead be enqueued and run once the first finalizer finishes.
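
To illustrate the proposed scheme, here is a minimal plain-Java sketch. The class and all of its names are invented for illustration; it deliberately ignores Truffle safepoints and the real ResourceManager API and only shows the invariant: a finalizer requested while another one is already running is enqueued and drained in a flat loop rather than invoked on top of the current stack.

import java.util.ArrayDeque;
import java.util.Queue;

final class SerializedFinalizers {
    private final Queue<Runnable> pending = new ArrayDeque<>();
    private boolean draining = false; // true while some finalizer is running

    // Called whenever a resource becomes eligible for finalization.
    void submit(Runnable finalizer) {
        synchronized (this) {
            pending.add(finalizer);
            if (draining) {
                return; // a finalizer is already running; it will pick this one up
            }
            draining = true;
        }
        drain();
    }

    private void drain() {
        for (;;) {
            Runnable next;
            synchronized (this) {
                next = pending.poll();
                if (next == null) {
                    draining = false;
                    return;
                }
            }
            next.run(); // may call submit() again; the guard keeps the stack flat
        }
    }

    public static void main(String[] args) {
        var finalizers = new SerializedFinalizers();
        // A finalizer that triggers another one "from inside", the way a
        // safepoint poll would; with the guard it is queued, not nested.
        finalizers.submit(() -> {
            System.out.println("finalizer 1 runs");
            finalizers.submit(() -> System.out.println("finalizer 2 runs after 1, not inside it"));
        });
    }
}

However the real fix is implemented, the key property is the same: no matter how many resources become collectable at once, the depth of the finalizer call stack stays constant.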


radeusgd commented Jun 7, 2024

Instead of the stack overflow above, I'm sometimes getting the following error:

Allocating resources...
Cleaning up...
Execution finished with an error: java.lang.NoClassDefFoundError: Could not initialize class com.oracle.truffle.api.interop.InteropLibraryGen$Default$Uncached
        at <java> org.graalvm.truffle/com.oracle.truffle.api.interop.InteropLibraryGen$Default.createUncached(InteropLibraryGen.java:555)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.interop.InteropLibraryGen$Default.createUncached(InteropLibraryGen.java:546)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.library.LibraryFactory.getUncachedSlowPath(LibraryFactory.java:413)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.library.LibraryFactory.getUncached(LibraryFactory.java:404)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.interop.InteropLibraryGen$UncachedDispatch.isException(InteropLibraryGen.java:7594)
        at <java> org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotThreadLocalActions$AbstractTLHandshake.accept(PolyglotThreadLocalActions.java:609)
        at <java> org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotThreadLocalActions$AbstractTLHandshake.accept(PolyglotThreadLocalActions.java:546)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.impl.ThreadLocalHandshake$Handshake.perform(ThreadLocalHandshake.java:219)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.processHandshakes(ThreadLocalHandshake.java:368)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.impl.ThreadLocalHandshake.processHandshake(ThreadLocalHandshake.java:159)
        at <java> org.graalvm.truffle.runtime/com.oracle.truffle.runtime.hotspot.HotSpotThreadLocalHandshake.poll(HotSpotThreadLocalHandshake.java:79)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.TruffleSafepoint.poll(TruffleSafepoint.java:155)
        at <java> org.enso.runtime/org.enso.interpreter.node.ClosureRootNode.execute(ClosureRootNode.java:83)
        at <enso> resource-overflow.repeat_cleanup_until_done.go<arg-0>(Internal)
        at <enso> resource-overflow.repeat_cleanup_until_done.go<arg-2>(resource-overflow.enso:24:13-29)
        at <enso> resource-overflow.repeat_cleanup_until_done.go(resource-overflow.enso:20-24)
        at <enso> resource-overflow.main(resource-overflow.enso:35:5-37)

JaroslavTulach self-assigned this Jun 7, 2024
mergify bot pushed a commit that referenced this issue Jun 10, 2024
- Closes #10148
- [x] Tests for `Restartable_Input_Stream`, `peek_bytes` and `skip_n_bytes`.
- [x] Report `Managed_Resource` stack overflow bug: #10211
- [x] Followup possible optimization: #10220
- [x] Test use-case from blog.

Akirathan commented Jun 10, 2024

Possibly related StackOverflowError in Table_Tests in https://github.com/enso-org/enso/actions/runs/9451385966/job/26032192518?pr=10192#step:7:1712. I could not reproduce that one locally.

JaroslavTulach moved this from ❓New to 📤 Backlog in Issues Board Jul 9, 2024
JaroslavTulach moved this from 📤 Backlog to ⚙️ Design in Issues Board Jul 19, 2024

JaroslavTulach commented Jul 19, 2024

Here is a patch that fixes the program demonstrating the issue:

diff --git engine/runtime/src/main/java/org/enso/interpreter/runtime/ResourceManager.java engine/runtime/src/main/java/org/enso/interpreter/runtime/ResourceManager.java
index a2304b06d1..24b7f0a4f5 100644
--- engine/runtime/src/main/java/org/enso/interpreter/runtime/ResourceManager.java
+++ engine/runtime/src/main/java/org/enso/interpreter/runtime/ResourceManager.java
@@ -6,20 +6,20 @@ import com.oracle.truffle.api.interop.InteropLibrary;
 import java.lang.ref.PhantomReference;
 import java.lang.ref.Reference;
 import java.lang.ref.ReferenceQueue;
+import java.util.ArrayList;
+import java.util.List;
 import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.ConcurrentMap;
-import java.util.concurrent.Future;
 import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.concurrent.atomic.AtomicInteger;
-import java.util.concurrent.atomic.AtomicReference;
 import org.enso.interpreter.runtime.data.ManagedResource;
 
 /** Allows the context to attach garbage collection hooks on the removal of certain objects. */
-public class ResourceManager {
+public final class ResourceManager {
   private final EnsoContext context;
   private volatile boolean isClosed = false;
   private volatile Thread workerThread;
-  private final Runner worker = new Runner();
+  private final ProcessItems worker = new ProcessItems(false, false, false);
   private final ReferenceQueue<ManagedResource> referenceQueue = new ReferenceQueue<>();
   private final ConcurrentMap<PhantomReference<ManagedResource>, Item> items =
       new ConcurrentHashMap<>();
@@ -98,21 +98,8 @@ public class ResourceManager {
         // no further attempts are made.
         boolean continueFinalizing = it.isFlaggedForFinalization().compareAndSet(true, false);
         if (continueFinalizing) {
-          var futureToCancel = new AtomicReference<Future<Void>>(null);
-          var performFinalizeNow =
-              new ThreadLocalAction(false, false, true) {
-                @Override
-                protected void perform(ThreadLocalAction.Access access) {
-                  var tmp = futureToCancel.getAndSet(null);
-                  if (tmp == null) {
-                    return;
-                  }
-                  tmp.cancel(false);
-                  it.finalizeNow(context);
-                  items.remove(it);
-                }
-              };
-          futureToCancel.set(context.submitThreadLocal(null, performFinalizeNow));
+            it.finalizeNow(context);
+            items.remove(it);
         }
       }
     }
@@ -174,12 +161,57 @@ public class ResourceManager {
   }
 
   /**
-   * The worker action for the underlying logic of this module. At least one such thread must be
-   * spawned in order for this module to be operational.
+   * Processes {@link Item}s eligible for GC. Plays two roles. First of
+   * all cleans {@link #referenceQueue} in {@link #run()} method running in
+   * its own thread. Then it invokes finalizers in {@link #perform} method
+   * inside of Enso execution context.
    */
-  private class Runner implements Runnable {
+  private final class ProcessItems extends ThreadLocalAction implements Runnable {
+    /** @GuardedBy("pendingItems") */
+    private final List<Item> pendingItems = new ArrayList<>();
     private volatile boolean killed = false;
 
+    ProcessItems(boolean hasSideEffects, boolean synchronous, boolean recurring) {
+        super(hasSideEffects, synchronous, recurring);
+    }
+
+    /**
+     * Runs at a safe point in middle of regular Enso program execution.
+     * Gathers all available {@link #pendingItems} and runs their finalizers.
+     * Removes all processed items from {@link #pendingItems}. If there are
+     * any remaining, continues processing them. Otherwise finishes.
+     * 
+     * @param access not used for anything
+     */
+    @Override
+    protected void perform(ThreadLocalAction.Access access) {
+        for (;;) {
+            Item[] toProcess;
+            synchronized (pendingItems) {
+                toProcess = pendingItems.toArray(Item[]::new);
+            }
+            try {
+                for (var it : toProcess) {
+                    scheduleFinalizationAtSafepoint(it);
+                }
+            } finally {
+                synchronized (pendingItems) {
+                    pendingItems.subList(0, toProcess.length).clear();
+                    if (pendingItems.isEmpty()) {
+                        return;
+                    }
+                    // continue processing meanwhile added pendingItems
+                }
+            }
+        }
+    }
+
+    /**
+     * Running in its own thread. Waiting for {@link #referenceQueue} to
+     * be populated with GCed items. Scheduling {@link #perform}
+     * action at safe points while passing the {@link Item}s to it via
+     * {@link #pendingItems}.
+     */
     @Override
     public void run() {
       while (true) {
@@ -188,7 +220,12 @@ public class ResourceManager {
           if (!killed) {
             if (ref instanceof Item it) {
               it.isFlaggedForFinalization().set(true);
-              scheduleFinalizationAtSafepoint(it);
+                synchronized (pendingItems) {
+                    if (pendingItems.isEmpty()) {
+                        context.submitThreadLocal(null, this);
+                    }
+                    pendingItems.add(it);
+                }
             }
           }
           if (killed) {

With this change, the above program finishes without errors even with n = 1000000.
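
The essential concurrency pattern of perform above can also be isolated into a standalone sketch (the class and method names here are hypothetical, and the drain runs synchronously instead of being submitted as a thread-local action): snapshot the pending items under the lock, run them outside of it, then remove exactly the processed prefix and loop in case more items arrived in the meantime.

import java.util.ArrayList;
import java.util.List;

final class BatchDrain {
    private final List<Runnable> pendingItems = new ArrayList<>();

    // Producer side: enqueue work; the first item of a batch triggers a drain.
    // In the real patch this trigger is context.submitThreadLocal(null, this),
    // so the drain happens at a safepoint on an Enso execution thread.
    void enqueue(Runnable item) {
        boolean startDrain;
        synchronized (pendingItems) {
            startDrain = pendingItems.isEmpty();
            pendingItems.add(item);
        }
        if (startDrain) {
            drain();
        }
    }

    private void drain() {
        for (;;) {
            Runnable[] toProcess;
            synchronized (pendingItems) {
                toProcess = pendingItems.toArray(Runnable[]::new);
            }
            try {
                for (var item : toProcess) {
                    item.run();
                }
            } finally {
                synchronized (pendingItems) {
                    // Drop only what this pass processed; later arrivals stay queued.
                    pendingItems.subList(0, toProcess.length).clear();
                    if (pendingItems.isEmpty()) {
                        return;
                    }
                }
            }
        }
    }
}

The subList(0, toProcess.length).clear() call is the subtle part: producers only ever append, so clearing the processed prefix removes exactly the snapshot that was just run, while items added in the meantime survive for the next iteration of the loop.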

JaroslavTulach moved this from ⚙️ Design to 👁️ Code review in Issues Board Jul 19, 2024
mergify bot closed this as completed in #10602 Jul 22, 2024
mergify bot closed this as completed in b6bbfc5 Jul 22, 2024
github-project-automation bot moved this from 👁️ Code review to 🟢 Accepted in Issues Board Jul 22, 2024
enso-bot bot commented Jul 23, 2024

Jaroslav Tulach reports a new STANDUP for yesterday (2024-07-22):

Progress: - Analyzing JNA usage: #10440 (comment)

enso-bot bot commented Jul 24, 2024

Jaroslav Tulach reports a new STANDUP for yesterday (2024-07-23):

Progress: - planning & discussions. It should be finished by 2024-07-23.
