refactor: replace TaskMaster with Reactor #24

Open
wants to merge 15 commits into
base: develop

pollend
Member

@pollend pollend commented Jun 27, 2021

What this PR does

  • replace one usage of TaskMaster with GameScheduler, creating and running JPSImpl on a separate thread
  • remove the other usage of TaskMaster (via TimeLimiter logic)
    • NOTE: We're not entirely clear yet about the performance impact of this, as path searches no longer time out after 3 seconds.

Test Plan

Do the following once for each of the FlexiblePathfinding#develop and FlexiblePathfinding#refactor/rework-concurrency-Reactivex branches:

  1. Start CoreGameplay world with added WildAnimals module and deps
  2. Spawn 42 deer using the in-game console command spawnPrefab deer
  3. Open in-game analytics (F3) and check FPS

To Do

  • test plan to check for potential performance impact in searching for valid paths

PR: (MovingBlocks/Terasology#4798, MovingBlocks/Terasology#4799)

@keturn keturn changed the title refactor: rework concurrency with reactiveX refactor: replace TaskMaster with Reactor Nov 5, 2021
Contributor

@keturn keturn left a comment

I think that once we resolve the concern I've noted with JPSImpl.timeLimiter, this is a fine stepping stone on the Reactor path. It'll let us move on with removing the TaskMaster class while not disturbing any more than it needs to.

If anyone is actually maintaining this module, they'll want to do another pass later that replaces the old callback-based interfaces with Reactor-compatible asynchronous functions.

Comment on lines 67 to 69
boolean result = Boolean.TRUE.equals(Mono.fromCallable(this::performSearch)
        .subscribeOn(GameScheduler.parallel())
        .block(Duration.ofMillis((long) (config.maxTime * 1000.0f))));
Contributor

We want to be very careful about when we call block, right?

This looks weird, because if I understand correctly, this JPSImpl#run is called by PathfinderSystem#processPath, which was already sent to run on GameScheduler.parallel() by PathfinderSystem#requestPath.

So this is executing on the parallel scheduler, and it adds a new thing to that scheduler, and then blocks one of that scheduler's threads—that sounds like a good way to get a logjam, requiring later tasks to complete before earlier ones yield.
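The logjam described above can be reproduced without Reactor. The following is a minimal sketch using a plain `java.util.concurrent` thread pool as a stand-in for the parallel scheduler; the class name `SameSchedulerLogjam` and the method `nestedSearchCompletes` are hypothetical and do not exist in the module. An outer task (standing in for `JPSImpl#run`) queues an inner task (standing in for `performSearch`) on the pool it is already running on and then waits for it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; not part of FlexiblePathfinding.
final class SameSchedulerLogjam {

    /**
     * Runs an outer task on the pool. The outer task submits an inner task to the
     * same pool and waits up to timeoutMillis for it.
     * Returns true if the inner task finished in time, false if the wait timed out.
     */
    static boolean nestedSearchCompletes(ExecutorService pool, long timeoutMillis) {
        Callable<Boolean> outer = () -> {
            // Stand-in for performSearch, queued on the pool the caller already occupies.
            Future<Boolean> inner = pool.submit(() -> true);
            try {
                // The waiting thread is one of the pool's own workers.
                return inner.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                inner.cancel(true);
                return false;
            }
        };
        try {
            return pool.submit(outer).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ExecutorService oneWorker = Executors.newFixedThreadPool(1);
        ExecutorService twoWorkers = Executors.newFixedThreadPool(2);
        try {
            System.out.println(nestedSearchCompletes(oneWorker, 200));
            System.out.println(nestedSearchCompletes(twoWorkers, 2000));
        } finally {
            oneWorker.shutdownNow();
            twoWorkers.shutdownNow();
        }
    }
}
```

With a single worker, the inner task cannot start until the outer one returns, so the wait can only end by timing out. With a spare worker the inner task gets a thread and the wait succeeds. A pool with more threads only postpones the problem: once every worker is occupied by a waiting outer task, the same stall would be expected.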

It also looks like the old implementation only did that timeLimiter stuff if config.executor was non-null. Can we get away with dropping that feature?

Member Author

Let's do that in another PR. This is just migrating the existing code over.

Contributor

Oh, I found where it set config.executor: it was in PathfinderSystem.requestPath. It looks like it did have a second task queue for that purpose. That means this replacement where it's using the same scheduler for both is not equivalent.
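To make the difference concrete, here is a rough sketch of the shape the old behaviour appears to have had, based only on what is said in this thread: the time limit applies only when `config.executor` is set, and the search then runs on that separate executor. The class `TimeLimitedSearch` and its nested `Config` are hypothetical stand-ins for `JPSImpl` and `JPSConfig`, and a plain `Future.get` with a timeout is swapped in for the module's TimeLimiter logic.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not the module's actual JPSImpl.
final class TimeLimitedSearch {

    /** Hypothetical stand-in for JPSConfig, reduced to the two fields discussed here. */
    static final class Config {
        ExecutorService executor; // null means "search on the calling thread, no time limit"
        float maxTime;            // seconds, matching config.maxTime * 1000.0f in the diff
    }

    /** Returns the search result, or false if the time limit expired first. */
    static boolean run(Config config, Callable<Boolean> search) {
        try {
            if (config.executor == null) {
                // No dedicated queue: run inline, without any timeout.
                return search.call();
            }
            // Dedicated queue: the caller's thread waits, but the search itself
            // occupies a worker of a different pool, so it can always start.
            Future<Boolean> pending = config.executor.submit(search);
            try {
                long limitMillis = (long) (config.maxTime * 1000.0f);
                return pending.get(limitMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                pending.cancel(true); // interrupt the abandoned search
                return false;
            }
        } catch (Exception e) {
            throw new IllegalStateException("search failed", e);
        }
    }
}
```

The point of the second executor is that waiting on one pool for work queued on another cannot starve the work being waited for, which is what the single-scheduler replacement gives up.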

Member

I reverted the removal of the TimeLimiter usage as I don't understand why it is necessary for replacing TaskMaster with Reactor.
Also, I believe I'm using a new thread for the code path without the time limiter now, so the issue @skaldarnar reported below should be gone. At least when testing this in CoreGameplay + WildAnimals by spawning a deer, I didn't get any exceptions as far as I can see.

pollend and others added 2 commits November 7, 2021 09:43
…ncurrency-Reactivex

# Conflicts:
#	src/main/java/org/terasology/flexiblepathfinding/JPSImpl.java
#	src/main/java/org/terasology/flexiblepathfinding/PathfinderSystem.java
#	src/main/java/org/terasology/flexiblepathfinding/PathfinderTask.java
#	src/main/java/org/terasology/flexiblepathfinding/ShutdownTask.java
@keturn
Contributor

keturn commented Jan 21, 2022

@skaldarnar
Contributor

Did a quick test run after the migration with the WildAnimal:deer and critter behavior.

22:33:57.853 [main] DEBUG o.t.m.behaviors.actions.LogAction - Actor 228205: ---- critter ----
22:33:57.853 [main] DEBUG o.t.m.behaviors.actions.LogAction - Actor 228205: in stray Behavior
22:33:57.853 [main] DEBUG o.t.m.behaviors.actions.LogAction - Actor 228205: in doRandomMove Behavior
22:33:57.853 [main] DEBUG o.t.m.b.a.SetTargetToNearbyBlockNode - Actor 228205: in set_target_nearby_block Action
22:33:57.853 [main] DEBUG o.t.m.b.a.SetTargetToNearbyBlockNode - ... [228205]  start position: (-1.500E+1  3.000E+1 -1.000E+0)
22:33:57.853 [main] DEBUG o.t.m.b.a.SetTargetToNearbyBlockNode - ... [228205] target position: (-1.500E+1  3.000E+1 -1.000E+0) - distance: 0.0
22:33:57.853 [main] DEBUG o.t.m.behaviors.actions.LogAction - Actor 228205: in naiveMoveTo Behavior
22:33:57.853 [main] DEBUG o.t.m.b.actions.FindPathToNode - Actor 228205: construct find_path Action
22:33:57.853 [main] DEBUG o.t.m.b.actions.FindPathToNode - ... [228205]: compute path between (-1.500E+1  3.000E+1 -1.000E+0) -> (-1.500E+1  3.000E+1 -1.000E+0)
22:33:57.853 [main] DEBUG o.t.m.b.actions.FindPathToNode - Actor 228205: in find_path Action
22:33:57.853 [main] DEBUG o.t.m.b.actions.FindPathToNode - ... [228205]: ... still searching for path
22:33:57.853 [parallel-15] ERROR o.t.flexiblepathfinding.JPSImpl - Timeout of 10.0
java.lang.IllegalStateException: block()/blockFirst()/blockLast() are blocking, which is not supported in thread parallel-15
        at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:117)
        at reactor.core.publisher.Mono.block(Mono.java:1730)
        at org.terasology.flexiblepathfinding.JPSImpl.run(JPSImpl.java:69)
        at org.terasology.flexiblepathfinding.PathfinderSystem.processPath(PathfinderSystem.java:79)
        at org.terasology.flexiblepathfinding.PathfinderSystem.lambda$requestPath$0(PathfinderSystem.java:98)
        at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
        at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
        at io.micrometer.core.instrument.internal.TimedCallable.call(TimedCallable.java:46)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

The log about a "Timeout of 10.0" is misleading here (see correction in #26).

@keturn
Contributor

keturn commented Jan 25, 2022

That looks like what I was worried about.

@jdrueckert jdrueckert requested review from keturn, jdrueckert, DarkWeird and skaldarnar and removed request for jdrueckert October 23, 2022 16:54
@jdrueckert jdrueckert self-assigned this Oct 23, 2022
@keturn
Contributor

keturn commented Oct 23, 2022

Gave it a test run under JS.

The good news:

  • it runs
  • I see deer
  • the deer move (if I dig them out from the ground they seem to be stuck in)
  • the test suite passes

The bad news:

  • it's good at using > 80% of all CPUs when I'm just standing still and there are no chunks generating. When there are chunks generating, it's unplayable.
  • it ran out of memory and crashed in just under 15 minutes.

Point for comparison: In Core Gameplay, CPU avg across all CPUs is more like 10%, and memory usage is stable.

but I guess for PR purposes, the relevant point for comparison is the state of the same game configuration with and without the PR, not compared to Core Gameplay. 🤔

@keturn
Contributor

keturn commented Oct 23, 2022

More testing:

  • develop did not have the high-CPU-while-idle after the chunks were done generating
  • then I tried loading the save from that game under this branch, and it seemed not-so-bad
  • wondering if it was some kind of "new spawns vs loading from save" thing, I ran with --create-last-game
    • but wild animals spawns are not reproducible, so there weren't deer in sight, so I wasn't able to compare 😩

@jdrueckert
Member

@keturn Regarding spawns, you can always use spawnPrefab deer from the in-game console.

public int requestPath(JPSConfig config, PathfinderCallback callback) {
    if (config.requester != null && config.requester.exists()) {
        if (entitiesWithPendingTasks.contains(config.requester)) {
            return -1;
        }
        entitiesWithPendingTasks.add(config.requester);
    }

    if (config.executor == null) {
        config.executor = workerTaskMaster.getExecutorService();
Contributor

The question I have left on this review is: If this assignment of config.executor goes away, is there anything left using config.executor at all?

and if not, does that mean we never use TimeLimiter?

Member

Probably that's the case (although we should double-check), yes.
Question for me is, if we don't use it, do we create a negative impact on performance? And if so, how can we model the timeout logic using Reactor / GameScheduler?
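One possible shape for a timeout that does not block a worker thread is sketched below. It uses the JDK's `CompletableFuture.orTimeout` (Java 9+) as a stand-in, not Reactor or GameScheduler; in Reactor the closest operator is probably `Mono.timeout(Duration)`, but that variant is not shown or tested here. The class `NonBlockingSearchTimeout` and the method `searchWithTimeout` are hypothetical names.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch; not part of FlexiblePathfinding.
final class NonBlockingSearchTimeout {

    /**
     * Starts the search on the given executor and returns immediately.
     * The returned future completes with the search result, or with false if the
     * search fails or does not finish within the limit.
     */
    static CompletableFuture<Boolean> searchWithTimeout(
            Supplier<Boolean> search, Executor executor, Duration limit) {
        return CompletableFuture.supplyAsync(search, executor)
                // Fails the future with a TimeoutException once the limit passes.
                // Note: this does not stop the search; it keeps running on its
                // worker until it returns on its own.
                .orTimeout(limit.toMillis(), TimeUnit.MILLISECONDS)
                // Treat a timeout (or any other failure) as "no path found".
                .exceptionally(error -> false);
    }
}
```

A caller would attach a continuation (for example `thenAccept`) to hand the result to the existing callback instead of waiting for it. Because an abandoned search keeps its worker busy, a real replacement would likely also need the search loop to check a cancellation flag or deadline.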

@jdrueckert
Member

As discussed in today's contributor meeting: let's note down a test plan to check for potential performance impact in searching for valid paths and remove the time limiter logic with a comment to reintroduce it or remodel it with Reactor in case of issues.

- unclear performance impact without timeout logic
- if performance issues arise, reintroduce or reimplement with Reactor
…tivex' into refactor/rework-concurrency-Reactivex
@jdrueckert
Member

So my first performance test plan was simply to spawn 42 deer in a CoreGameplay + WildAnimals module combo and look at the FPS...

Good news: With the current FlexiblePathfinding develop, I have 60 FPS
Bad news: With this branch checked out, I have 1.7 FPS - and no, the decimal point in there is no accident...

5 participants