
Add virtual threads support #224

Merged: 35 commits into line:master on Mar 6, 2024

Conversation

kawamuray
Member

@kawamuray kawamuray commented Jan 17, 2024

This PR introduces Virtual Thread support, which we expect to have very high affinity with the typical workload implemented on Decaton (I/O heavy).
We currently provide DeferredCompletion (async processing) to support DecatonProcessor implementations that leverage the asynchronous paradigm, but it has been a source of completion leaks and consequent consumption stalls; virtual threads could be a perfect solution that eliminates this whole class of problems.

As this involves a lot of changes to the existing code base, here is the (likely complete) list of changes made besides the newly added virtual thread support:

  • Logback version upgrade - necessary to cope with vthreads
  • Gradle upgrade - necessary to build/run with Java 21 or higher
  • Use of synchronized => ReentrantLock - vthreads and synchronized do not work well together (carrier-thread pinning)
  • Merge TaskMetrics and ProcessMetrics into PerPartitionMetrics for better organization
  • Deprecate the AsyncShutdownable interface and add AsyncClosable as a refactoring
  • PartitionProcessor is now called SubPartitions, which is an interface with possibly many implementations. Not intending to let users implement this interface, though.
  • Add a --latency-count option to the benchmark. This simulates multiple I/Os during the processing of a task, to count context-switching costs in.
  • The benchmark's DecatonProcessor scope has been changed from THREAD to PROVIDED for a fair result on the VIRTUAL_THREAD runtime
  • Benchmark runs now use a fixed heap size (8 GB), the -server option, and the -Xcomp option to minimize impact from JIT compilation
  • Apply async-profiler twice per benchmark, to avoid the real-run measurement being interfered with by deoptimization caused by loading the JVMTI agent
  • Remove all micrometer meter instantiations in ProcessorUnit, as profiling found them to be very expensive
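To illustrate the synchronized => ReentrantLock item: a virtual thread that parks inside a synchronized block pins its carrier thread, while parking under a ReentrantLock lets it unmount. A minimal sketch of the pattern (class and method names are illustrative, not taken from the PR):

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: the replacement pattern for synchronized methods.
// Parking while holding a ReentrantLock lets a virtual thread unmount
// from its carrier; parking inside a synchronized block pins the carrier.
public class PinningSafeCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private long count;

    // Before: public synchronized void increment() { count++; }
    public void increment() {
        lock.lock();
        try {
            count++;
        } finally {
            lock.unlock();
        }
    }

    public long count() {
        lock.lock();
        try {
            return count;
        } finally {
            lock.unlock();
        }
    }
}
```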

Regarding performance, I ran several benchmarks using the benchmark module.
The benchmark simulates a workload with 5 I/O operations per task (forcing at least 5 context switches), with varying I/O latency (off-CPU time).

First, to find the maximum performance obtainable from the current THREAD_POOL runtime, I searched for the optimal value of decaton.partition.concurrency.

Command:

    ./debm.sh \
        --title "Decaton" \
        --runner com.linecorp.decaton.benchmark.DecatonRunner \
        --runs 2 \
        --format json \
        --tasks 10000 \
        --warmup 100000 \
        --simulate-latency=4 \
        --latency-count=5 \
        --param=decaton.partition.concurrency=$conc \
        --param=decaton.subpartition.runtime=THREAD_POOL | tee vthread-bm/threadpool-conc-$conc.json

Result:
[image: throughput vs. decaton.partition.concurrency]

It turned out that decaton.partition.concurrency=300 is the peak-performance setting.

Then I ran the same workload with increasing I/O latency against both the THREAD_POOL and VIRTUAL_THREAD modes.
Command:

    ./debm.sh \
        --title "Decaton" \
        --runner com.linecorp.decaton.benchmark.DecatonRunner \
        --runs 3 \
        --format json \
        --tasks 10000 \
        --warmup 100000 \
        --simulate-latency=$latency \
        --latency-count=5 \
        --param=decaton.partition.concurrency=300 \
        --param=decaton.subpartition.runtime={THREAD_POOL|VIRTUAL_THREAD}

Result:
[image: throughput vs. simulated I/O latency, THREAD_POOL vs. VIRTUAL_THREAD]

As the charts above show, the THREAD_POOL runtime's throughput decreases as the total I/O latency increases, while with the VIRTUAL_THREAD runtime throughput remains fairly stable, mitigating the impact of I/O latency.

@kawamuray kawamuray requested a review from ocadaruma January 17, 2024 11:21
@ocadaruma
Member

@kawamuray Should I start review while the PR is still in draft state?

@kawamuray
Member Author

@ocadaruma yes please, from rough design discussions perhaps.

Member

@ocadaruma ocadaruma left a comment


Left only nit comments.
I didn't review the test code and the benchmark's code in detail, but overall it looks good!

Can't wait to release this feature!

One concern is CI on JDK 8/11/17.
IMO we should still run tests on these versions because many users still run them.

After this PR gets merged, I plan to submit a follow-up PR to bring back the 8/11/17 test env.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ConcurrentHashSet<E> implements Set<E> {
Member


Why not use ConcurrentHashMap.newKeySet?

Member Author


Oh! I didn't know that API existed, good to know, will fix that.
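For reference, the API suggested above: ConcurrentHashMap.newKeySet() (available since Java 8) returns a concurrent Set view backed by a ConcurrentHashMap, so no hand-rolled ConcurrentHashSet is needed. A minimal usage sketch:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class NewKeySetExample {
    public static void main(String[] args) {
        // A thread-safe Set backed by a ConcurrentHashMap; no custom
        // ConcurrentHashSet implementation required.
        Set<String> partitions = ConcurrentHashMap.newKeySet();
        partitions.add("topic-0");
        partitions.add("topic-1");
        partitions.add("topic-0"); // duplicate, ignored
        System.out.println(partitions.size()); // prints 2
    }
}
```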

@kawamuray kawamuray marked this pull request as ready for review February 9, 2024 08:53
@kawamuray kawamuray requested a review from ocadaruma February 10, 2024 01:55
Member

@ocadaruma ocadaruma left a comment


Apart from comments, AveragingRateLimiter still uses synchronized. We should get rid of it too.

Thread.sleep(rand.nextInt(10));
}))
.propertySupplier(StaticPropertySupplier.of(
Property.ofStatic(ProcessorProperties.CONFIG_PARTITION_CONCURRENCY, 16)
Member


In VThreadCoreFunctionalityTest, there's no point in specifying partition concurrency, right?

Member Author


Good catch, will remove that.

.publishPercentiles(0.5, 0.9, 0.99, 0.999)
.register(registry));

public class PerPartitionMetrics extends AbstractMetrics {
Member


+1 for merging TaskMetrics and ProcessMetrics, but PerPartitionMetrics sounds confusing because it suggests a class containing all partitionScope metrics, which it is not.
Since all the metrics inside this class are about tasks, how about TaskMetrics?

protected final PerPartitionMetrics perPartitionMetrics;
protected final SchedulerMetrics schedulerMetrics;

public AbstractSubPartitions(PartitionScope scope, Processors<?> processors) {
Member


[nits] Let's use protected since this is an abstract class

try {
return unit.asyncClose()
.thenApply(ignored -> null) // To migrate type from Void to Object
.completeOnTimeout(TIMEOUT_INDICATOR,
Member


completeOnTimeout was introduced in Java 9, so it shouldn't be used here.

Member Author


Ugh, good catch. Now I need to figure out how to do this without making this part look messy...
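One Java 8-compatible approach (a sketch; the helper class name and scheduler wiring are assumptions for illustration) is to emulate completeOnTimeout with a ScheduledExecutorService that completes the future with a fallback value if it is still pending when the timeout fires:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public final class Timeouts {
    // Java 8 substitute for CompletableFuture.completeOnTimeout (Java 9+):
    // schedule a task that completes the future with a fallback value if it
    // is still pending when the timeout fires.
    public static <T> CompletableFuture<T> completeOnTimeout(
            CompletableFuture<T> cf, T value,
            long timeout, TimeUnit unit, ScheduledExecutorService scheduler) {
        ScheduledFuture<?> cancelFut = scheduler.schedule(() -> {
            if (!cf.isDone()) {
                cf.complete(value);
            }
        }, timeout, unit);
        // Drop the timer task once the future completes, to free the slot early.
        cf.whenComplete((v, e) -> cancelFut.cancel(false));
        return cf;
    }
}
```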

@@ -68,36 +65,33 @@ private void processTask(TaskRequest request) {
return;
}

Timer timer = Utils.timer();
CompletionStage<Void> processCompletion = CompletableFuture.completedFuture(null);
Member


Might not be a problem, but this creates an unnecessary CompletableFuture.completedFuture instance for every task, even when the task doesn't end up throwing an exception.

Let's supply the completedFuture inside the catch block?

Member Author


Good point!
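The suggested change, sketched (the processor interface and call here are illustrative, not the PR's exact code): allocate the completed future only on the exceptional path, so the happy path creates no extra object per task:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

public class CatchBlockSketch {
    // Hypothetical stand-in for the processing call made per task.
    interface Processor {
        CompletionStage<Void> process(String task);
    }

    static CompletionStage<Void> processTask(Processor processor, String task) {
        try {
            // Happy path: no extra CompletableFuture instance is allocated.
            return processor.process(task);
        } catch (RuntimeException e) {
            // Only the exceptional path pays for the completed-future allocation.
            return CompletableFuture.completedFuture(null);
        }
    }
}
```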

private Processors<?> processors;

@Test
public void testCleanupPartiallyInitializedUnits() throws Exception {
Member


Why is this test deleted?


@kawamuray
Member Author

Apart from comments, AveragingRateLimiter still uses synchronized. We should get rid of it too.

That's right. I knew about it but left it as-is intentionally, since it has almost no risk of conflicting with vthreads.
Replacing every usage of synchronized with locks has the pro of eliminating any risk of conflicting with vthreads, but the con of requiring more lines of code to do the same thing.
So I thought that for methods that are reasonably small, closed, and have no blocking work inside, it might be better to leave synchronized as-is.
Different thoughts?


@ocadaruma
Member

So I thought that for methods that are reasonably small, closed, and have no blocking work inside, it might be better to leave synchronized as-is.

Thank you for the explanation. Fair enough. Sounds good to keep synchronized then
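As a rule of thumb from the exchange above: synchronized only causes trouble under virtual threads if the thread parks while holding the monitor, so a short, non-blocking critical section can stay as-is. An illustrative example (not code from the PR) of the kind of method that is safe to keep synchronized:

```java
// Illustrative: a short, closed critical section with no blocking work.
// A virtual thread pins its carrier only while inside the monitor; since
// nothing here parks (no I/O, no sleeps), keeping synchronized costs
// little and stays simpler than the ReentrantLock equivalent.
public class WindowedCounter {
    private long windowStart;
    private long count;

    public synchronized void record(long nowMillis) {
        if (nowMillis - windowStart >= 1000) {
            // Start a new one-second window.
            windowStart = nowMillis;
            count = 0;
        }
        count++;
    }

    public synchronized long count() {
        return count;
    }
}
```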

@kawamuray
Member Author

Applied all feedback. PTAL

@kawamuray kawamuray requested a review from ocadaruma February 19, 2024 05:38
Member

@ocadaruma ocadaruma left a comment


Left only minor comment

}
ScheduledFuture<?> cancelFut = scheduledExecutor.schedule(() -> {
if (!cf.isDone()) {
cf.complete(value);
Member


Since this completion handler may include a destroyThreadProcessor call, which is potentially costly, shouldn't we have a pool size larger than 1 (like the number of available processors)?

Member Author


IIUC, the first argument is "corePoolSize", so my understanding of the behavior is: when there's a still-running scheduled task at the time the next scheduled execution occurs, the executor creates a new thread to process it (and the thread count is unbounded), so a scheduled task execution never blocks the others.

Member Author


hm, I think your concern is correct. fixing.
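For context on why the concern holds: a ScheduledThreadPoolExecutor runs with a fixed number of threads equal to corePoolSize and queues excess tasks rather than spawning new threads, so a size-1 scheduler serializes potentially costly completion handlers. A sketch of sizing by processor count (an assumption about the shape of the fix, not the PR's exact code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

public class SchedulerSizing {
    public static ScheduledExecutorService newTimeoutScheduler() {
        // ScheduledThreadPoolExecutor never grows past corePoolSize; extra
        // tasks wait in its internal queue. Sizing by available processors
        // keeps one slow completion handler from delaying all the others.
        int threads = Runtime.getRuntime().availableProcessors();
        return Executors.newScheduledThreadPool(threads);
    }
}
```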

@kawamuray kawamuray requested a review from ocadaruma February 29, 2024 04:39
Member

@ocadaruma ocadaruma left a comment


LGTM. Magnificent!

@ocadaruma ocadaruma merged commit 83be0df into line:master Mar 6, 2024
2 checks passed
@kawamuray kawamuray added new feature Add a new feature breaking change Breaking change for a public API labels Mar 7, 2024