
Implement adaptive remote task request size #10013

Closed

Conversation


@yansun7 (Contributor) commented Nov 19, 2021

Currently, HttpRemoteTask sends all pending splits in a single taskUpdateRequest, which sometimes produces a very large request per task on the coordinator.

This change adds the ability to dynamically split a large taskUpdateRequest into several smaller batches.

This feature introduces four parameters:

query.remote-task.enable-adaptive-request-size

  • session: enable_adaptive_remote_task_request_size
  • enables or disables the feature

query.remote-task.max-request-size

  • session: max_remote_task_request_size
  • the maximum size of a taskUpdateRequest

query.remote-task.request-size-headroom

  • session: remote_task_request_size_headroom
  • the expected mean size of a taskUpdateRequest will fall between (max_remote_task_request_size - remote_task_request_size_headroom) and max_remote_task_request_size

query.remote-task.guaranteed-splits-per-task

  • session: remote_task_guaranteed_splits_per_request
  • the minimum number of splits sent in a taskUpdateRequest for each task, regardless of max_remote_task_request_size, to prevent deadlock
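The interplay of the last three parameters can be sketched as follows. This is an illustrative stand-in, not the PR's actual implementation: when a serialized request exceeds the configured maximum, the batch size shrinks proportionally, but never below the guaranteed per-task minimum.

```java
// Illustrative sketch of the adaptive sizing idea; class and method
// names are hypothetical, not the actual classes in this PR.
public class AdaptiveBatchSizer
{
    static int nextBatchSize(long serializedRequestBytes, long maxRequestBytes, int currentBatchSize, int guaranteedSplitsPerTask)
    {
        if (serializedRequestBytes <= maxRequestBytes) {
            // request fits within the limit: keep the current batch size
            return currentBatchSize;
        }
        // shrink proportionally to how far the request overshot the limit
        int shrunk = (int) (currentBatchSize * maxRequestBytes / serializedRequestBytes);
        // never go below the guaranteed minimum, to avoid scheduling deadlock
        return Math.max(shrunk, guaranteedSplitsPerTask);
    }
}
```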

@cla-bot cla-bot bot added the cla-signed label Nov 19, 2021
@yansun7 yansun7 force-pushed the adaptive-remote-task-request-size branch from a6b33e1 to 99595ec Compare November 19, 2021 20:22
@yansun7 yansun7 changed the title implement adaptive remote task request size Implement adaptive remote task request size Nov 19, 2021
@yansun7 yansun7 marked this pull request as draft November 20, 2021 03:21
@yansun7 yansun7 force-pushed the adaptive-remote-task-request-size branch from 29d32ba to 2612574 Compare November 22, 2021 07:34
@yansun7 yansun7 marked this pull request as ready for review November 22, 2021 14:48
@yansun7 yansun7 force-pushed the adaptive-remote-task-request-size branch from 2612574 to f77f287 Compare November 23, 2021 17:25
@martint martint requested a review from arhimondr December 2, 2021 06:40

@arhimondr (Contributor) commented Dec 2, 2021

> sometimes causing large request for each task in coordinator.

How does this problem manifest itself? Could you please elaborate more on the problem you are trying to solve?

@JunhyungSong (Member)

@arhimondr
The problematic situation:
  • A table has 3884 columns.
  • The table schema has a “parameters” field containing the whole table schema as a string (~300k characters).
  • Each split object in the task update request is large because of those big strings in the table properties.
  • When a taskUpdateRequest contains many such huge splits and is sent to the worker over HTTP, the coordinator can crash with an out-of-memory error.
So this PR limits the size of a taskUpdateRequest not only via max_unacknowledged_splits_per_task, but also via max_remote_task_request_size.

@yansun7 can you fix the merge conflicts, please?

@@ -58,6 +58,11 @@
private Duration remoteTaskMaxErrorDuration = new Duration(5, TimeUnit.MINUTES);
private int remoteTaskMaxCallbackThreads = 1000;

private boolean enabledAdaptiveTaskRequestSize;
Why isn't the default 'true'?

sources = sources.stream()
.map(source -> new TaskSource(
source.getPlanNodeId(),
source.getSplits().stream()
.filter(scheduledSplit -> scheduledSplit.getSequenceId() > currentMaxAcknowledgedSplit)
.filter(scheduledSplit -> scheduledSplit.getSequenceId() > maxAcknowledgedSplitByPlanNode.getOrDefault(source.getPlanNodeId(), Long.MIN_VALUE))

There could be lots of splits, and in that case there will be lots of map accesses. It would be better to do it like this:

            .map(source -> {
                long currentMaxAcknowledgedSplit = maxAcknowledgedSplitByPlanNode.getOrDefault(source.getPlanNodeId(), Long.MIN_VALUE);
                return new TaskSource(
                        source.getPlanNodeId(),
                        source.getSplits().stream()
                                .filter(scheduledSplit -> scheduledSplit.getSequenceId() > currentMaxAcknowledgedSplit)
                                .collect(Collectors.toSet()),
                        // Like splits, noMoreSplitsForLifespan could be pruned so that only new items will be processed.
                        // This is not happening here because correctness won't be compromised due to duplicate events for noMoreSplitsForLifespan.
                        source.getNoMoreSplitsForLifespan(),
                        source.isNoMoreSplits());
            })

.max()
.orElse(maxAcknowledgedSplit);
for (TaskSource taskSource : sources) {
long maxAcknowledgedSplit = taskSource.getSplits().stream()

When taskSource.getSplits().size() == 0, it can 'continue;'.

long maxAcknowledgedSplit = taskSource.getSplits().stream()
.mapToLong(ScheduledSplit::getSequenceId)
.max()
.orElse(Long.MIN_VALUE);

If taskSource.getSplits().size() == 0 is already checked, .orElse(Long.MIN_VALUE) is not needed. Instead of that, just '.getAsLong()'.
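The two suggestions above (skip empty sources with 'continue;', then read the maximum with getAsLong()) can be sketched together. The Split and Source records below are simplified stand-ins for ScheduledSplit and TaskSource, not the actual Trino classes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for ScheduledSplit and TaskSource, for illustration only
record Split(long sequenceId) {}
record Source(String planNodeId, List<Split> splits) {}

class AckTracker
{
    final Map<String, Long> maxAcknowledgedSplitByPlanNode = new HashMap<>();

    void acknowledge(List<Source> sources)
    {
        for (Source source : sources) {
            if (source.splits().isEmpty()) {
                continue; // nothing new to acknowledge for this plan node
            }
            long maxAcknowledgedSplit = source.splits().stream()
                    .mapToLong(Split::sequenceId)
                    .max()
                    .getAsLong(); // safe: the split set is non-empty
            maxAcknowledgedSplitByPlanNode.merge(source.planNodeId(), maxAcknowledgedSplit, Math::max);
        }
    }
}
```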

if (!maxAcknowledgedSplitByPlanNode.containsKey(planNodeId)) {
maxAcknowledgedSplitByPlanNode.put(planNodeId, Long.MIN_VALUE);
}
maxAcknowledgedSplitByPlanNode.computeIfPresent(planNodeId, (key, val) -> maxAcknowledgedSplit > val ? maxAcknowledgedSplit : val);

If taskSource.getSplits().size() == 0 is already checked, it should throw when val >= maxAcknowledgedSplit.

e.g.

            maxAcknowledgedSplitByPlanNode.compute(planNodeId, (key, val) -> {
                if (val == null) {
                    return maxAcknowledgedSplit;
                }
                if (val >= maxAcknowledgedSplit) {
                    throw new IllegalStateException(format("%s - splits are out of order? planNodeId=%s, newMax=%d, currentMax=%d", taskId, planNodeId, maxAcknowledgedSplit, val));
                }
                return maxAcknowledgedSplit;
            });

Comment on lines +363 to +365
if (!maxAcknowledgedSplitByPlanNode.containsKey(planNodeId)) {
maxAcknowledgedSplitByPlanNode.put(planNodeId, Long.MIN_VALUE);
}

If taskSource.getSplits().size() == 0 is already checked, this is not needed.

// abandon current request and reschedule update if size of request body exceeds requestSizeLimit
// and splitBatchSize is updated
if (taskUpdateRequestJson.length > maxRequestSize && splitBatchSize < oldSplitBatchSize) {
log.debug("%s - current taskUpdateRequestJson exceeded limit: %d, abandon.", taskId, taskUpdateRequestJson.length);

log.info?

{
int numSplits = 0;
for (TaskSource taskSource : sources) {
numSplits = Math.max(numSplits, taskSource.getSplits().size());

It should find max only for partitioned sources.

{
int numSplits = 0;
for (TaskSource taskSource : sources) {
numSplits = Math.max(numSplits, taskSource.getSplits().size());

With colocated joins on bucketed tables, there can be more than one partitioned source. In that case, this logic becomes awkward. Please apply it only when there is a single partitioned source.
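The suggestion could look roughly like the following sketch. The SourceInfo record and its partitioned flag are hypothetical stand-ins for illustration; the actual PR code would need to determine partitioned-ness from its own types:

```java
import java.util.List;
import java.util.OptionalInt;

// Hypothetical stand-in: a source with a split count and a partitioned flag
record SourceInfo(int splitCount, boolean partitioned) {}

class BatchSizeHeuristic
{
    // Use the adaptive split count only when there is exactly one
    // partitioned source; otherwise fall back (empty result).
    static OptionalInt adaptiveNumSplits(List<SourceInfo> sources)
    {
        List<SourceInfo> partitioned = sources.stream()
                .filter(SourceInfo::partitioned)
                .toList();
        if (partitioned.size() != 1) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(partitioned.get(0).splitCount());
    }
}
```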

@bitsondatadev (Member)

👋 @yansun7 - this PR is inactive and doesn't seem to be under development. If you'd like to continue work on this at any point in the future, feel free to re-open.

4 participants