Increase node scheduler config parameters #15579

Merged (1 commit, Feb 1, 2023)

Conversation

@Dith3r (Member) commented Jan 3, 2023

Description

Change the defaults of max-unacknowledged-splits-per-task and max-adjusted-pending-splits-per-task to 2000 for better overall performance.
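
For reference, these defaults could be set explicitly in the coordinator's config.properties; a minimal sketch, assuming the node-scheduler. prefix (the prefix is not spelled out in this PR):

```properties
# New defaults proposed in this PR
# (max-unacknowledged-splits-per-task previously defaulted to 500, per the discussion below)
node-scheduler.max-unacknowledged-splits-per-task=2000
node-scheduler.max-adjusted-pending-splits-per-task=2000
```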

SELECT count(*) from hive.test.lineitem_parquet_64_group

old: 6.9304 s
new: 5.8872 s

SELECT count(orderkey) from hive.test.lineitem_parquet_64_group

old: 9.0065 s
new: 8.0764 s

SELECT count(orderkey),count(suppkey) from hive.test.lineitem_parquet_64_group

old: 22.1207 s
new: 21.2324 s

(attached image: benchmark results)

Concurrency benchmark (queries per hour):
new: 7698
old: 7269

Additional context and related issues

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Jan 3, 2023
@Dith3r Dith3r requested review from sopel39 and lukasz-stec January 3, 2023 12:35
@github-actions github-actions bot added the docs label Jan 3, 2023
@sopel39 sopel39 requested a review from pettyjamesm January 4, 2023 12:18
@sopel39 (Member) commented Jan 4, 2023

@pettyjamesm this PR increases the unacknowledged splits limit to 2000, I think it's fine, wdyt?

@pettyjamesm (Member) commented:

> @pettyjamesm this PR increases the unacknowledged splits limit to 2000, I think it's fine, wdyt?

I'm a little skeptical that raising the value to 2,000 is necessary or actually safe to pick up as the default based on the results of running select count queries over parquet files. It's not especially surprising that this change would generate those improvements in that specific scenario, since each split will be exceedingly cheap to process and the scheduler latency will dominate query performance, but the TPCH and TPCDS queries appear to show no improvement beyond what looks to be the margin of run-to-run variance.

These configuration properties are probably rarely configured away from their defaults in most deployments, so we want to be a little cautious about picking overly-aggressive defaults. In particular, increasing max-unacknowledged-splits-per-task by 4x could result in (almost) 4x larger task update payloads, which might create problems for some users.

I have a couple recommendations about how you might want to proceed:

  1. Consider re-running TPCH and TPCDS benchmarks with "small file" (maybe ~1-10MB per file?) data sets to see whether increasing the value to 1,000 or 2,000 actually shows an improvement there without over-biasing towards very "cheap" workloads like select count(*)

  2. Consider making max-unacknowledged-splits-per-task a session property so that higher values can be provided in specific situations where the improvement is expected to be "safe" and the performance improvement is appreciable (eg: select count(*) over tables with small files and without too many columns). A sketch of what that could look like follows below.
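
If that second suggestion were adopted, usage could look roughly like this; the session property name max_unacknowledged_splits_per_task is hypothetical, since this PR does not add a session property:

```sql
-- Hypothetical session property, not part of this PR
SET SESSION max_unacknowledged_splits_per_task = 2000;

-- A cheap, split-dominated query of the kind discussed above
SELECT count(*) FROM hive.test.lineitem_parquet_64_group;
```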

@sopel39 (Member) commented Jan 4, 2023

@pettyjamesm

> These configuration properties are probably rarely configured away from their defaults in most deployments, so we want to be a little cautious about picking overly-aggressive defaults. In particular, increasing max-unacknowledged-splits-per-task by 4x could result in (almost) 4x larger task update payloads, which might create problems for some users.

What error have you actually been observing? I'm particularly interested in why 500 was chosen as the default for max-unacknowledged-splits-per-task and not some other value.

> Consider re-running TPCH and TPCDS benchmarks with "small file" (maybe ~1-10MB per file?) data sets to see whether increasing the value to 1,000 or 2,000 actually shows an improvement there without over-biasing towards very "cheap" workloads like select count(*)

It's not really about TPCH/TPCDS but rather about workloads with:

  • selective queries (data skipping)
  • empty splits (big row groups)
  • small splits
  • cache

@pettyjamesm (Member) commented:

> What error have you actually been observing?

Depends on how large the payloads get. Potentially you could trigger request timeouts because of the amount of time the worker spends parsing the task update payload JSON, but before that point you'll see high allocation and GC activity on the coordinator. In Presto they added a hard limit of 16MB on the task update body and fail the task immediately to avoid some of the issues they saw at the time.

> I'm particularly interested in why 500 was chosen as the default for max-unacknowledged-splits-per-task and not some other value.

The value was chosen before the existence of the QueueSizeAdjuster in UniformNodeSelector. At the time it was clear that there needed to be some limit on the maximum number of splits sent in a single request but not clear what that limit should be, so I experimented with the small file datasets and chose a value large enough that the performance of queries like select sum(column) from ... no longer improved without seeming "unreasonably high" to me personally. I chose sum instead of count to avoid over-biasing towards super cheap queries and to ensure that the data would actually be read from the input files instead of just getting picked out of the parquet footer.

@Dith3r (Member, Author) commented Jan 5, 2023

Quick local test:

select sum(orderkey) from hive.sf100.lineitem_parquet_256_group

old: 13.8509 ±0.7932
new: 9.2262 ±1.5454

@Dith3r (Member, Author) commented Jan 5, 2023

Test:

SELECT sum(orderkey / 10000) from hive.test.lineitem_parquet_64_group

old: 10.0148 ±0.6281
new: 9.6028 ±0.5703

Review comment on the docs change (excerpt from the proposed property description):

* **Type:** :ref:`prop-type-integer`
* **Default value:** ``2000``

Maximum number of splits that are either queued on the coordinator, but not yet sent or confirmed to have been received by [...]

A reviewer (Member) commented:

Does increasing this value have any significant implications for memory usage on the coordinator? I assume this change implies that every table scan can now queue up 4x more splits on the coordinator.

@Dith3r (Member, Author) replied Jan 9, 2023:

This is limited by max-splits-per-node, which still defaults to 100. A higher number is only used when we adjust the queue size because splits are processed faster than they are assigned. The adjustment is triggered only if the node processed its queue between calls to computeAssignment; otherwise it does not impact query scheduling.
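
To make the relationship between these limits concrete, here is a sketch of the relevant node scheduler properties with the values discussed in this thread; the node-scheduler. prefix and the min-pending-splits-per-task value are illustrative assumptions, not taken from this PR:

```properties
# Cap on splits queued per node (still 100, per the comment above)
node-scheduler.max-splits-per-node=100

# Baseline pending-splits target per task; only grown when a node drains its
# queue between scheduling passes (value shown is illustrative)
node-scheduler.min-pending-splits-per-task=10

# Upper bound the dynamic queue-size adjustment can reach (raised by this PR)
node-scheduler.max-adjusted-pending-splits-per-task=2000

# Splits sent to a worker but not yet acknowledged (raised by this PR)
node-scheduler.max-unacknowledged-splits-per-task=2000
```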

@JunhyungSong (Member) commented:

The concerns that @pettyjamesm mentioned can be mitigated by #15721.

@Dith3r (Member, Author) commented Jan 31, 2023

@pettyjamesm Adaptive request size was merged, do you think that increasing the configuration values is OK now?

@pettyjamesm (Member) commented:

> Adaptive request size was merged, do you think that increasing the configuration values is OK now?

I'm OK with increasing the value after the adaptive task update request PR, so now I suppose the question is whether 2,000 is appropriate compared to, say, 1,000. Is there still a noticeable difference between 1,000 and 2,000 for the test queries in your environment? How about 1,500? If so, then sure, 2,000 works for me. If not, then I might suggest erring on the side of a more conservative increase, since there will still be some GC overhead on the coordinator for those unacknowledged splits, and there's some risk that this could increase processing time skew between workers when some splits are significantly more expensive than others.

@Dith3r (Member, Author) commented Jan 31, 2023

A few configuration options, such as max-splits-per-node, max-adjusted-pending-splits-per-task, and max-unacknowledged-splits-per-task, were tested over a range from 1000 up to 4000 in steps of 500 (and others, like min-pending-splits-per-task, over a different range) with a set of simple queries. The configurations with the best outcomes were then tested with the TPCH, TPCDS, and concurrency test suites. The values in this PR gave the best overall setup.

> there's some risk that this could increase processing time skew between workers when some splits are significantly more expensive than others.

There is still min-pending-splits-per-task, which is increased only if a worker processes splits faster than it receives them from the coordinator, the coordinator has more splits to assign, and the node was marked as full. Otherwise, it works as before.

@sopel39 sopel39 merged commit 5033623 into trinodb:master Feb 1, 2023
@sopel39 sopel39 mentioned this pull request Feb 1, 2023
@github-actions github-actions bot added this to the 407 milestone Feb 1, 2023