Increase node scheduler config parameters #15579

Merged (1 commit, Feb 1, 2023)

Conversation

@Dith3r (Member) commented Jan 3, 2023

Description

Change the defaults of max-unacknowledged-splits-per-task and max-adjusted-pending-splits-per-task to 2000 for better overall performance.
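
For reference, these defaults could be set explicitly in the coordinator's config.properties; a minimal sketch, assuming the node-scheduler. prefix (the prefix is not spelled out in this PR):

```properties
# New defaults proposed in this PR
# (max-unacknowledged-splits-per-task previously defaulted to 500, per the discussion below)
node-scheduler.max-unacknowledged-splits-per-task=2000
node-scheduler.max-adjusted-pending-splits-per-task=2000
```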

SELECT count(*) from hive.test.lineitem_parquet_64_group

old: 6.9304 s
new: 5.8872 s

SELECT count(orderkey) from hive.test.lineitem_parquet_64_group

old: 9.0065 s
new: 8.0764 s

SELECT count(orderkey),count(suppkey) from hive.test.lineitem_parquet_64_group

old: 22.1207 s
new: 21.2324 s

(attached image: benchmark results)

Concurrency benchmark (queries per hour):
new: 7698
old: 7269

Additional context and related issues

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Jan 3, 2023
@Dith3r Dith3r requested review from sopel39 and lukasz-stec January 3, 2023 12:35
@github-actions github-actions bot added the docs label Jan 3, 2023
@sopel39 sopel39 requested a review from pettyjamesm January 4, 2023 12:18
@sopel39 (Member) commented Jan 4, 2023

@pettyjamesm this PR increases the unacknowledged splits limit to 2000, I think it's fine, wdyt?

@pettyjamesm (Member) commented:

> @pettyjamesm this PR increases the unacknowledged splits limit to 2000, I think it's fine, wdyt?

I'm a little skeptical that raising the value to 2,000 is necessary or actually safe to pick up as the default based on the results of running select count queries over parquet files. It's not especially surprising that this change would generate those improvements in that specific scenario, since each split will be exceedingly cheap to process and the scheduler latency will dominate query performance, but the TPCH and TPCDS queries appear to show no improvement beyond what looks to be the margin of run-to-run variance.

These configuration properties are probably rarely configured away from their defaults in most deployments, so we want to be a little cautious about picking overly-aggressive defaults. In particular, increasing max-unacknowledged-splits-per-task by 4x could result in (almost) 4x larger task update payloads, which might create problems for some users.

I have a couple recommendations about how you might want to proceed:

  1. Consider re-running TPCH and TPCDS benchmarks with "small file" (maybe ~1-10MB per file?) data sets to see whether increasing the value to 1,000 or 2,000 actually shows an improvement there without over-biasing towards very "cheap" workloads like select count(*)

  2. Consider making max-unacknowledged-splits-per-task a session property so that higher values can be provided in specific situations where the improvement is expected to be "safe" and the performance improvement is appreciable (eg: select count(*) over tables with small files and without too many columns). A sketch of what that could look like follows below.
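
If that second suggestion were adopted, usage could look roughly like this; the session property name max_unacknowledged_splits_per_task is hypothetical, since this PR does not add a session property:

```sql
-- Hypothetical session property, not part of this PR
SET SESSION max_unacknowledged_splits_per_task = 2000;

-- A cheap, split-dominated query of the kind discussed above
SELECT count(*) FROM hive.test.lineitem_parquet_64_group;
```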

@sopel39 (Member) commented Jan 4, 2023

@pettyjamesm

> These configuration properties are probably rarely configured away from their defaults in most deployments, so we want to be a little cautious about picking overly-aggressive defaults. In particular, increasing max-unacknowledged-splits-per-task by 4x could result in (almost) 4x larger task update payloads, which might create problems for some users.

What error have you actually been observing? I'm particularly interested in why 500 was chosen as the default for max-unacknowledged-splits-per-task and not some other value.

> Consider re-running TPCH and TPCDS benchmarks with "small file" (maybe ~1-10MB per file?) data sets to see whether increasing the value to 1,000 or 2,000 actually shows an improvement there without over-biasing towards very "cheap" workloads like select count(*)

It's not really about TPCH/TPCDS but rather about workloads with:

  • selective queries (data skipping)
  • empty splits (big row groups)
  • small splits
  • cache

@pettyjamesm (Member) commented:

> What error have you actually been observing?

Depends on how large the payloads get. Potentially you could trigger request timeouts because of the amount of time the worker spends parsing the task update payload JSON, but before that point you'll see high allocation and GC activity on the coordinator. In Presto they added a hard limit of 16MB on the task update body and fail the task immediately to avoid some of the issues they saw at the time.

> I'm particularly interested in why 500 was chosen as the default for max-unacknowledged-splits-per-task and not some other value.

The value was chosen before the existence of the QueueSizeAdjuster in UniformNodeSelector. At the time it was clear that there needed to be some limit on the maximum number of splits sent in a single request but not clear what that limit should be, so I experimented with the small file datasets and chose a value large enough that the performance of queries like select sum(column) from ... no longer improved without seeming "unreasonably high" to me personally. I chose sum instead of count to avoid over-biasing towards super cheap queries and to ensure that the data would actually be read from the input files instead of just getting picked out of the parquet footer.

@Dith3r (Member, Author) commented Jan 5, 2023

Quick local test:

select sum(orderkey) from hive.sf100.lineitem_parquet_256_group

old: 13.8509 ±0.7932
new: 9.2262 ±1.5454

@Dith3r (Member, Author) commented Jan 5, 2023

Test:

SELECT sum(orderkey / 10000) from hive.test.lineitem_parquet_64_group

old: 10.0148 ±0.6281
new: 9.6028 ±0.5703

Review comment on the docs change (excerpt from the proposed property description):

* **Type:** :ref:`prop-type-integer`
* **Default value:** ``2000``

Maximum number of splits that are either queued on the coordinator, but not yet sent or confirmed to have been received by [...]

A reviewer (Member) commented:

Does increasing this value have any significant implications for memory usage on the coordinator? I assume this change implies that every table scan can now queue up 4x more splits on the coordinator.

@Dith3r (Member, Author) replied Jan 9, 2023:

This is limited by max-splits-per-node, which still defaults to 100. A higher number is only used when we adjust the queue size because splits are processed faster than they are assigned. The adjustment is triggered only if the node processed its queue between calls to computeAssignment; otherwise it does not impact query scheduling.
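
To make the relationship between these limits concrete, here is a sketch of the relevant node scheduler properties with the values discussed in this thread; the node-scheduler. prefix and the min-pending-splits-per-task value are illustrative assumptions, not taken from this PR:

```properties
# Cap on splits queued per node (still 100, per the comment above)
node-scheduler.max-splits-per-node=100

# Baseline pending-splits target per task; only grown when a node drains its
# queue between scheduling passes (value shown is illustrative)
node-scheduler.min-pending-splits-per-task=10

# Upper bound the dynamic queue-size adjustment can reach (raised by this PR)
node-scheduler.max-adjusted-pending-splits-per-task=2000

# Splits sent to a worker but not yet acknowledged (raised by this PR)
node-scheduler.max-unacknowledged-splits-per-task=2000
```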

@JunhyungSong (Member) commented:

The concerns that @pettyjamesm mentioned can be mitigated by #15721.

@Dith3r (Member, Author) commented Jan 31, 2023

@pettyjamesm Adaptive request size was merged, do you think that increasing the configuration values is OK now?

@pettyjamesm (Member) commented:

> Adaptive request size was merged, do you think that increasing the configuration values is OK now?

I'm OK with increasing the value after the adaptive task update request PR, so now I suppose the question is whether 2,000 is appropriate compared to, say, 1,000. Is there still a noticeable difference between 1,000 and 2,000 for the test queries in your environment? How about 1,500? If so, then sure, 2,000 works for me. If not, then I might suggest erring on the side of a more conservative increase, since there will still be some GC overhead on the coordinator for those unacknowledged splits, and there's some risk that this could increase processing time skew between workers when some splits are significantly more expensive than others.

@Dith3r (Member, Author) commented Jan 31, 2023

A few configuration options, such as max-splits-per-node, max-adjusted-pending-splits-per-task, and max-unacknowledged-splits-per-task, were tested over a range from 1000 up to 4000 in steps of 500 (and others, like min-pending-splits-per-task, over a different range) with a set of simple queries. The configurations with the best outcomes were then tested with the TPCH, TPCDS, and concurrency test suites. The values in this PR gave the best overall setup.

> there's some risk that this could increase processing time skew between workers when some splits are significantly more expensive than others.

There is still min-pending-splits-per-task, which is increased only if a worker processes splits faster than it receives them from the coordinator, the coordinator has more splits to assign, and the node was marked as full. Otherwise, it works as before.

@sopel39 sopel39 merged commit 5033623 into trinodb:master Feb 1, 2023
@sopel39 sopel39 mentioned this pull request Feb 1, 2023
@github-actions github-actions bot added this to the 407 milestone Feb 1, 2023