Lazy build join output to improve performance of ALL join #58278

Merged
merged 14 commits into from
Mar 5, 2024

Conversation

liuneng1994
Contributor

@liuneng1994 liuneng1994 commented Dec 28, 2023

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Lazy build join output to improve performance of ALL join

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@liuneng1994
Contributor Author

image
I compared #54662 and #56996 using Intel VTune. The main performance difference is in the front-end bound metric. From the code's point of view, delaying output generation can improve the L1 cache hit rate.
@vdimir
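
For readers following the thread, here is a minimal sketch of the lazy-output idea described above: the probe loop only records where each matching row lives, and the output columns are filled afterwards, one column at a time. The types `Column`, `Block`, `LazyOutput` and the helper `appendEager` are hypothetical stand-ins, not ClickHouse's classes; only the field names `blocks` and `row_nums` mirror the diff excerpts quoted in this conversation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; these are not ClickHouse's real types.
using Column = std::vector<int64_t>;
struct Block { std::vector<Column> columns; };

// Eager: every match immediately appends one value to every output column,
// so the probe loop alternates between lookup code and insertion code.
void appendEager(std::vector<Column> & out, const Block & block, size_t row)
{
    for (size_t c = 0; c < out.size(); ++c)
        out[c].push_back(block.columns[c][row]);
}

// Lazy: the probe loop only records where the matching row lives.
struct LazyOutput
{
    std::vector<const Block *> blocks;
    std::vector<size_t> row_nums;

    void add(const Block & block, size_t row)
    {
        blocks.push_back(&block);
        row_nums.push_back(row);
    }

    // Materialize the output column by column once probing is finished.
    std::vector<Column> build(size_t num_columns) const
    {
        std::vector<Column> out(num_columns);
        for (size_t c = 0; c < num_columns; ++c)
        {
            out[c].reserve(row_nums.size());
            for (size_t j = 0; j < row_nums.size(); ++j)
                out[c].push_back(blocks[j]->columns[c][row_nums[j]]);
        }
        return out;
    }
};
```

Both paths produce the same rows; the difference is only in when the copying happens, which is what the cache-behavior argument above is about.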

@liuneng1994 liuneng1994 changed the title from "Lazy build join output to improve performance of ALL join" to "Lazy build join output to improve performance of join" Dec 28, 2023
@vdimir vdimir self-assigned this Dec 28, 2023
@vdimir vdimir added the can be tested Allows running workflows for external contributors label Dec 28, 2023
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-performance Pull request with some performance improvements label Dec 28, 2023
@robot-ch-test-poll4
Contributor

robot-ch-test-poll4 commented Dec 28, 2023

This is an automated comment for commit 4af3395 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Successful checks
| Check name | Description | Status |
| --- | --- | --- |
| A Sync | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success |
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | ✅ success |
| Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docker server image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success |
| Install packages | Checks that the built packages are installable in a clear environment | ✅ success |
| Mergeable Check | Checks if all other necessary checks are successful | ✅ success |
| Performance Comparison | Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ✅ success |
| SQLTest | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| SQLancer | Fuzzing tests that detect logical bugs with SQLancer tool | ✅ success |
| Sqllogic | Run clickhouse on the sqllogic test set against sqlite and checks that all statements are passed | ✅ success |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |
| Upgrade check | Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts | ✅ success |

| Check name | Description | Status |
| --- | --- | --- |
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help | ❌ failure |
| CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ⏳ pending |
| Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ❌ failure |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ❌ failure |

@liuneng1994 liuneng1994 force-pushed the optimize-all-join branch 3 times, most recently from be74588 to ad6ce16 Compare January 4, 2024 09:00
@liuneng1994 liuneng1994 changed the title from "Lazy build join output to improve performance of join" to "Lazy build join output to improve performance of ALL join" Jan 5, 2024
@liuneng1994
Contributor Author

Aarch64
image
image
X86-64
image
image
A very strange phenomenon: in the previous version, ANY join performance on AArch64 dropped by 30%. In the current version there is no regression on AArch64, but performance drops by 10% on x86.
I can't reproduce this regression on my machine (12900K), despite trying many approaches. I added a new ANY join test case, and it showed no performance problems in testing. So I suspect the regression is specific to that particular test case.
Overall, this optimization should have more advantages than disadvantages.
@vdimir

@baibaichen
Contributor

@alexey-milovidov @vdimir any comments?

Member

@vdimir vdimir left a comment

Sorry it slipped my mind.
In general it looks reasonable. Maybe we can also introduce a way to support the old behaviour, so that if we find any cases with degradation we could fall back? But I'm not sure how to fall back, because introducing a setting doesn't look practical either.

continue;
}
}
col->insertFrom(*column_from_block.column, lazy_output.row_nums[j]);
Contributor

@binmahone binmahone Jan 31, 2024

insert is a virtual function, so it may be expensive to call it repeatedly in a loop. Would something like insertSelective() help performance here?

Member

Not sure whether it's a bottleneck or not, but it's worth trying, perhaps in another PR.
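
To make the trade-off in this review thread concrete, here is a minimal sketch contrasting one virtual call per copied row with one virtual call per batch of rows. `IColumnSketch`, `Int64ColumnSketch` and `insertSelectedFrom` are hypothetical names invented for illustration; they are not the real ClickHouse column interface, and the batched method is only an assumed shape for "something like insertSelective()".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal column interface; not the real ClickHouse IColumn.
struct IColumnSketch
{
    virtual ~IColumnSketch() = default;
    // One virtual dispatch per copied row.
    virtual void insertFrom(const IColumnSketch & src, size_t row) = 0;
    // One virtual dispatch per batch of rows (assumed batched variant).
    virtual void insertSelectedFrom(const IColumnSketch & src, const std::vector<size_t> & rows) = 0;
};

struct Int64ColumnSketch final : IColumnSketch
{
    std::vector<int64_t> data;

    void insertFrom(const IColumnSketch & src, size_t row) override
    {
        data.push_back(static_cast<const Int64ColumnSketch &>(src).data[row]);
    }

    void insertSelectedFrom(const IColumnSketch & src, const std::vector<size_t> & rows) override
    {
        const auto & src_data = static_cast<const Int64ColumnSketch &>(src).data;
        data.reserve(data.size() + rows.size());
        for (size_t row : rows) // tight loop with no virtual dispatch inside
            data.push_back(src_data[row]);
    }
};
```

Note that a batched call of this shape only pays off when many selected rows come from the same source column; if every row may live in a different block, the batch degenerates to size one.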

@liuneng1994
Contributor Author

> Sorry it slipped my mind. In general it looks reasonable. Maybe we can also introduce a way to support the old behaviour, so that if we find any cases with degradation we could fall back? But I'm not sure how to fall back, because introducing a setting doesn't look practical either.

I have previously verified that the performance regression goes away as long as ANY join uses the original algorithm. I can add a subclass LazyAddedColumns to AddedColumns, and select which one to use in HashJoin::joinBlockImpl based on whether the join type is ANY join. Is this plan acceptable?

@vdimir
Member

vdimir commented Feb 21, 2024

> I can add a subclass LazyAddedColumns to AddedColumns, and select which one to use in HashJoin::joinBlockImpl based on whether the join type is ANY join. Is this plan acceptable?

Yes, sounds good 👍
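
As an illustration of the plan agreed here, the sketch below picks the eager or lazy strategy once per joined block, based on the join strictness. It uses a compile-time `bool` template parameter rather than a subclass; that substitution is an assumption made for this example, not necessarily what the PR implements. `Strictness`, `AddedColumnsSketch`, `joinBlockSketch` and `joinBlockDispatch` are all hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Strictness { Any, All }; // hypothetical, simplified

template <bool lazy>
struct AddedColumnsSketch
{
    std::vector<int> output;          // eager: values appended while probing
    std::vector<size_t> pending_rows; // lazy: row numbers recorded while probing

    void appendFrom(const std::vector<int> & right, size_t row)
    {
        if constexpr (lazy)
            pending_rows.push_back(row);
        else
            output.push_back(right[row]);
    }

    // For the lazy variant, copy the recorded rows after probing; no-op otherwise.
    void buildOutput(const std::vector<int> & right)
    {
        if constexpr (lazy)
        {
            for (size_t row : pending_rows)
                output.push_back(right[row]);
            pending_rows.clear();
        }
    }
};

template <bool lazy>
std::vector<int> joinBlockSketch(const std::vector<int> & right, const std::vector<size_t> & matches)
{
    AddedColumnsSketch<lazy> added;
    for (size_t row : matches)
        added.appendFrom(right, row);
    added.buildOutput(right);
    return added.output;
}

// The strategy is chosen once per block, not once per row.
std::vector<int> joinBlockDispatch(Strictness s, const std::vector<int> & right, const std::vector<size_t> & matches)
{
    return s == Strictness::Any ? joinBlockSketch<false>(right, matches)
                                : joinBlockSketch<true>(right, matches);
}
```

With this shape, ANY join keeps the original eager path while other strictness values use the lazy path, and the per-row calls are resolved at compile time.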

@liuneng1994
Contributor Author

image
Virtual function calls on AddedColumns can severely impact performance.
Using an insertSelective call in build_output, instead of calling insert in a for loop, can eliminate a large number of virtual function calls. The prerequisite is that the original data in the hash map is merged into one large block, with RowRef storing only the row number rather than the block pointer.
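
The prerequisite mentioned above can be sketched as follows: concatenate the right-side blocks into one column and translate each (block pointer, row number) reference into a single global row number, so that output can be gathered in one pass. `RowRefSketch`, `MergedSketch`, `mergeBlocks` and `gather` are hypothetical names for this example and assume single-column blocks; the real data structures are more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Column = std::vector<int64_t>; // hypothetical single-column block

// Reference that carries both a block pointer and a row number.
struct RowRefSketch { const Column * block; size_t row_num; };

struct MergedSketch
{
    Column data;                                       // all blocks concatenated
    std::unordered_map<const Column *, size_t> offset; // first row of each block in `data`

    // Translate a (block, row) reference into a row number in `data`.
    size_t globalRow(const RowRefSketch & ref) const { return offset.at(ref.block) + ref.row_num; }
};

MergedSketch mergeBlocks(const std::vector<const Column *> & blocks)
{
    MergedSketch merged;
    for (const Column * block : blocks)
    {
        merged.offset[block] = merged.data.size();
        merged.data.insert(merged.data.end(), block->begin(), block->end());
    }
    return merged;
}

// With one merged column, output is built by a single gather over row numbers.
Column gather(const MergedSketch & merged, const std::vector<size_t> & rows)
{
    Column out;
    out.reserve(rows.size());
    for (size_t row : rows)
        out.push_back(merged.data[row]);
    return out;
}
```

The cost of this design is the extra copy when merging; whether that is worthwhile depends on the workload and is not something this sketch measures.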

@liuneng1994
Contributor Author

image

@liuneng1994
Contributor Author

liuneng1994 commented Feb 27, 2024

@vdimir Ready for review. The performance regression in ANY join has been resolved.
Fortunately, I found that this version of the code also greatly improves some test cases in joins_in_memory.
image
image
Multiple CI runs have shown the performance improvement. The specific reasons for the improvement still need to be analyzed; the core code hasn't changed much.

@liuneng1994
Contributor Author

@vdimir any comments?

apply_default();
const auto & column_from_block = reinterpret_cast<const Block *>(lazy_output.blocks[j])->getByPosition(right_indexes[i]);
/// If it's joinGetOrNull, we need to wrap not-nullable columns in StorageJoin.
if (is_join_get)
Member

I assume is_join_get = false in all cases, because we call joinBlockImpl with Any for dictGet. But I'm not 100% sure, and if something changes we would get wrong behavior, so we can keep this if.

Contributor Author

I don’t fully understand the logic here. To avoid introducing unnecessary bugs, I have kept it as is.

@vdimir
Member

vdimir commented Mar 5, 2024

Thank you for your effort!
I'll merge the PR once the pending A Sync check is finished.

@vdimir vdimir merged commit a8eeb89 into ClickHouse:master Mar 5, 2024
260 of 268 checks passed
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 5, 2024
Labels
can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

6 participants