-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lazy build join output to improve performance of ALL join #58278
Conversation
This is an automated comment for commit 4af3395 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page Successful checks
|
be74588
to
ad6ce16
Compare
Aarch64 |
@alexey-milovidov @vdimir any comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it slipped my mind.
In general looks reasonable. Maybe we can also introduce a way to support old behaviour, so if we found any cases with degradation we could fall back? But I'm not sure how to fall back, because introducing a setting also doesn't look practical as well.
src/Interpreters/HashJoin.cpp
Outdated
continue; | ||
} | ||
} | ||
col->insertFrom(*column_from_block.column, lazy_output.row_nums[j]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
insert is a virtual function , it maybe heavy to call it repeatly in a loop. Will sth like insertSelective() helps perf here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's bottleneck or not, but it's worth trying, perhaps in another PR
I have previously verified that as long as any join uses the original algorithm, the performance regression problem can be solved. I can add a subclass LazyAddedColumns to AddedColumns, and select which one to use in HashJoin::joinBlockImpl based on whether the join type is any join. Is this plan acceptable? |
Yes, sounds good 👍 |
448a664
to
2279885
Compare
7e5f944
to
c1df83a
Compare
@vdimir Ready for review. The performance regression problem of any join has been solved. |
@vdimir any comments? |
apply_default(); | ||
const auto & column_from_block = reinterpret_cast<const Block *>(lazy_output.blocks[j])->getByPosition(right_indexes[i]); | ||
/// If it's joinGetOrNull, we need to wrap not-nullable columns in StorageJoin. | ||
if (is_join_get) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume is_join_get = false
in all cases, because we call joinBlockImpl
with Any
for dictGet. But I'm not 100% sure, also if something changed we will get a wrong behavior, so we can keep this if
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t particularly understand the logic here. In order not to cause unnecessary bugs, I have retained this logic.
Thank you for your effort! |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Lazy build join output to improve performance of ALL join