
feat(planner): Implement hash inner join #5175

Merged (2 commits) on May 7, 2022

Conversation

leiysky (Contributor) commented May 5, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Initial version of hash join.

Only inner joins with an ON clause are supported, for example:

```sql
select * from t inner join t1 on t.a = t1.a
```
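For readers unfamiliar with the technique: a hash inner join has two phases, building a hash table on the join key of one input and probing it with the other. Below is a minimal self-contained sketch, not the PR's actual code; columns are simplified to plain `i64` vectors and a `HashMap` stands in for the real hash table.

```rust
use std::collections::HashMap;

// Build phase: hash the build side (right table) into a multimap
// from join key to row index.
fn build(build_keys: &[i64]) -> HashMap<i64, Vec<usize>> {
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, key) in build_keys.iter().enumerate() {
        table.entry(*key).or_default().push(row);
    }
    table
}

// Probe phase: for each probe-side row, emit every matching
// (probe_row, build_row) pair.
fn probe(table: &HashMap<i64, Vec<usize>>, probe_keys: &[i64]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for (probe_row, key) in probe_keys.iter().enumerate() {
        if let Some(rows) = table.get(key) {
            for build_row in rows {
                out.push((probe_row, *build_row));
            }
        }
    }
    out
}

fn main() {
    // t.a = [1, 2, 3] joined with t1.a = [2, 3, 4]: only keys 2 and 3 match.
    let table = build(&[2, 3, 4]);
    let pairs = probe(&table, &[1, 2, 3]);
    assert_eq!(pairs, vec![(1, 0), (2, 1)]);
}
```

For the example query above, only the rows whose keys appear on both sides survive the probe.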

Changelog

  • New Feature

Related Issues

Fixes #issue

@leiysky leiysky added the C-feature Category: feature label May 5, 2022
@leiysky leiysky requested a review from zhang2014 May 5, 2022 13:07

mergify bot commented May 5, 2022

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label May 5, 2022
Xuanwo (Member) commented May 5, 2022

Join the world!

Comment on lines +56 to +66
```rust
let (root_pipeline, pipelines) = planner.plan_sql(self.query.as_str()).await?;
let async_runtime = self.ctx.get_storage_runtime();
let executor = PipelinePullingExecutor::try_create(async_runtime, pipeline)?;

// Spawn sub-pipelines
for pipeline in pipelines {
    let executor = PipelineExecutor::create(async_runtime.clone(), pipeline)?;
    executor.execute()?;
}

// Spawn root pipeline
let executor = PipelinePullingExecutor::try_create(async_runtime, root_pipeline)?;
```
leiysky (Contributor, Author):

@zhang2014

I found it tricky to put the parallel pipelines into the same NewPipeline.

In the current implementation, it spawns sub-tasks (i.e. building the hash table) with PipelineExecutor, and spawns the root task (i.e. the pipeline that probes the hash table) with PipelinePullingExecutor.

The root task will be blocked until the hash table building is finished, which is achieved by polling the state of HashJoinState.
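The blocking described here can be sketched with a shared flag: the build sub-task publishes completion, and the root task polls until it sees it. `JoinState` below is a hypothetical stand-in for the PR's HashJoinState, and the sleep-poll loop is a simplification of whatever scheduling the real executor does.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Stand-in for HashJoinState: the probe side polls `build_done`
// until the build side has finished populating the hash table.
struct JoinState {
    build_done: AtomicBool,
}

// Returns true once the (simulated) build phase is observed as finished.
fn run_join_sync() -> bool {
    let state = Arc::new(JoinState { build_done: AtomicBool::new(false) });

    // Sub-task: build the hash table, then publish completion.
    let builder = {
        let s = Arc::clone(&state);
        thread::spawn(move || {
            // ... build hash table here ...
            s.build_done.store(true, Ordering::Release);
        })
    };

    // Root task: block (by polling) until the build phase is done.
    while !state.build_done.load(Ordering::Acquire) {
        thread::sleep(Duration::from_millis(1));
    }
    builder.join().unwrap();
    state.build_done.load(Ordering::Acquire)
}

fn main() {
    assert!(run_join_sync());
}
```

The Release/Acquire pair ensures that writes made while building the table are visible to the probing task once it observes the flag.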

Member:

Spawning sub-tasks is OK. I'll work on this task.

We need to add a NewPipe variant (zipping the nested pipeline's last pipe with the outer pipeline's previous pipe):

```rust
NewPipe::NestedPipeline {
    nested_pipeline: NewPipeline,
    processors: Vec<ProcessorPtr>,
    outputs_port: Vec<Arc<OutputPort>>,
    inputs_port: Vec<(Arc<InputPort>, Arc<InputPort>)>,
}
```

leiysky (Contributor, Author):

@zhang2014 That's amazing.

I suggest making nested_pipeline a Vec<NewPipeline>, so that in the future we can use it to support multi-way UNION and multi-way join.
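Sketched as a type, that suggestion would look roughly like the following. This is a hypothetical shape, not the merged code: the field name `nested_pipelines` is illustrative, and the ports/processors fields from the earlier proposal are omitted so the example stays self-contained.

```rust
// Placeholder for the real NewPipeline type.
struct NewPipeline {
    name: String,
}

// A pipe that owns several child pipelines, so one NestedPipeline
// can zip N inputs (multi-way UNION, multi-way join).
enum NewPipe {
    NestedPipeline {
        nested_pipelines: Vec<NewPipeline>,
    },
}

fn child_count(pipe: &NewPipe) -> usize {
    match pipe {
        NewPipe::NestedPipeline { nested_pipelines } => nested_pipelines.len(),
    }
}

fn main() {
    let pipe = NewPipe::NestedPipeline {
        nested_pipelines: vec![
            NewPipeline { name: "build".into() },
            NewPipeline { name: "probe".into() },
        ],
    };
    // A two-way join carries two child pipelines; a 3-way UNION would carry three.
    assert_eq!(child_count(&pipe), 2);
}
```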

query/src/sql/exec/mod.rs (outdated review thread, resolved)
```rust
let columns: Vec<&ColumnRef> = columns_vec.iter().collect();
Ok(match &self.hash_method_kind {
    HashMethodKind::Serializer(method) => method
        .build_keys(&columns, row_count)?
```
Member:

I did not find any reason to use hash_method_kind here.
Evaluating xxhash64 may be a simpler way to do it.

leiysky (Contributor, Author):

I was trying to reuse HashMethod, but it seems the hashing requirements here are different from Aggregate's.

We only need a hashed u64 value here. Maybe I can just compute a hash for each column and combine them together?
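That per-column hash-and-combine idea can be sketched as follows. This is an illustration, not the PR's code: std's DefaultHasher stands in for whichever hash function the planner ends up using, and the mixing constant is the 64-bit golden ratio familiar from hash_combine-style mixers.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a single column value to a u64.
fn hash_value<T: Hash>(v: &T) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

// Fold a new column hash into the running row hash
// (in the spirit of boost::hash_combine, adapted to u64).
fn combine(seed: u64, h: u64) -> u64 {
    seed ^ h
        .wrapping_add(0x9e37_79b9_7f4a_7c15)
        .wrapping_add(seed << 6)
        .wrapping_add(seed >> 2)
}

// Single u64 hash for a two-column row (one integer, one string column).
fn row_hash(a: i64, b: &str) -> u64 {
    combine(combine(0, hash_value(&a)), hash_value(&b))
}

fn main() {
    // Deterministic per row, and sensitive to each column.
    assert_eq!(row_hash(1, "x"), row_hash(1, "x"));
    assert_ne!(row_hash(1, "x"), row_hash(2, "x"));
}
```

Unlike the keys HashMethod builds for aggregation, this yields only a u64, so equal hashes still require an equality check on the join keys to rule out collisions.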

sundy-li (Member) commented May 7, 2022:

HashMethod is optimized for the group-aggregation case: it produces unique keys (no collisions) but uses more memory.

Here you only need a hashed u64 value, so HashMethod may be unnecessary unless you store your state in a HashTable.

Member:

I got your point; right now the hash function may not work for multiple columns. Let's keep it as is for now and refactor it in the future.

leiysky (Contributor, Author) commented May 7, 2022:

@sundy-li I have removed HashMethod and used the hash function instead. PTAL

Maybe we can make HashMethod a more common component later.

```rust
let logical_inner_join: LogicalInnerJoin = plan.try_into()?;

let result = SExpr::create(
    PhysicalHashJoin {
```
Member:

In the future, when our statistics module matures, the build side can be determined based on statistical information.

leiysky (Contributor, Author):

Yes. For now we always choose the right side as the build side. The join commutativity rule can help us enumerate the candidates.
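The policy discussed here could eventually become a small cost rule: with cardinality estimates, build on the smaller input; without statistics, keep the fixed right-side choice. The sketch below is hypothetical, not part of this PR.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Side {
    Left,
    Right,
}

// Pick the build side of a hash join from optional row-count estimates.
fn choose_build_side(left_rows: Option<u64>, right_rows: Option<u64>) -> Side {
    match (left_rows, right_rows) {
        // With estimates for both inputs, build on the smaller relation:
        // the hash table then fits in less memory and builds faster.
        (Some(l), Some(r)) if l < r => Side::Left,
        // Without usable statistics, fall back to the PR's fixed choice.
        _ => Side::Right,
    }
}

fn main() {
    assert_eq!(choose_build_side(Some(10), Some(1_000)), Side::Left);
    assert_eq!(choose_build_side(None, None), Side::Right);
}
```

The commutativity rule supplies both join orders; a rule like this (or a full cost model) then keeps the cheaper one.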

```rust
    fields.push(field.clone());
}

DataSchemaRefExt::create(fields)
```
Member:

👍, other build_xxx methods can also use the API.

BohuTANG (Member) commented May 7, 2022

Oops, Conflicting files :\

leiysky (Contributor, Author) commented May 7, 2022

> Oops, Conflicting files :\

I'm fixing that.

@BohuTANG BohuTANG requested a review from sundy-li May 7, 2022 12:28
sundy-li (Member) commented May 7, 2022

/LGTM

@BohuTANG BohuTANG merged commit d824fd9 into databendlabs:main May 7, 2022
```sql
select * from t inner join t1 on cast(t.a as float) = t1.b;
select * from t inner join t2 on t.a = t2.c;
select * from t inner join t2 on t.a = t2.c + 1;
```
Member:

I did not get it: why does this query return empty results?

leiysky (Contributor, Author):

It looks like a bug; this query should output:

```
a c
----
2 1
3 2
```

just as the following query does.

I'll fix it.

Member:

`select null+1;` will panic in v2:

```
2022-05-07T14:07:37.182626Z ERROR common_tracing::panic_hook: panicked at 'called `Option::unwrap()` on a `None` value', query/src/sql/planner/binder/project.rs:44:59 backtrace=Backtrace [{ fn: "common_tracing::panic_hook::set_panic_hook::{{closure}}", file: "./common/tracing/src/panic_hook.rs", line: 25 }, { fn: "std::panicking::rust_panic_with_hook",
```

Labels: C-feature (Category: feature), need-review, pr-feature (this PR introduces a new feature to the codebase)

7 participants