Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pushdown more when visit Project Node. #1114

Closed

Conversation

zijian-qiao
Copy link
Contributor

Due to commit 6dcd67 Inline only simple expressions in predicate pushdown,
the SQL with pattern like

SELECT fact.field_x
FROM hive.fact.table as fact
JOIN hive.dim.table as dimension 
ON fact.field_y = dimension.field_y AND fact.field_z = 12345
WHERE
(partition_key = TIMESTAMP '2019-04-11' OR partition_key = TIMESTAMP '2019-05-11')

will scan full table data rather than the data in such 2 partitions.

One of the reason is the function isInliningCandidate thinks references to complex expressions that appear only once (entry.getValue() == 1) can go through the predicate pushdown on Project node, but partition_key appears 2 times in above case.
I think the above pattern is common, and == 1 is too aggressive, so maybe <= 4(8/16) is more reasonable.

@cla-bot cla-bot bot added the cla-signed label Jul 12, 2019
@sopel39
Copy link
Member

sopel39 commented Jul 12, 2019

Could you post EXPLAIN of such plan (e.g where projection caused filter to be stuck above it).

// which come from the node, as opposed to an enclosing scope.
Set<Symbol> childOutputSet = ImmutableSet.copyOf(node.getOutputSymbols());
Map<Symbol, Long> dependencies = SymbolsExtractor.extractAll(expression).stream()
.filter(childOutputSet::contains)
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

return dependencies.entrySet().stream()
.allMatch(entry -> entry.getValue() == 1 || node.getAssignments().get(entry.getKey()) instanceof Literal);
.allMatch(entry -> entry.getValue() <= 4 || node.getAssignments().get(entry.getKey()) instanceof Literal);
Copy link
Member

@sopel39 sopel39 Jul 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing it to 4 might cause planning to explode exponentially (in fact even 2 might cause it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it won't, because 4 here means that the biggest explosion is limited to 4.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will cause some explosion in the number of terms. E.g., for a query like:

    WITH
    t1 (v) AS (VALUES 1),
    t2 AS( select if(v = 0, v, v) v from t1 ),
    t3 AS( select if(v = 0, v, v) v from t2 ),
    t4 AS( select if(v = 0, v, v) v from t3 ),
    t5 AS( select if(v = 0, v, v) v from t4 ),
    t6 AS( select if(v = 0, v, v) v from t5 ),
    t7 AS( select if(v = 0, v, v) v from t6 ),
    t8 AS( select if(v = 0, v, v) v from t7 ),
    t9 AS( select if(v = 0, v, v) v from t8 ),
    t10 AS( select if(v = 0, v, v) v from t9 ),
    t11 AS( select if(v = 0, v, v) v from t10 ),
    t12 AS( select if(v = 0, v, v) v from t11 ),
    t13 AS( select if(v = 0, v, v) v from t12 ),
    t14 AS( select if(v = 0, v, v) v from t13 ),
    t15 AS( select if(v = 0, v, v) v from t14 ),
    t16 AS( select if(v = 0, v, v) v from t15 )
    select *
    from t16
    where v = 0

One of the filters will be:

((CASE 
  WHEN ((CASE WHEN ("expr_62" = 0) THEN "expr_62" ELSE "expr_62" END) = 0) 
  THEN 
    (CASE WHEN ("expr_62" = 0) THEN "expr_62" ELSE "expr_62" END) 
  ELSE 
    (CASE WHEN ("expr_62" = 0) THEN "expr_62" ELSE "expr_62" END) END
) = 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I think this explosion is controlled by <= 4. Or we can change it smaller, but in my case, == 1 will make OR's pushdown fail.
It's just a workaround I think, not a final solution of OR's pushdown.

@sopel39
Copy link
Member

sopel39 commented Jul 12, 2019

It seems that such OR expressions should be really rewritten to IN clause. I've created a ticket for that: #1118

@zijian-qiao
Copy link
Contributor Author

Could you post EXPLAIN of such plan (e.g where projection caused filter to be stuck above it).

Sure.

- Output[a_id] => [b_id:bigint]
        a_id := b_id
    - RemoteExchange[GATHER] => b_id:bigint
        - InnerJoin[("c_id" = "c_id_0")][$hashvalue, $hashvalue_10] => [b_id:bigint]
                Distribution: PARTITIONED
            - RemoteExchange[REPARTITION][$hashvalue] => c_id:bigint, b_id:bigint, $hashvalue:bigint
                - FilterProject[filterPredicate = ((CAST("event_date" AS timestamp) = "$literal$timestamp"(1515628800000)) OR (CAST("event_date" AS timestamp) = "$literal$timestamp"(1515801600000)))] => [c_id:bigint, b_id:bigint, $hashvalue_9:bigint]
                    - ScanFilterProject[table = db:aggregate:cpx, filterPredicate = ("n_id" = BIGINT '169843')] => [event_date:date, c_id:bigint, b_id:bigint, $hashvalue_9:bigint]
                            $hashvalue_9 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("c_id"), 0))
                            LAYOUT: aggregate.cpx
                            n_id := n_id:bigint:1:REGULAR
                            event_date := event_date:date:-1:PARTITION_KEY
                                :: [[2016-06-01, 2018-03-04]]       ### here, it will scan the whole table
                            c_id := c_id:bigint:17:REGULAR
                            b_id := b_id:bigint:14:REGULAR
            - LocalExchange[HASH][$hashvalue_10] ("c_id_0") => c_id_0:bigint, $hashvalue_10:bigint
                - RemoteExchange[REPARTITION][$hashvalue_11] => c_id_0:bigint, $hashvalue_11:bigint
                    - ScanProject[table = db:default:d_id] => [c_id_0:bigint, $hashvalue_12:bigint]
                            $hashvalue_12 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("c_id_0"), 0))
                            LAYOUT: default.d_id
                            c_id_0 := c_id:bigint:1:REGULAR

And when it set to 4, the plan will be:

- Output[a_id] => [b_id:bigint]
        a_id := b_id
    - RemoteExchange[GATHER] => b_id:bigint
        - InnerJoin[("c_id" = "c_id_0")][$hashvalue, $hashvalue_10] => [b_id:bigint]
                Distribution: PARTITIONED
            - RemoteExchange[REPARTITION][$hashvalue] => c_id:bigint, b_id:bigint, $hashvalue:bigint
                - ScanFilterProject[table = db:aggregate:cpx, filterPredicate = (("n_id" = BIGINT '169843') AND ((CAST("event_date" AS timestamp) = "$literal$timestamp"(1515628800000)) OR (CAST("event_date" AS timestamp) = "$literal$timestamp"(1515801600000))))] => [c_id:bigint, b_id:bigint, $hashvalue_9:bigint]
                        $hashvalue_9 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("c_id"), 0))
                        LAYOUT: aggregate.cpx
                        n_id := n_id:bigint:1:REGULAR
                        event_date := event_date:date:-1:PARTITION_KEY
                            :: [[2018-01-11], [2018-01-13]]     ### It will scan only 2 partitions
                        c_id := c_id:bigint:17:REGULAR
                        b_id := b_id:bigint:14:REGULAR
            - LocalExchange[HASH][$hashvalue_10] ("c_id_0") => c_id_0:bigint, $hashvalue_10:bigint
                - RemoteExchange[REPARTITION][$hashvalue_11] => c_id_0:bigint, $hashvalue_11:bigint
                    - ScanProject[table = db:default:d_id] => [c_id_0:bigint, $hashvalue_12:bigint]
                            $hashvalue_12 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("c_id_0"), 0))
                            LAYOUT: default.d_id
                            c_id_0 := c_id:bigint:1:REGULAR

@zijian-qiao
Copy link
Contributor Author

It seems that such OR expressions should be really rewritten to IN clause. I've created a ticket for that: #1118

OK. Thanks!

@sopel39
Copy link
Member

sopel39 commented Sep 25, 2019

Closed via #1515

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants