
Prune Nested Fields for Parquet Columns #5547

Closed
wants to merge 1 commit into from

Conversation

zhenxiao
Collaborator

Read necessary fields only for Parquet nested columns
Currently, Presto will read all the fields in a struct for Parquet columns.
e.g.

select s.a, s.b
from t

If the table is a Parquet file with a struct column s: {a int, b double, c long, d float}, Presto currently reads a, b, c, and d from s and outputs just a and b.

For columnar storage formats such as Parquet or ORC, we can do better by reading only the necessary fields. In the previous example, we would read just {a int, b double} from s and skip the other fields to save IO.

This patch introduces an optional NestedFields member in ColumnHandle. When optimizing the plan, the PruneNestedColumns optimizer visits expressions and puts candidate nested fields into the ColumnHandle. When scanning Parquet files, the record reader can then use NestedFields to request only the necessary fields.
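
As a rough sketch of the idea (not code from this patch; the class and method names below are hypothetical), pruning a Parquet struct schema down to the requested fields could look like this:

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    import parquet.schema.GroupType;
    import parquet.schema.Type;

    public final class NestedFieldPruner
    {
        private NestedFieldPruner() {}

        // Keep only the requested fields of a struct column, e.g. {"a", "b"} for column s
        public static GroupType pruneStruct(GroupType struct, Set<String> requestedFields)
        {
            List<Type> keptFields = struct.getFields().stream()
                    .filter(field -> requestedFields.contains(field.getName()))
                    .collect(Collectors.toList());
            // Rebuild the group with the same repetition and name, but only the kept fields
            return new GroupType(struct.getRepetition(), struct.getName(), keptFields);
        }
    }

The record reader would then hand the pruned schema to Parquet as the requested projection, so the other fields are never materialized.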

This has a dependency on @jxiang's #4714, which gives us the flexibility to specify metastore schemas differently from Parquet file schemas.

@dain @martint @electrum @cberner @erichwang any comments are appreciated

@ghost ghost added the CLA Signed label Jun 30, 2016
@@ -204,7 +204,7 @@
<dependency>
<groupId>com.facebook.presto</groupId>
<artifactId>presto-main</artifactId>
<scope>test</scope>
<scope>provided</scope>
Contributor

The hive connector shouldn't have a dependency on presto-main. In fact, the classes in presto-main are not guaranteed to be available to plugins (due to classloader isolation)

Collaborator Author

Got it. We really need RowType and RowField, which are in presto-main, not in the SPI. Is there any special reason RowType (MapType, etc.) is not in the SPI? Could we move RowType into the SPI? Or maybe duplicate the code in presto-hive? @martint @cberner @dain

@zhenxiao zhenxiao force-pushed the nested-pruning branch 2 times, most recently from a3a8c62 to da929c1 Compare July 4, 2016 02:13
List<parquet.schema.Type> fieldTypes = entryType.getFields();

if (useNames) {
this.converters = createConvertersByName(columnName, ((RowType) prestoType).getFields(), fieldTypes);
Collaborator Author

@cberner we need to get RowType's fields and use the RowField name to look up the corresponding Parquet schema in the Parquet files

Contributor

Yeah, you should be able to just use the TypeSignature; all the names will be in there

@zhenxiao
Collaborator Author

zhenxiao commented Jul 8, 2016

With @cberner's help, I addressed the comments; the presto-main dependency is no longer needed.
@cberner @dain @martint @electrum comments or suggestions are appreciated

@ghost ghost added the CLA Signed label Jul 12, 2016
@cberner
Contributor

cberner commented Jul 12, 2016

@nezihyigitbasi do you have time to take a first look at this?

@nezihyigitbasi
Contributor

@cberner sure, will do.

@nezihyigitbasi
Contributor

@zhenxiao I gave this patch a try and I think I hit a bug. Once you fix the issue I will continue reviewing.

I first enabled the new optimizer in the config file and then created a test table

presto:nyigitbasi> create table nested_field_test as select CAST(row(1, 'a', 'b', 'c') AS ROW(f1 integer, f2 varchar, f3 varchar, f4 varchar)) as row_field;
CREATE TABLE: 1 row

presto:nyigitbasi> desc nested_field_test;
  Column   |                        Type                         | Comment
-----------+-----------------------------------------------------+---------
 row_field | row(F1 integer, F2 varchar, F3 varchar, F4 varchar) |
(1 row)

then when I query the table it fails:

presto:nyigitbasi> select row_field.F1 from nested_field_test;

Query 20160713_002306_00019_he7h3, FAILED, 1 node
http://localhost:8080/query.html?20160713_002306_00019_he7h3
Splits: 2 total, 0 done (0.00%)
CPU Time: 0.5s total,     0 rows/s,     0B/s, 37% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.0
1:13 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20160713_002306_00019_he7h3 failed: Error opening Hive split s3n://netflix-dataoven-prod-users/hive/warehouse/nyigitbasi.db/nested_field_test/20160713_002223_00017_he7h3_a34c4336-28ce-4a72-9253-3529044d3710 (offset=0, length=548): Schema mismatch, metastore schema for row column row_field has 4 fields but parquet schema has 1 fields
com.facebook.presto.spi.PrestoException: Error opening Hive split s3n://netflix-dataoven-prod-users/hive/warehouse/nyigitbasi.db/nested_field_test/20160713_002223_00017_he7h3_a34c4336-28ce-4a72-9253-3529044d3710 (offset=0, length=548): Schema mismatch, metastore schema for row column row_field has 4 fields but parquet schema has 1 fields
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:489)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.<init>(ParquetHiveRecordCursor.java:254)
    at com.facebook.presto.hive.parquet.ParquetRecordCursorProvider.createHiveRecordCursor(ParquetRecordCursorProvider.java:96)
    at com.facebook.presto.hive.HivePageSourceProvider.getHiveRecordCursor(HivePageSourceProvider.java:129)
    at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:107)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:48)
    at com.facebook.presto.operator.ScanFilterAndProjectOperator.createSourceIfNecessary(ScanFilterAndProjectOperator.java:292)
    at com.facebook.presto.operator.ScanFilterAndProjectOperator.isFinished(ScanFilterAndProjectOperator.java:180)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:375)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:618)
    at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:529)
    at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:665)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Schema mismatch, metastore schema for row column row_field has 4 fields but parquet schema has 1 fields
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:145)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor$ParquetStructConverter.createConverters(ParquetHiveRecordCursor.java:864)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor$ParquetStructConverter.<init>(ParquetHiveRecordCursor.java:946)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createGroupConverter(ParquetHiveRecordCursor.java:821)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.access$300(ParquetHiveRecordCursor.java:125)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor$PrestoReadSupport.<init>(ParquetHiveRecordCursor.java:534)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:428)
    ... 16 more

But if I disable the new optimizer I can query the table fine.

presto:nyigitbasi> set session optimize_nested_columns=false;
SET SESSION
presto:nyigitbasi> select row_field.F1 from nested_field_test;
 f1
----
  1
(1 row)

@zhenxiao
Collaborator Author

@nezihyigitbasi Thank you, I got it fixed. Your comments are appreciated.

@ghost ghost added the CLA Signed label Jul 13, 2016
@@ -497,17 +505,19 @@ public PrestoParquetRecordReader(PrestoReadSupport readSupport)
private final boolean useParquetColumnNames;
private final List<HiveColumnHandle> columns;
private final List<Converter> converters;
private final TypeManager typeManager;
Contributor

Do you need this as a field? It is not used anywhere; the constructor arg typeManager is passed to the createGroupConverter() call.

@zhenxiao
Collaborator Author

Thank you, @mbasmanova.
I addressed the comments.

Contributor

@mbasmanova mbasmanova left a comment

@zhenxiao Some more comments.

return ExpressionTreeRewriter.rewriteWith(new DereferenceReplacer(replacements), expression);
}

protected static class DereferenceReplacer
Contributor

private

return false;
}

protected static List<DereferenceExpression> extractDereference(Expression expression)
Contributor

since this method returns a list, extractDereferenceExpressions might be a better name

return filterNode;
}

List<DereferenceExpression> predicates = extractDereference(filterNode.getPredicate());
Contributor

This variable contains a list of dereference expressions used in the predicate, but the expressions themselves are not predicates, e.g. a.b < 10 is a predicate, but a.b is not. Since this variable is used only once, it can be inlined, which removes the need to come up with a better name.

return Result.ofPlanNode(target);
}

protected abstract PlanNode dereferencePushDown(Context context, N targetNode, Map<DereferenceExpression, Symbol> expressions, Assignments assignments);
Contributor

as a general rule, method names start with a verb, e.g. pushDownDereferences

return projectNode;
}

Map<Symbol, Expression> pushdownDereferences = pushdownExpressions.entrySet().stream().collect(toImmutableMap(Map.Entry::getValue, Map.Entry::getKey));
Contributor

This variable is not needed. You can swap key and value in pushdownExpressions instead.

        Map<Symbol, DereferenceExpression> pushdownExpressions = expressions.entrySet().stream()
                .filter(entry -> !isCastRowType(entry.getKey()))
                .filter(entry -> outputSymbols.contains(getOnlyElement(extractAll(entry.getKey()))))
                .collect(toImmutableMap(Map.Entry::getValue, Map.Entry::getKey));
...
        ImmutableMap.Builder<Symbol, Symbol> symbolsBuilder = ImmutableMap.builder();
        pushdownExpressions.entrySet().stream()
                .forEach(entry -> symbolsBuilder.put(getOnlyElement(extractAll(entry.getValue())), entry.getKey()));
        Map<Symbol, Symbol> symbolsMap = symbolsBuilder.build();
...
        DereferenceExpression targetDereference = pushdownExpressions.get(targetSymbol);

return joinNode;
}

Map<Symbol, Expression> pushdownDereferences = pushdownExpressions.entrySet().stream().collect(toImmutableMap(Map.Entry::getValue, Map.Entry::getKey));
Contributor

similar to other rules, this variable is not needed


Map<Symbol, Expression> pushdownDereferences = pushdownExpressions.entrySet().stream().collect(toImmutableMap(Map.Entry::getValue, Map.Entry::getKey));
ImmutableMap.Builder<Symbol, Symbol> symbolsBuilder = ImmutableMap.builder();
for (Map.Entry<Expression, Symbol> entry : pushdownExpressions.entrySet()) {
Contributor

consider

        pushdownExpressions.entrySet().stream()
                .forEach(entry -> symbolsBuilder.put(getOnlyElement(extractAll(entry.getValue())), entry.getKey()));

.map(EquiJoinClause::toExpression)
.map(expression -> replaceDereferences(expression, expressions))
.map(this::getEquiJoinClause)
.collect(toImmutableList());
Contributor

EquiJoinClause can only reference symbols; it cannot contain dereference expressions, can it?

Assignments assignments = assignmentsBuilder.build();

PlanNode result = dereferencePushDown(context, child, expressions, assignments);
if (result.getId().equals(child.getId())) {
Contributor

I think it would be clearer if dereferencePushDown returned a Result. Then this check would be simplified to if (result.isEmpty())

@mbasmanova mbasmanova self-assigned this Mar 19, 2019
Collaborator Author

@zhenxiao zhenxiao left a comment

Thank you, @mbasmanova.
I addressed the comments.

Symbol targetSymbol = symbolsMap.get(entry.getKey());
DereferenceExpression targetDereference = (DereferenceExpression) pushdownDereferences.get(targetSymbol);
if (entry.getValue() instanceof DereferenceExpression) {
sourceBuilder.put(targetSymbol, targetDereference);
Collaborator Author

You are correct, this never happens; it is dead code. I will fix it.

Contributor

@mbasmanova mbasmanova left a comment

@zhenxiao A few questions.

return context.getSymbolAllocator().newSymbol(expression, type);
}

private static boolean prefixExist(DereferenceExpression expression, final Set<DereferenceExpression> dereferences)
Contributor

drop final

Expression base = expression.getBase();
while (base != null) {
if (base instanceof SymbolReference) {
return dereferences.contains(base);
Contributor

dereferences is a set of DereferenceExpression and base is an instance of SymbolReference; will this ever return true? I don't think so. Looks like this whole if statement can be deleted and the outer loop simplified to

        while (base instanceof DereferenceExpression) {
            if (dereferences.contains(base)) {
                return true;
            }
            base = ((DereferenceExpression) base).getBase();
        }

return context.getSymbolAllocator().newSymbol(expression, type);
}

private static boolean prefixExist(DereferenceExpression expression, final Set<DereferenceExpression> dereferences)
Contributor

baseExists might be a better name


private static Map<DereferenceExpression, Symbol> getDereferenceSymbolMap(ProjectNode node, Context context, Metadata metadata, SqlParser sqlParser)
{
Set<DereferenceExpression> expressions = extractExpressionsNonRecursive(node).stream()
Contributor

node.getAssignments().getExpressions() is equivalent but easier to understand

for (Map.Entry<Symbol, Expression> entry : node.getAssignments().entrySet()) {
assignmentsBuilder.put(entry.getKey(), ExpressionTreeRewriter.rewriteWith(new DereferenceReplacer(expressions), entry.getValue()));
}
Assignments assignments = assignmentsBuilder.build();
Contributor

Simplify using Assignments::rewrite:

Assignments assignments = node.getAssignments().rewrite(new DereferenceReplacer(expressions));

@Test
public void testDereferencePushdownProject()
{
assertPlan("WITH t1 as ( SELECT * FROM (values ROW(CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)))) as t (msg) ) SELECT msg.x FROM t1 WHERE msg.x > 10",
Contributor

Reformat for readability (capitalize SQL keywords, split into multiple lines, replace t1 with t(msg)):

        assertPlan("WITH t(msg) AS (SELECT * FROM (VALUES ROW(CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE))))) " +
                        "SELECT msg.x FROM t WHERE msg.x > 10",
                output(ImmutableList.of("x"),
                    project(ImmutableMap.of("x", expression("msg.x")),
                            filter("msg.x > BIGINT '10'",
                                    values("msg")))));

However, I'd expect msg.x projection to be pushed down, e.g. I'd think the plan would look like

Output(x)
   Filter(x > 10)
      Project(x: msg.x)
         Values

Also, I removed the new rules from the optimizer and re-ran the tests. All tests but testDereferencePushdownJoin succeeded. Could you modify the tests to use queries that are affected by the new rules?

Collaborator Author

Let me rework the test cases.
PredicatePushDown will push the filter down just on top of the table scan or values. It is OK to have the dereference just on top of the filter, as long as ScanFilterAndProject has the dereferences, so that virtual nested columns can be generated by leveraging the dereferences in the connector. What do you think?

Contributor

@zhenxiao That's right. This is a subtle point that's not easy to get from reading the rules code. Perhaps modify the rules to not do anything for project(filter(tablescan)) and project(filter(values)). It would also help to update the PR description to give examples of queries where dereference pushdown will be successful and to show queries where the new rules don't make a difference. For the individual rules, consider adding documentation. See com.facebook.presto.sql.planner.iterative.rule.PushProjectionThroughExchange for an example.

Collaborator Author

@zhenxiao zhenxiao left a comment

Thank you, @mbasmanova.
Let me rework the test cases; the other comments are addressed.

Contributor

@mbasmanova mbasmanova left a comment

@zhenxiao Some further questions and comments.

if (result.isEmpty()) {
return Result.empty();
}
ProjectNode target = new ProjectNode(context.getIdAllocator().getNextId(), result.getTransformedPlan().get(), assignments);
Contributor

nit: inline this variable


DereferenceReplacer(Map<DereferenceExpression, Symbol> expressions)
{
this.expressions = expressions;
Contributor

requireNonNull: this.expressions = requireNonNull(expressions, "expressions is null");

new DereferencePushDownFilter(metadata, sqlParser),
new DereferencePushDownJoin(metadata, sqlParser),
new DereferencePushDownProject(metadata, sqlParser),
new DereferencePushDownSort(metadata, sqlParser));
Contributor

This is a bit verbose. How do you feel about creating a top-level PushDownDereferences class and converting all these rules into inner classes?

dereferencePushDownRules = new PushDownDereferences(metadata, sqlParser).rules();

The dereferencePushDownRules variable can then be inlined.

List<Symbol> outputSymbols = filterNode.getOutputSymbols();
Map<Symbol, Expression> pushdownExpressions = expressions.entrySet().stream()
.filter(entry -> !isCastRowType(entry.getKey()))
.filter(entry -> outputSymbols.contains(getOnlyElement(extractAll(entry.getKey()))))
Contributor

Why is it guaranteed that extractAll(entry.getKey()) returns just one element? I'm thinking of a dereference expression f(a, b).c which contains two symbols: a and b.

{
List<Symbol> outputSymbols = filterNode.getOutputSymbols();
Map<Symbol, Expression> pushdownExpressions = expressions.entrySet().stream()
.filter(entry -> !isCastRowType(entry.getKey()))
Contributor

What's the logic behind checking for cast to row?

.collect(toImmutableMap(Map.Entry::getValue, Map.Entry::getKey));

if (pushdownExpressions.isEmpty()) {
return Result.empty();
Contributor

Why exit here? What if the predicate has dereferences that can be pushed down?


ImmutableList.Builder<Symbol> outputBuilder = ImmutableList.builder();
outputBuilder.addAll(leftChild.getOutputSymbols()).addAll(rightChild.getOutputSymbols());
JoinNode result = new JoinNode(context.getIdAllocator().getNextId(), joinNode.getType(), leftChild, rightChild, joinNode.getCriteria(), outputBuilder.build(), joinFilter, joinNode.getLeftHashSymbol(), joinNode.getRightHashSymbol(), joinNode.getDistributionType());
Contributor

This line is too long. It might be easier to read if rewritten to

        return Result.ofPlanNode(
                new JoinNode(
                    context.getIdAllocator().getNextId(),
                    joinNode.getType(),
                    leftChild,
                    rightChild,
                    joinNode.getCriteria(),
                    ImmutableList.<Symbol>builder()
                            .addAll(leftChild.getOutputSymbols())
                            .addAll(rightChild.getOutputSymbols())
                            .build(),
                    joinNode.getFilter().map(expression -> replaceDereferences(expression, expressions)),
                    joinNode.getLeftHashSymbol(),
                    joinNode.getRightHashSymbol(),
                    joinNode.getDistributionType()));

return Result.ofPlanNode(result);
}

private EquiJoinClause getEquiJoinClause(Expression expression)
Contributor

this method is not used

PlanNode right = joinNode.getRight();

Assignments.Builder leftBuilder = Assignments.builder();
List<Symbol> leftOutputs = left.getOutputSymbols().stream()
Contributor

inline this variable and rightOutputs

        Assignments.Builder leftBuilder = Assignments.builder();
        leftBuilder.putIdentities(left.getOutputSymbols().stream()
                .filter(symbol -> !symbolsMap.containsKey(symbol))
                .collect(toImmutableList()));

        Assignments.Builder rightBuilder = Assignments.builder();
        rightBuilder.putIdentities(right.getOutputSymbols().stream()
                .filter(symbol -> !symbolsMap.containsKey(symbol))
                .collect(toImmutableList()));

return Result.empty();
}

ImmutableMap.Builder<Symbol, Symbol> symbolsBuilder = ImmutableMap.builder();
Contributor

This can be simplified as

        Map<Symbol, Symbol> symbolsMap = pushdownExpressions.entrySet().stream()
                .collect(toImmutableMap(entry -> getOnlyElement(extractAll(entry.getValue())), Map.Entry::getKey));

@zhenxiao
Collaborator Author

Thank you, @mbasmanova.
I addressed the comments.

A few updates:

  1. We need dereference pushdown for Project(Join); otherwise joins would output the whole struct.
  2. We need dereference pushdown for Project(Project) and Project(Sort), so that dereferences can pass through Project and Sort, and down to joins, if there are any.
  3. ScanFilterAndProject will not change, as projections belonging to the same table are merged into ScanFilterAndProject.
  4. We do not need Project(Filter) or Project(Aggregation). The project either already happens at a lower level, or predicate pushdown guarantees Project(Filter(TableScan)), which is OK: ScanFilterAndProject has all the projections, and we can leverage it for the following steps, e.g. getNestedColumnHandles.

Contributor

@mbasmanova mbasmanova left a comment

@zhenxiao Some further questions and comments.

@@ -200,6 +201,7 @@ public PlanOptimizers(
Set<Rule<?>> predicatePushDownRules = ImmutableSet.of(
new MergeFilters());

Set<Rule<?>> dereferencePushDownRules = new PushDownDereferences(metadata, sqlParser).rules();
Contributor

inline this variable

}
}

/**
Contributor

wrong indentation

new DereferencePushDownProject(metadata, sqlParser));
}

public abstract class DereferencePushDownRule<N extends PlanNode>
Contributor

private

* Transforms:
* <pre>
* Project(a.msg.x)
* Join(a.msg.y = b.msg.y)
Contributor

This is confusing because join's equi-clause can't contain expressions. The following might be clearer:

     * Transforms:
     * <pre>
     *  Project(a_x := a.msg.x)
     *    Join(a_y = b_y) => [a]
     *      Project(a_y := a.msg.y)
     *          Source(a)
     *      Project(b_y := b.msg.y)
     *          Source(b)
     *  </pre>
     * to:
     * <pre>
     *  Join(a_y = b_y) => [a_x]
     *    Project(a_x := a.msg.x, a_y := a.msg.y)
     *      Source(a)
     *    Project(b_y := b.msg.y)
     *      Source(b)
     * </pre>

* Source(b)
* </pre>
*/
public class DereferencePushDownJoin
Contributor

PushDownDereferenceThroughJoin might be a better name

}
}

protected static Expression replaceDereferences(Expression expression, Map<DereferenceExpression, Symbol> replacements)
Contributor

private

}
}

protected static List<DereferenceExpression> extractDereferenceExpressions(Expression expression)
Contributor

private

.collect(toImmutableMap(Function.identity(), expression -> newSymbol(expression, context, metadata, sqlParser)));
}

protected static Symbol newSymbol(Expression expression, Context context, Metadata metadata, SqlParser sqlParser)
Contributor

private

@Test
public void testDereferencePushdownJoin()
{
assertPlan("WITH t(msg) AS ( SELECT * FROM (VALUES ROW(CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE))))) " +
Contributor

The field, field1, left and right names are not easy to follow. How about msg, a_y, b_y, etc.?

        assertPlan("WITH t(msg) AS (SELECT * FROM (VALUES ROW(CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE))))) " +
                        "SELECT b.msg.x FROM t a, t b WHERE a.msg.y = b.msg.y",
                output(ImmutableList.of("b_x"),
                        join(INNER, ImmutableList.of(equiJoinClause("a_y", "b_y")),
                                anyTree(
                                        project(ImmutableMap.of("a_y", expression("msg.y")),
                                                values("msg"))
                                ), anyTree(
                                        project(ImmutableMap.of("b_y", expression("msg.y"), "b_x", expression("msg.x")),
                                                values("msg"))))));

Collaborator Author

@zhenxiao zhenxiao left a comment

Thank you, @mbasmanova.
I addressed the comments.
I will add more test-case coverage. Could you please review when you are free?

{
List<Symbol> outputSymbols = joinNode.getOutputSymbols();
Map<Symbol, Expression> pushdownExpressions = expressions.entrySet().stream()
.filter(entry -> !isCastRowType(entry.getKey()))
Collaborator Author

There is no need to push down a (CAST AS ROW).field dereference, since the lower level does not have it as a row.
Correct me if my understanding is wrong.
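
For context, a check along these lines (a hypothetical sketch, not necessarily the patch's actual isCastRowType implementation) would simply look at whether the dereference base is a CAST expression:

    import com.facebook.presto.sql.tree.Cast;
    import com.facebook.presto.sql.tree.DereferenceExpression;
    import com.facebook.presto.sql.tree.Expression;

    final class DereferenceChecks
    {
        private DereferenceChecks() {}

        // Hypothetical sketch: treat a dereference as non-pushable when its base is a CAST,
        // e.g. CAST(... AS ROW(...)).field, because the source below does not expose that row directly.
        static boolean hasCastBase(DereferenceExpression expression)
        {
            Expression base = expression.getBase();
            return base instanceof Cast;
        }
    }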

* Project(a.key)
* Project(a.msg.x)
* Source(a)
* </pre>
Collaborator Author

My bad. I meant that key is a primitive type while msg is a struct type, and we only push down dereferences. Will fix it.

* Sort
* Project(a.msg.x)
* Source(a)
* </pre>
Collaborator Author

My bad, it is pushing the dereference through sort. Will fix it and add unnest.

@zhenxiao zhenxiao force-pushed the nested-pruning branch 2 times, most recently from 099c844 to 4ab0fa9 Compare March 26, 2019 06:53
Contributor

@mbasmanova mbasmanova left a comment

@zhenxiao Some further questions.

{
List<Symbol> outputSymbols = joinNode.getOutputSymbols();
Map<Symbol, Expression> pushdownExpressions = expressions.entrySet().stream()
.filter(entry -> !isCastRowType(entry.getKey()))
Contributor

Indeed, but the same can be said about any dereference where base is not a symbol, e.g. f(a).x. Why is it important to filter out cast, but not other functions? Could you add a test that covers these cases?

List<Symbol> outputSymbols = joinNode.getOutputSymbols();
Map<Symbol, Expression> projectExpressions = expressions.entrySet().stream()
.filter(entry -> !isCastRowType(entry.getKey()))
.filter(entry -> outputSymbols.containsAll(extractAll(entry.getKey())))
Contributor

I still don't understand this filter. entry.getKey() is a dereference expression coming from a projection above this join. extractAll(entry.getKey()) returns all symbols that are used in that dereference expression. outputSymbols are symbols produced by the join; these are inputs to the project node above. With project over join, the project can only use output symbols of the join. Hence, this check must always be true. Hence, it is not needed. Am I missing something? I commented out this check and the above check for cast and ran the tests. All passed.

public Result apply(ProjectNode node, Captures captures, Context context)
{
N child = captures.get(targetCapture);
Map<DereferenceExpression, Symbol> expressions = getDereferenceSymbolMap(node.getAssignments().getExpressions().stream().collect(toImmutableList()), context, metadata, sqlParser);
Contributor

.stream().collect(toImmutableList()) is unnecessary; just change the type of the first argument of getDereferenceSymbolMap to Collection<Expression>.

dereferenceSymbolsBuilder.putAll(expressions);
if (joinNode.getFilter().isPresent()) {
Map<DereferenceExpression, Symbol> predicateSymbols = getDereferenceSymbolMap(ImmutableList.of(joinNode.getFilter().get()), context, metadata, sqlParser).entrySet().stream()
.filter(entry -> !projectExpressions.values().contains(entry.getKey()))
Contributor

Shouldn't this filter use baseExists?

Collaborator Author

No. This is to filter out join-filter dereferences that are not covered in projectExpressions.
baseExists is to filter out dereferences whose base already exists among the other dereferences.

{
return ImmutableSet.of(
new PushDownDereferenceThroughJoin(metadata, sqlParser),
new PushDownDereferenceThroughSort(metadata, sqlParser),
Contributor

Could you scan through all the node types and see if there are more nodes to add here? I'm thinking about AssignUniqueId, MarkDistinct, Limit and SemiJoin.

* </pre>
* to:
* <pre>
* Project(a_x := a_z)
Contributor

This project node is redundant. Why not remove it?

Collaborator Author

There is RemoveRedundantIdentityProjections following PushDownDereferences, so that redundant project node gets removed. I will remove it from the comments. Or do you think we should add RemoveRedundantIdentityProjections to PushDownDereferences?

* </pre>
* to:
* <pre>
* Project(a_x := a_y)
Contributor

This project node is redundant.

* </pre>
* to:
* <pre>
* Project(a_x := a_y)
Contributor

This project node is redundant.

* </pre>
* to:
* <pre>
* Join(a_y = b_y) => [a_x]
Contributor

Other examples show an identity projection in the result, but it is missing here. At the same time the code seems to actually generate an identity projection, and there is no logic to remove it. Let's either add it here or add logic to the rule to remove it.

.collect(toImmutableSet());

return dereferences.stream()
.filter(expression -> !baseExists(expression, dereferences))
Contributor

Tests only cover one-level deep dereferences, e.g. msg.x and msg.y. Would you add tests with more levels to test the logic of consolidating a.b.c and a.b into a.b?

Collaborator Author

@zhenxiao zhenxiao left a comment

Thank you, @mbasmanova.
I addressed the comments.
I will add support for AssignUniqueId, MarkDistinct, Limit, Aggregation, and SemiJoin.

{
List<Symbol> outputSymbols = joinNode.getOutputSymbols();
Map<Symbol, Expression> pushdownExpressions = expressions.entrySet().stream()
.filter(entry -> !isCastRowType(entry.getKey()))
Collaborator Author

Yep, we should only push down dereferences whose base is a DereferenceExpression or SymbolReference. Will fix it.
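
A check along the lines of that fix (a hypothetical sketch, not the committed code) could walk the base chain and require it to bottom out in a plain symbol reference:

    import com.facebook.presto.sql.tree.DereferenceExpression;
    import com.facebook.presto.sql.tree.Expression;
    import com.facebook.presto.sql.tree.SymbolReference;

    final class PushdownChecks
    {
        private PushdownChecks() {}

        // Hypothetical sketch: a dereference is a pushdown candidate only if its base chain
        // consists of dereferences ending in a symbol reference, so f(a).x and CAST(...).x are rejected.
        static boolean hasSymbolReferenceBase(DereferenceExpression expression)
        {
            Expression base = expression.getBase();
            while (base instanceof DereferenceExpression) {
                base = ((DereferenceExpression) base).getBase();
            }
            return base instanceof SymbolReference;
        }
    }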

@mbasmanova
Contributor

@zhenxiao

I addressed the comments.
I will add support for AssignUniqueId, MarkDistinct, Limit, Aggregation, and SemiJoin.

Sounds good. Ping me when the changes are ready for review.

@phd3
Contributor

phd3 commented Apr 15, 2019

@zhenxiao I was trying your patch on some examples and the following query fails with the PushDownDereferences optimizer.

with t(myint, arr) as
          (SELECT * FROM (VALUES (5, CAST(ARRAY[ROW(1,ROW(2,5)),ROW(2,ROW(3,9))] AS ARRAY(ROW(BIGINT, ROW(a BIGINT, b BIGINT)))))))
          select w.colb.a from t cross join unnest(arr) as w(cola, colb)

Plan for this query without the optimizer:

- Output[a] => [expr_16:bigint]
        a := expr_16
    - Project[] => [expr_16:bigint]
            expr_16 := "field_15".a
        - Unnest[replicate=, unnest=field_0:array(row(bigint, row(a bigint, b bigint)))] => [field_14:bigint, field_15:row(a bigint, b bigint)]
            - Values => [field_0:array(row(bigint, row(a bigint, b bigint)))]
                    (CAST($literal$array(row(integer,row(integer,integer)))(from_base64(VARCHAR(184) AwAAAFJPVw...))))

Do you know why that might be the case?

@phd3
Contributor

phd3 commented Apr 15, 2019

@zhenxiao Can you please add some tests for unnest queries on multi-level complex data types in arrays? For example, pushdown in cases where only field y.w.u is accessed after unnesting array(row(x BIGINT, y row(z BIGINT, w row(u BIGINT, v DOUBLE))))?

The test cases will give us better documentation of what this feature does and does not support.

@mbasmanova
Contributor

I assume #13271 supersedes this one, hence closing.

@mbasmanova mbasmanova closed this Aug 27, 2019
@zhenxiao zhenxiao deleted the nested-pruning branch January 22, 2022 15:33