Subfield pruning in Parquet #13271

zhenxiao · 2019-08-22T21:38:39Z

== RELEASE NOTES ==

Hive Changes
--------------
*  Add subfield pruning to reading of Parquet files so that only necessary subfields are extracted from struct columns.

oerling · 2019-08-23T21:27:00Z

What about tests? We’d like to have a and a.b both be structs with s* as scalar members, and then select a, a.s1, a.b, a.b.s2 in different combinations. Both a and a.b should also contain nulls. If this works, it looks good; it is certainly very compact if the functionality is complete.
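One way to cover "different combinations" is to enumerate every non-empty subset of the four projections and run each resulting query. The sketch below only generates the SQL strings; the table name `test_nested` and the class and method names are hypothetical, and nothing here is taken from the PR itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ProjectionCombinations
{
    private ProjectionCombinations() {}

    // Returns one SELECT statement per non-empty subset of the given projections.
    // Subsets are enumerated with a bitmask, so the order is stable.
    static List<String> selectStatements(String table, List<String> projections)
    {
        List<String> queries = new ArrayList<>();
        int subsetCount = 1 << projections.size();
        for (int mask = 1; mask < subsetCount; mask++) {
            List<String> chosen = new ArrayList<>();
            for (int i = 0; i < projections.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    chosen.add(projections.get(i));
                }
            }
            queries.add("SELECT " + String.join(", ", chosen) + " FROM " + table);
        }
        return queries;
    }

    public static void main(String[] args)
    {
        // "test_nested" is a hypothetical table with a struct column a that contains struct b.
        List<String> projections = Arrays.asList("a", "a.s1", "a.b", "a.b.s2");
        for (String query : selectStatements("test_nested", projections)) {
            System.out.println(query);
        }
    }
}
```

Each generated query would then be compared against the same projection computed from unnested source columns, with some rows where a or a.b is null.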

mbasmanova

@zhenxiao Looks nice. It is cool that subfield pruning can be achieved with so few lines of code. As Orri mentioned, it would be good to add some tests. I don't see integration tests for Parquet in the code base, though. I suggest starting by adding TestParquestDistributedQueries, which mimics TestHiveDistributedQueries, and then adding subfield pruning tests there.

Here is a draft:

public class TestParquestDistributedQueries
        extends AbstractTestDistributedQueries
{
    protected TestParquestDistributedQueries()
    {
        super(TestParquestDistributedQueries::createQueryRunner);
    }

    private static QueryRunner createQueryRunner()
            throws Exception
    {
        return HiveQueryRunner.createQueryRunner(getTables(),
                ImmutableMap.of("experimental.pushdown-subfields-enabled", "true"),
                "sql-standard",
                ImmutableMap.of("hive.storage-format", "PARQUET"),
                Optional.empty());
    }

    @Test
    public void testSubfieldPruning()
    {
        getQueryRunner().execute("CREATE TABLE test_subfield_pruning AS " +
                "SELECT orderkey, linenumber, shipdate, " +
                "   CAST(ROW(orderkey, linenumber, ROW(day(shipdate), month(shipdate), year(shipdate))) " +
                "       AS ROW(orderkey BIGINT, linenumber INTEGER, shipdate ROW(ship_day TINYINT, ship_month TINYINT, ship_year INTEGER))) AS info " +
                "FROM lineitem");

        try {
            assertQuery("SELECT info.orderkey, info.shipdate.ship_month FROM test_subfield_pruning", "SELECT orderkey, month(shipdate) FROM lineitem");

            assertQuery("SELECT orderkey FROM test_subfield_pruning WHERE info.shipdate.ship_month % 2 = 0", "SELECT orderkey FROM lineitem WHERE month(shipdate) % 2 = 0");
        }
        finally {
            getQueryRunner().execute("DROP TABLE test_subfield_pruning");
        }
    }

    @Override
    protected boolean supportsNotNullColumns()
    {
        return false;
    }

    @Override
    public void testDelete()
    {
        // Hive connector currently does not support row-by-row delete
    }

    @Test
    public void testExplainOfCreateTableAs()
    {
        String query = "CREATE TABLE copy_orders AS SELECT * FROM orders";
        MaterializedResult result = computeActual("EXPLAIN " + query);
        assertEquals(getOnlyElement(result.getOnlyColumnAsSet()), getExplainPlan(query, LOGICAL));
    }
}

mbasmanova · 2019-08-26T15:52:01Z

presto-parquet/src/main/java/com/facebook/presto/parquet/ParquetTypeUtils.java

+            }
+        }
+
+        List<org.apache.parquet.schema.Type> paths = typeBuilder.build();


I'd rename paths to types or subfieldTypes because this variable holds a list of types, not paths.

mbasmanova · 2019-08-26T15:52:45Z

presto-parquet/src/main/java/com/facebook/presto/parquet/ParquetTypeUtils.java

+        if (paths.isEmpty()) {
+            return new MessageType(subfield.getRootName(), ImmutableList.of());
+        }
+        else {


drop else after return

inline result variable

if (types.isEmpty()) {
    return new MessageType(subfield.getRootName(), ImmutableList.of());
}

org.apache.parquet.schema.Type type = types.get(types.size() - 1);
for (int i = types.size() - 2; i >= 0; --i) {
    GroupType groupType = types.get(i).asGroupType();
    type = new MessageType(groupType.getName(), ImmutableList.of(type));
}
return new MessageType(subfield.getRootName(), ImmutableList.of(type));
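The bottom-up rebuild in the snippet above can be illustrated without the Parquet library. The following is a minimal sketch under stated assumptions: `Node` is a hypothetical stand-in for a schema type (it is not the `org.apache.parquet.schema` API), and the lookup stops at the first path element that does not exist, which is an assumption about the intended behavior rather than something confirmed by the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SchemaPruningSketch
{
    private SchemaPruningSketch() {}

    // Hypothetical schema node: a name plus children (no children means a primitive leaf).
    static final class Node
    {
        final String name;
        final List<Node> children;

        Node(String name, Node... children)
        {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        Node child(String childName)
        {
            for (Node candidate : children) {
                if (candidate.name.equals(childName)) {
                    return candidate;
                }
            }
            return null;
        }

        @Override
        public String toString()
        {
            if (children.isEmpty()) {
                return name;
            }
            List<String> parts = new ArrayList<>();
            for (Node c : children) {
                parts.add(c.toString());
            }
            return name + "{" + String.join(", ", parts) + "}";
        }
    }

    // Keeps only the chain of nodes named by path; the last matched node is kept whole.
    static Node prune(Node root, List<String> path)
    {
        // Walk down the path, collecting the node found at each level.
        List<Node> types = new ArrayList<>();
        Node current = root;
        for (String name : path) {
            current = current.child(name);
            if (current == null) {
                break; // assumption: stop at the first missing subfield
            }
            types.add(current);
        }
        if (types.isEmpty()) {
            return new Node(root.name);
        }
        // Rebuild from the deepest node upward, giving each ancestor a single child.
        Node type = types.get(types.size() - 1);
        for (int i = types.size() - 2; i >= 0; --i) {
            type = new Node(types.get(i).name, type);
        }
        return new Node(root.name, type);
    }

    // Mirrors the shape of the info column used in the draft test in this thread.
    static Node exampleSchema()
    {
        return new Node("info",
                new Node("orderkey"),
                new Node("linenumber"),
                new Node("shipdate", new Node("ship_day"), new Node("ship_month"), new Node("ship_year")));
    }

    public static void main(String[] args)
    {
        System.out.println(prune(exampleSchema(), Arrays.asList("shipdate", "ship_month")));
    }
}
```

With this model, pruning `info` to the path `shipdate.ship_month` leaves `info{shipdate{ship_month}}`, while pruning to `shipdate` alone keeps all three of its members.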

zhenxiao · 2019-09-11T08:09:34Z

Thank you, @oerling @mbasmanova.
I have addressed the comments and added a test case.
Could you please review?

mbasmanova

@zhenxiao Looks good to me. Thanks for implementing this optimization.

facebook-github-bot added the CLA Signed label Aug 22, 2019

mbasmanova self-assigned this Aug 22, 2019

mbasmanova requested a review from a team August 22, 2019 23:21

mbasmanova requested changes Aug 26, 2019

This was referenced Aug 27, 2019

Prune Nested Fields for Parquet Columns #5547

Closed

Nested predicate push down to Parquet Reader #7045

Closed

zhenxiao mentioned this pull request Aug 28, 2019

Refine Parquet schema mismatch message #12550

Merged

zhenxiao force-pushed the parquet-subfield-pruning branch from 3a61b8f to b5476e8 Compare September 11, 2019 05:37

Subfield pruning in Parquet

d9f9803

zhenxiao force-pushed the parquet-subfield-pruning branch from b5476e8 to d9f9803 Compare September 11, 2019 08:07

mbasmanova requested a review from a team September 11, 2019 08:36

mbasmanova approved these changes Sep 11, 2019

zhenxiao merged commit 35ce8ac into prestodb:master Sep 13, 2019

yingsu00 mentioned this pull request Oct 2, 2019

Release notes for 0.227 #13490

Closed

3 tasks

zhenxiao deleted the parquet-subfield-pruning branch March 3, 2020 22:15
