Subfield pruning in Parquet #13271

Merged 1 commit on Sep 13, 2019

@zhenxiao zhenxiao commented Aug 22, 2019

@mbasmanova @nezihyigitbasi

== RELEASE NOTES ==

Hive Changes
--------------
*  Add subfield pruning when reading Parquet files so that only the necessary subfields are extracted from struct columns.
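
For readers new to the term, here is a minimal sketch of what subfield pruning means for a struct column. It is not the code in this PR and does not use the Parquet API: the class `SubfieldPruningSketch`, the method `prune`, and the map-based schema model (field name mapped to either a leaf type name or a nested map) are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: models a struct schema as nested maps,
// not as Parquet MessageType/GroupType objects.
public class SubfieldPruningSketch
{
    // Returns a copy of the schema that keeps only the fields reachable
    // through the requested subfield paths (outermost name first).
    public static Map<String, Object> prune(Map<String, Object> schema, List<List<String>> requiredPaths)
    {
        Map<String, Object> pruned = new LinkedHashMap<>();
        for (List<String> path : requiredPaths) {
            if (!path.isEmpty()) {
                copyPath(schema, pruned, path, 0);
            }
        }
        return pruned;
    }

    @SuppressWarnings("unchecked")
    private static void copyPath(Map<String, Object> source, Map<String, Object> target, List<String> path, int depth)
    {
        String name = path.get(depth);
        Object field = source.get(name);
        if (field == null) {
            // Unknown subfield: nothing to keep.
            return;
        }
        if (depth == path.size() - 1 || !(field instanceof Map)) {
            // End of the path (or a leaf reached early): keep the field as is.
            target.put(name, field);
            return;
        }
        Object existing = target.get(name);
        if (existing == field) {
            // The whole nested struct was already requested by another path.
            return;
        }
        Map<String, Object> targetChild;
        if (existing instanceof Map) {
            // Another path already descended into this struct: merge into it.
            targetChild = (Map<String, Object>) existing;
        }
        else {
            targetChild = new LinkedHashMap<>();
            target.put(name, targetChild);
        }
        copyPath((Map<String, Object>) field, targetChild, path, depth + 1);
    }
}
```

With a struct `info` holding `orderkey`, `linenumber`, and a nested `shipdate`, requesting `info.orderkey` and `info.shipdate.ship_month` leaves `linenumber`, `ship_day`, and `ship_year` out of the pruned schema, which is the effect the release note describes.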

@mbasmanova mbasmanova self-assigned this Aug 22, 2019
@mbasmanova mbasmanova requested a review from a team August 22, 2019 23:21

oerling commented Aug 23, 2019 via email

@mbasmanova mbasmanova left a comment

@zhenxiao Looks nice. It is cool that subfield pruning can be achieved with so few lines of code. As Orri mentioned, it would be good to add some tests. I don't see integration tests for Parquet in the code base, though. I suggest starting by adding TestParquestDistributedQueries, which mimics TestHiveDistributedQueries, and then adding subfield pruning tests there.

Here is a draft:

public class TestParquestDistributedQueries
        extends AbstractTestDistributedQueries
{
    protected TestParquestDistributedQueries()
    {
        super(TestParquestDistributedQueries::createQueryRunner);
    }

    private static QueryRunner createQueryRunner()
            throws Exception
    {
        return HiveQueryRunner.createQueryRunner(getTables(),
                ImmutableMap.of("experimental.pushdown-subfields-enabled", "true"),
                "sql-standard",
                ImmutableMap.of("hive.storage-format", "PARQUET"),
                Optional.empty());
    }

    @Test
    public void testSubfieldPruning()
    {
        getQueryRunner().execute("CREATE TABLE test_subfield_pruning AS " +
                "SELECT orderkey, linenumber, shipdate, " +
                "   CAST(ROW(orderkey, linenumber, ROW(day(shipdate), month(shipdate), year(shipdate))) " +
                "       AS ROW(orderkey BIGINT, linenumber INTEGER, shipdate ROW(ship_day TINYINT, ship_month TINYINT, ship_year INTEGER))) AS info " +
                "FROM lineitem");

        try {
            assertQuery("SELECT info.orderkey, info.shipdate.ship_month FROM test_subfield_pruning", "SELECT orderkey, month(shipdate) FROM lineitem");

            assertQuery("SELECT orderkey FROM test_subfield_pruning WHERE info.shipdate.ship_month % 2 = 0", "SELECT orderkey FROM lineitem WHERE month(shipdate) % 2 = 0");
        }
        finally {
            getQueryRunner().execute("DROP TABLE test_subfield_pruning");
        }
    }

    @Override
    protected boolean supportsNotNullColumns()
    {
        return false;
    }

    @Override
    public void testDelete()
    {
        // Hive connector currently does not support row-by-row delete
    }

    @Test
    public void testExplainOfCreateTableAs()
    {
        String query = "CREATE TABLE copy_orders AS SELECT * FROM orders";
        MaterializedResult result = computeActual("EXPLAIN " + query);
        assertEquals(getOnlyElement(result.getOnlyColumnAsSet()), getExplainPlan(query, LOGICAL));
    }
}

}
}

List<org.apache.parquet.schema.Type> paths = typeBuilder.build();

I'd rename paths to types or subfieldTypes because this variable holds a list of types, not paths.

if (paths.isEmpty()) {
    return new MessageType(subfield.getRootName(), ImmutableList.of());
}
else {

  • drop else after return
  • inline result variable
        if (types.isEmpty()) {
            return new MessageType(subfield.getRootName(), ImmutableList.of());
        }

        org.apache.parquet.schema.Type type = types.get(types.size() - 1);
        for (int i = types.size() - 2; i >= 0; --i) {
            GroupType groupType = types.get(i).asGroupType();
            type = new MessageType(groupType.getName(), ImmutableList.of(type));
        }

        return new MessageType(subfield.getRootName(), ImmutableList.of(type));
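
To make the loop in the suggestion above easier to follow, here is a self-contained sketch of the same folding pattern: start from the innermost element and wrap it one level at a time, working outward. It deliberately avoids the Parquet classes; `NestedTypeSketch`, `Node`, and `nest` are hypothetical names, and `Node` is only a stand-in for a group type with a single child.

```java
import java.util.List;

// Hypothetical illustration of the innermost-first wrapping loop;
// Node is a stand-in for a Parquet group type, not a real Parquet class.
public class NestedTypeSketch
{
    public static final class Node
    {
        final String name;
        final Node child; // null for the innermost (leaf) element

        Node(String name, Node child)
        {
            this.name = name;
            this.child = child;
        }

        @Override
        public String toString()
        {
            return child == null ? name : name + "{" + child + "}";
        }
    }

    // names lists the path below the root, outermost first.
    public static Node nest(String rootName, List<String> names)
    {
        if (names.isEmpty()) {
            // Early return, so no else branch is needed below.
            return new Node(rootName, null);
        }

        // Start from the last (innermost) name and wrap outward.
        Node node = new Node(names.get(names.size() - 1), null);
        for (int i = names.size() - 2; i >= 0; --i) {
            node = new Node(names.get(i), node);
        }

        return new Node(rootName, node);
    }
}
```

For example, `NestedTypeSketch.nest("info", List.of("shipdate", "ship_month"))` builds `info` wrapping `shipdate` wrapping `ship_month`, mirroring how the suggested code wraps the last type in each enclosing group.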

@zhenxiao zhenxiao force-pushed the parquet-subfield-pruning branch from b5476e8 to d9f9803 Compare September 11, 2019 08:07
@zhenxiao

Thank you, @oerling @mbasmanova.
I have addressed the comments and added a test case.
Could you please review?

@mbasmanova mbasmanova requested a review from a team September 11, 2019 08:36

@mbasmanova mbasmanova left a comment

@zhenxiao Looks good to me. Thanks for implementing this optimization.

@zhenxiao zhenxiao merged commit 35ce8ac into prestodb:master Sep 13, 2019
@yingsu00 yingsu00 mentioned this pull request Oct 2, 2019
@zhenxiao zhenxiao deleted the parquet-subfield-pruning branch March 3, 2020 22:15