-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subfield pruning in Parquet #13271
Subfield pruning in Parquet #13271
Conversation
What about tests? We’d like to have a and a.b both be structs with s* as scalar members and then do select a, a.s1, a.b, a.b.s2 in different combinations. Also both a and a.b should have nulls.
If this works this seems good, at least this is very compact if the functionality is complete.
From: Maria Basmanova <notifications@github.com>
Sent: Thursday, August 22, 2019 4:21 PM
To: prestodb/presto <presto@noreply.github.com>
Cc: oerling <erling@xs4all.nl>; Review requested <review_requested@noreply.github.com>
Subject: Re: [prestodb/presto] Subfield pruning in Parquet (#13271)
@mbasmanova <https://github.com/mbasmanova> requested review from @prestodb/aria on: #13271 <#13271> Subfield pruning in Parquet.
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub <#13271?email_source=notifications&email_token=AKPPPTYKC2CGXH6K6KKPBKLQF4NOJA5CNFSM4IOZQHCKYY3PNVWWK3TUL52HS4DFWZEXG43VMVCXMZLOORHG65DJMZUWGYLUNFXW5KTDN5WW2ZLOORPWSZGOTGVOJSI#event-2578113737> , or mute the thread <https://github.com/notifications/unsubscribe-auth/AKPPPT37HOIXSG2D7QW555DQF4NOJANCNFSM4IOZQHCA> . <https://github.com/notifications/beacon/AKPPPT33YRO3J6EAPCDGQ2LQF4NOJA5CNFSM4IOZQHCKYY3PNVWWK3TUL52HS4DFWZEXG43VMVCXMZLOORHG65DJMZUWGYLUNFXW5KTDN5WW2ZLOORPWSZGOTGVOJSI.gif>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhenxiao Looks nice. It is cool that subfield pruning can be achieved with so few lines of code. Like Orri mentioned, it would be good to add some tests. I don't see integration tests for Parquet in the code base though. I suggest to start by adding TestParquestDistributedQueries
that mimics TestHiveDistributedQueries
, then add subfield pruning tests there.
Here is a draft:
public class TestParquestDistributedQueries
extends AbstractTestDistributedQueries
{
protected TestParquestDistributedQueries()
{
super(TestParquestDistributedQueries::createQueryRunner);
}
private static QueryRunner createQueryRunner()
throws Exception
{
return HiveQueryRunner.createQueryRunner(getTables(),
ImmutableMap.of("experimental.pushdown-subfields-enabled", "true"),
"sql-standard",
ImmutableMap.of("hive.storage-format", "PARQUET"),
Optional.empty());
}
@Test
public void testSubfieldPruning()
{
getQueryRunner().execute("CREATE TABLE test_subfield_pruning AS " +
"SELECT orderkey, linenumber, shipdate, " +
" CAST(ROW(orderkey, linenumber, ROW(day(shipdate), month(shipdate), year(shipdate))) " +
" AS ROW(orderkey BIGINT, linenumber INTEGER, shipdate ROW(ship_day TINYINT, ship_month TINYINT, ship_year INTEGER))) AS info " +
"FROM lineitem");
try {
assertQuery("SELECT info.orderkey, info.shipdate.ship_month FROM test_subfield_pruning", "SELECT orderkey, month(shipdate) FROM lineitem");
assertQuery("SELECT orderkey FROM test_subfield_pruning WHERE info.shipdate.ship_month % 2 = 0", "SELECT orderkey FROM lineitem WHERE month(shipdate) % 2 = 0");
}
finally {
getQueryRunner().execute("DROP TABLE test_subfield_pruning");
}
}
@Override
protected boolean supportsNotNullColumns()
{
return false;
}
@Override
public void testDelete()
{
// Hive connector currently does not support row-by-row delete
}
@Test
public void testExplainOfCreateTableAs()
{
String query = "CREATE TABLE copy_orders AS SELECT * FROM orders";
MaterializedResult result = computeActual("EXPLAIN " + query);
assertEquals(getOnlyElement(result.getOnlyColumnAsSet()), getExplainPlan(query, LOGICAL));
}
}
} | ||
} | ||
|
||
List<org.apache.parquet.schema.Type> paths = typeBuilder.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rename paths to types
or subfieldTypes
because this variable holds a list of types, not paths.
if (paths.isEmpty()) { | ||
return new MessageType(subfield.getRootName(), ImmutableList.of()); | ||
} | ||
else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- drop else after return
- inline
result
variable
if (types.isEmpty()) {
return new MessageType(subfield.getRootName(), ImmutableList.of());
}
org.apache.parquet.schema.Type type = types.get(types.size() - 1);
for (int i = types.size() - 2; i >= 0; --i) {
GroupType groupType = types.get(i).asGroupType();
type = new MessageType(groupType.getName(), ImmutableList.of(type));
}
return new MessageType(subfield.getRootName(), ImmutableList.of(type));
3a61b8f
to
b5476e8
Compare
b5476e8
to
d9f9803
Compare
thank you, @oerling @mbasmanova |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhenxiao Looks good to me. Thanks for implementing this optimization.
@mbasmanova @nezihyigitbasi