Add ANALYZE statement to collect column statistics on demand #11376
Conversation
Force-pushed: 19dff4f → 2f7dab0
Just skimmed.
@@ -400,6 +401,27 @@ protected Scope visitDelete(Delete node, Optional<Scope> scope)
        return createAndAssignScope(node, scope, Field.newUnqualified("rows", BIGINT));
    }

    @Override
    protected Scope visitAnalyze(Analyze node, Optional<Scope> scope)
are you going to update access controls also, so we could check if the user is privileged to analyze the table?

Will add it later.

Why do we need access control? If you can read the table, why can't you analyze it?

Because one may want to disable that for certain users.

ANALYZE changes the state of the table. It should require read+write access to the table. I am not sure if we need a separate "can analyze" privilege.
analyze("ANALYZE t1 WITH (p1 = 'p1')");

assertFails(DUPLICATE_PROPERTY, ".* Duplicate property: p1", "ANALYZE t1 WITH (p1 = 'p1', p2 = 'p2', p1 = 'p3')");
assertFails(DUPLICATE_PROPERTY, ".* Duplicate property: p1", "ANALYZE t1 WITH (p1 = 'p1', \"p1\" = 'p2')");
if you pass properties like p1=p1,p1=p1, would it pass?

Good question. I assume you mean p1='p1', p1='p1'. The query will fail and the analyzer will report the duplicate property p1. I think this behavior is expected.
public List<? extends Node> getChildren()
{
    return ImmutableList.<Node>builder()
            .addAll(properties).build();
- what about the table?
- move .build() to a separate line

Good catch. I missed that when changing from a QualifiedName member to a Table member.
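A minimal sketch of the corrected method, assuming the node now holds a Table field named table:

    public List<? extends Node> getChildren()
    {
        return ImmutableList.<Node>builder()
                .add(table)         // include the Table child that was previously missed
                .addAll(properties)
                .build();           // .build() moved to its own line, per the review
    }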
@@ -60,7 +60,8 @@
        @JsonSubTypes.Type(value = ExplainAnalyzeNode.class, name = "explainAnalyze"),
        @JsonSubTypes.Type(value = ApplyNode.class, name = "apply"),
        @JsonSubTypes.Type(value = AssignUniqueId.class, name = "assignUniqueId"),
        @JsonSubTypes.Type(value = LateralJoinNode.class, name = "lateralJoin"),
        @JsonSubTypes.Type(value = AnalyzeFinishNode.class, name = "analyzecommit")})
update name

The naming convention here is not consistent... I followed the one from TableFinishNode.class, but I don't think it's a good example (e.g. missing upper case and mismatched name).

Agreed, the existing ones are a mess. I would go with analyzeFinish, as that makes the name match and the camel case makes it slightly more readable.
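That is, the new entry would presumably become:

    @JsonSubTypes.Type(value = AnalyzeFinishNode.class, name = "analyzeFinish")})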
shouldn't this be …

@findepi I didn't use …

I didn't think about this. Did you check what it would take to remove this staticness constraint?
    return tableMetadata;
}

public ConnectorTableHandle getTableHandle()
I don't think we need this here. This analyze metadata object is created from a table handle, so the caller can hold onto it.

We need getAnalyzeTableMetadata() to return an updated table handle which has a dependency on the analyze properties (e.g. what range of the table to analyze). It will later be passed to getTableLayouts().
    return statisticsMetadata;
}

public ConnectorTableMetadata getTableMetadata()
Why do we need table metadata here? It seems more natural for the engine to ask for the table metadata separately, as normal.
Is there a use case for the metadata here to be different from the normal metadata (maybe dependent on the analyze properties)?

I used that mainly to save a getTableMetadata() call when planning the table scan for the analyze statement. The connector will need to call getTableMetadata() when generating TableStatisticsMetadata.
Alternatively, we can add ColumnMetadata to TableStatisticsMetadata to describe the properties of the columns that we are collecting statistics for. Then we can use that information when planning the table scan.
I don't see any use case where we would want to return a different TableMetadata for the analyze statement.

> I used that mainly to save a getTableMetadata() call when planning table scan for the analyze statement.

It should be cheap. When including the Metadata here you are doing all the expensive operations (like going to the metastore, and so on) anyway. You are using this metadata in a single place, so it really seems to be just a matter of Java virtual method invocation overhead. That is negligible, and the overhead is constant, so I wouldn't bother.

The idea is that, when generating the TableStatisticsMetadata, the connector has already done a getTableMetadata() which contains all the expensive operations. And in order to plan the table scan for ANALYZE, we need a list of column handles, which is not in TableStatisticsMetadata. So I included a TableMetadata object in the AnalyzeTableMetadata to reuse it and avoid one more call.
Like I said, an alternative is to include the column information in TableStatisticsMetadata to describe the properties of the columns we are collecting statistics for, so that we can later use it to plan the table scan for ANALYZE. Any thoughts?

I don't feel strongly about it. You can keep the metadata in place if you would like.
@@ -285,6 +286,30 @@ default TableStatisticsMetadata getStatisticsCollectionMetadata(ConnectorSession
        return TableStatisticsMetadata.empty();
    }

    /**
     * Get metadata for an ANALYZE query, describing what statistics to collect and how to collect them.
Nit: lowercase "analyze", since we are talking about logical functionality, not specific SQL syntax

Are we actually mentioning the SQL syntax here? It's similar to the documentation for beginQuery().

Based on @martint's comment below, something like "Get metadata for table analysis" seems more appropriate.
     */
    default ConnectorAnalyzeTableHandle beginAnalyze(ConnectorSession session, ConnectorTableHandle tableHandle)
    {
        throw new PrestoException(NOT_SUPPORTED, "This connector does not support analyze");
Should this be

    throw new PrestoException(GENERIC_INTERNAL_ERROR, "ConnectorMetadata getAnalyzeTableMetadata() is implemented without beginAnalyze()");
public ConnectorAnalyzeTableHandle beginAnalyze(ConnectorSession session, ConnectorTableHandle tableHandle)
{
    // do nothing
    return new ConnectorAnalyzeTableHandle() {};
I think this won't work because the handle class isn't returned from TpchHandleResolver (which is needed by the handle serialization system).
Add an ANALYZE query in TestTpchDistributedQueries.

Thanks. I added it for the TestLogicalPlanner and didn't think about the consequences of actual execution. Will fix it.
 */
package com.facebook.presto.spi;

public interface ConnectorAnalyzeTableHandle
Nit: write this as

    @SuppressWarnings("MarkerInterface")
    public interface ConnectorAnalyzeTableHandle {}
PlanNode child = node.getSource();
child.accept(this, context);

AnalyzeFinishNode.AnalyzeHandle materializedHandle =
Should this be analyzeHandle?
false,
value -> ImmutableList.copyOf(((Collection<?>) value).stream()
        .map(partition -> ImmutableList.copyOf(((Collection<?>) partition).stream()
                .map(name -> ((String) name).toLowerCase(ENGLISH)).collect(Collectors.toList())))

Use toImmutableList() in both places
List.class,
ImmutableList.of(),
false,
value -> ImmutableList.copyOf(((Collection<?>) value).stream()

Why do a copy of the collection before calling stream()? We don't do that in other places. Just casting should be sufficient.
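Putting both suggestions together, the lambda might look roughly like this (a sketch; toImmutableList() is Guava's ImmutableList.toImmutableList() collector):

    value -> ((Collection<?>) value).stream()
            .map(partition -> ((Collection<?>) partition).stream()
                    .map(name -> ((String) name).toLowerCase(ENGLISH))
                    .collect(toImmutableList()))
            .collect(toImmutableList())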
{
    Set<ColumnStatisticMetadata> columnStatistics = columns.stream()
            .filter(column -> !partitionedBy.contains(column.getName()))
            .filter(column -> !column.isHidden())
            .map(this::getColumnStatisticMetadata)
            .flatMap(List::stream)
            .collect(toImmutableSet());
    return new TableStatisticsMetadata(columnStatistics, ImmutableSet.of(), partitionedBy);

    if (!includeRowCount) {

I think this would be cleaner as

    Set<TableStatisticType> statTypes = includeRowCount ? ImmutableSet.of(ROW_COUNT) : ImmutableSet.of();
    return new TableStatisticsMetadata(columnStatistics, statTypes, partitionedBy);

This removes the duplication and makes it easy to see which part is variable.
        .collect(toList());

Iterable<HivePartition> partitionsIterable = () -> partitionValues.stream()
        .map(partitionValue -> {

It might be cleaner to extract a method for this lambda
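For example (a sketch; toHivePartition is a hypothetical name for the extracted method, whose body would be the original lambda moved verbatim):

    Iterable<HivePartition> partitionsIterable = () -> partitionValues.stream()
            .map(this::toHivePartition)
            .iterator();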
@@ -27,14 +28,27 @@
{
    private final String schemaName;
    private final String tableName;
    private final Optional<HiveAnalyzePropertiesHandle> analyzeHandle;
Calling this inner object a "handle" is a bit confusing since it's not a handle, it's just a bean. It would probably be cleaner to inline it here as analyzePartitions.

I made it a separate class because I thought we might add new analyze properties in the future, and it would be good if we could wrap all of them in a class. I can inline analyzePartitions since we don't have the need now.

It's fine to have a class if you or anyone else has specific ideas of what we would add. In that case, let's name it HiveAnalyzeProperties and analyzeProperties, as "handle" is confusing. Otherwise, inline it, since that makes it easier to read and we avoid adding stuff speculatively, as it often goes unused and just clutters the code.
}

@Override
public String toString()
{
    return schemaName + ":" + tableName;
    return schemaName + ":" + tableName + ":" + analyzeHandle;
This toString() is used when printing the query plan. We probably don't want to include the partition values.

I think we may want to print it.
If the user is analyzing the whole table, the partition values field here will be empty, so nothing would be printed.
If the user specifies a list of partitions to analyze, having them in the plan can help the user verify that we actually analyzed those partitions. The plan will be as long as the analyze query itself.
An alternative way is to print up to N partitions here.
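A sketch of the capped variant (the constant name and the shape of the partition values field are assumptions; joining is Collectors.joining):

    // hypothetical cap on how many partition values to include in the plan output
    private static final int MAX_PRINTED_PARTITIONS = 10;

    @Override
    public String toString()
    {
        // print at most MAX_PRINTED_PARTITIONS partition values so the plan stays readable
        String partitions = analyzePartitionValues.stream()
                .limit(MAX_PRINTED_PARTITIONS)
                .map(Object::toString)
                .collect(joining(", "));
        return schemaName + ":" + tableName + ":" + partitions;
    }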
public synchronized void overwritePartitionStatistics(Table table, Map<List<String>, PartitionStatistics> partitionStatisticsMap)
{
    setExclusive((delegate, hdfsEnvironment) -> {
        for (Map.Entry<List<String>, PartitionStatistics> entry : partitionStatisticsMap.entrySet()) {

Use forEach on Map as that lets you name the key/value
List<String> columnNames = table.getPartitionColumns().stream()
        .map(Column::getName)
        .collect(toImmutableList());
return makePartName(columnNames, partitionValues);
}

private String getPartitionName(String databaseName, String tableName, List<String> partitionValues)
Nit: move this method above, as you usually chain methods down rather than up (the root method is at the bottom of the file)
@@ -61,6 +61,7 @@
    HIVE_TABLE_NOT_READABLE(34, USER_ERROR),
    HIVE_TABLE_DROPPED_DURING_QUERY(35, EXTERNAL),
    // HIVE_TOO_MANY_BUCKET_SORT_FILES(36) is deprecated
    HIVE_PARTITION_DOES_NOT_EXIST(37, USER_ERROR),
HIVE_PARTITION_NOT_FOUND would be more consistent with other error codes
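That is, the entry would become:

    HIVE_PARTITION_NOT_FOUND(37, USER_ERROR),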
@findepi Today, when a connector registers properties (e.g. session properties, table properties, column properties), it registers both the name and the type of the property. If we are going to remove that static constraint, we will need to change those things. But I am more worried about the semantics than the implementation. Should we ever allow dynamic property types? What are the other use cases?
Neat. Thanks for doing this.
@@ -508,6 +530,7 @@ protected Scope visitRenameSchema(RenameSchema node, Optional<Scope> scope)
    protected Scope visitCreateTable(CreateTable node, Optional<Scope> scope)
    {
        validateProperties(node.getProperties(), scope);
supernit: extra change
public class AnalyzeTableMetadata
{
    private final TableStatisticsMetadata statisticsMetadata;
    private final TableMetadata tableMetadata;

Why do we need both here? TableHandle and TableMetadata? We can get TableMetadata by calling getTableMetadata(..., tableHandle).

See above.
Collection<ComputedStatistics> computedStatistics = computedStatisticsBuilder.build();
analyzeFinisher.finishAnalyze(computedStatistics);
return new Page(0);
Did we decide to go with an empty page? Maybe we should follow the convention for other DDLs and return a single boolean column "result" with a value of true?

I am changing it to return null. But I am not sure if we should add an OutputNode to the ANALYZE query plan in the first place.
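For reference, the single-boolean-column convention mentioned above could be produced with something like this (a sketch built on the SPI's BOOLEAN type and BlockBuilder):

    // one row, one boolean column "result" with value true
    BlockBuilder builder = BOOLEAN.createBlockBuilder(null, 1);
    BOOLEAN.writeBoolean(builder, true);
    return new Page(builder.build());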
analysis.getParameters());

AnalyzeTableMetadata analyzeTableMetadata = metadata.getAnalyzeTableMetadata(session, targetTable, analyzeProperties);
// Replace the target table with the one from analyze table metadata

Add a comment explaining why we need this. Say that the TableHandle returned by getAnalyzeTableMetadata may contain some additional information required for ANALYZE.
        .map(partitionValue -> getPartitionNamesByParts(metastore, tableName, partitionValue))
        .flatMap(partitionNames -> partitionNames.stream()
                .map(partitionName -> parseValuesAndFilterPartition(tableName, partitionName, partitionColumns, partitionTypes, alwaysTrue())))
        .filter(Optional::isPresent)

Don't filter, but verify. It should never return empty, since you are passing an alwaysTrue() domain.
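One way to turn the filter into a verification, replacing both .filter(Optional::isPresent) and the following .map(Optional::get) (a sketch; VerifyException is Guava's com.google.common.base.VerifyException):

    .map(partition -> partition.orElseThrow(
            () -> new VerifyException("partition must be present when an alwaysTrue() domain is passed")))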
        .map(Optional::get)
        .iterator();

return new HivePartitionResult(partitionColumns, partitionsIterable, all(), all(), all(), hiveBucketHandle, Optional.empty());

all, all, none
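That is, presumably:

    return new HivePartitionResult(partitionColumns, partitionsIterable, all(), all(), none(), hiveBucketHandle, Optional.empty());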
        table.getDatabaseName(),
        table.getTableName(),
        getPartitionName(table, partitionValues),
        statistics -> partitionStatistics)));

Preserve basic statistics here
@@ -293,6 +293,22 @@ public synchronized void renameDatabase(String source, String target)
    setExclusive((delegate, hdfsEnvironment) -> delegate.renameDatabase(source, target));
}

public synchronized void overwriteTableStatistics(Table table, PartitionStatistics tableStatistics)
{
    setExclusive((delegate, hdfsEnvironment) -> delegate.updateTableStatistics(table.getDatabaseName(), table.getTableName(), statistics -> tableStatistics));

preserve basic statistics here
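If PartitionStatistics exposes its basic and column statistics separately (an assumption about the shape of that class), the update function could keep the existing basic statistics while overwriting only the column statistics:

    // keep the existing basic statistics (row count, file count, etc.), replace the column statistics
    statistics -> new PartitionStatistics(statistics.getBasicStatistics(), tableStatistics.getColumnStatistics())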
@@ -2776,6 +2776,210 @@ public void testCollectColumnStatisticsOnInsert()
    assertUpdate(format("DROP TABLE %s", tableName));
}

@Test
@jessesleeping Let me know if you need any help with figuring out product tests.

@jessesleeping @arhimondr I am concerned about removing analyze-on-hive from product tests. We test with different Hive versions, and they can set different table/partition properties during ANALYZE (now, or in some future versions, like Hive 3). We should have tests ensuring proper interop between Hive's ANALYZE and Presto's reading of statistics.
However, as we have our own ANALYZE now, we could try the contrary -- but I don't know if Hive has a useful SHOW STATS (or equivalent).

> I am concerned about removing analyze-on-hive from product tests

I was so looking forward to it. Running MapReduce jobs to analyze tables in Hive used to slow down the overall test run time a lot =) I don't feel strongly about it, but it seems a little bit extra to test Hive's analyze.

We are not testing Hive's analyze. We are testing that Presto can handle table properties such as those set by Hive's analyze.
What about defining some 3 tables (unanalyzed, analyzed, analyzed with columns) directly in the image and analyzing them during image build time?
Of course, having test data set up in the image is not elastic (e.g. when we need to cover a new data type), but maybe the saved time pays off overall.
Or, maybe, we can speed up analyze in Hive by using a different execution framework?
While adding a product test for …, the test still passed successfully. Is this a known bug?
@jessesleeping The test asserts what Hive is expected to produce as NDV for these columns. As with any estimate, this can deviate from the actual values. In Hive, I observed the NDV estimate deviating from the actual number of distinct values by as much as 100% for small tables.
Force-pushed: b15f3e9 → caa95cc
Fixed: …
Pending fixes: …
Pending discussions: …
Force-pushed: caa95cc → e3c2731
Force-pushed: e3c2731 → b88b714
Updated the PR with fixes. Pending commit: …
Force-pushed: e638d7c → ad13914
Add error message on column count mismatch when making partition name
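A sketch of such a check, using Guava's checkArgument (message wording and variable names assumed):

    checkArgument(columnNames.size() == partitionValues.size(),
            "Partition value count does not match partition column count for table '%s.%s': expected %s, got %s",
            databaseName, tableName, columnNames.size(), partitionValues.size());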
LGTM
Fix a bug where creating an empty partition using the CALL statement throws exceptions when using the file-based metastore implementation. Extracted-From: prestodb/presto#11376
Force-pushed: 013dc7f → ffa1b60
Force-pushed: ffa1b60 → b7079d3
Added the following SQL statement:

ANALYZE qualifiedName (WITH properties)?

Users can trigger column statistics collection on a table by calling this statement. The connector decides what statistics to collect and how to store the result. The connector can also support certain WITH properties to customize the statistics collection. If the WITH property is not specified, the default behavior is to collect default statistics for the whole table.

The third commit implements the SPI in the Hive connector. It supports an analyze property named partitions, which is expected to be of type ARRAY[ARRAY[VARCHAR]]. The value of this property is a list of partitions, where each partition is represented by a list of partition column values as varchars.
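For illustration, an invocation against a hypothetical partitioned table (the table name and partition values are made up) could look like:

    ANALYZE hive.web.page_views WITH (partitions = ARRAY[ARRAY['2018-08-01', 'US'], ARRAY['2018-08-02', 'US']]);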