[SPARK-34055][SQL] Refresh cache in ALTER TABLE .. ADD PARTITION
#31101
Conversation
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #133868 has finished for PR 31101 at commit
@dongjoon-hyun @sunchao @imback82 @cloud-fan May I ask you to review this PR?
Sure, I'll do it today, @MaxGekk
```scala
checkAnswer(sql("SELECT * FROM t"), Seq(Row(0, 0)))

// Create new partition (part = 1) in the filesystem
val information = sql("SHOW TABLE EXTENDED LIKE 't' PARTITION (part = 0)")
```
nit: it may be worth pulling this into a util method, since it seems useful in multiple places, e.g., the other PR on "ALTER TABLE .. RECOVER PARTITIONS".
Sure, as soon as this is needed in one more place, we will move it to the common trait.
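For illustration, such a util method could parse the `information` column returned by `SHOW TABLE EXTENDED .. PARTITION (..)` to extract the partition location. This is a hypothetical sketch (the function name and error handling are illustrative, not the PR's actual code):

```scala
// Hypothetical helper: pull the "Location:" line out of the `information`
// column produced by SHOW TABLE EXTENDED LIKE 't' PARTITION (part = 0).
def partitionLocation(information: String): Option[String] =
  information
    .split("\\r?\\n")
    .find(_.startsWith("Location:"))
    .map(_.stripPrefix("Location:").trim)
```

A test could then assert on the returned location directly instead of re-parsing the `information` string in every suite.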
```diff
@@ -485,6 +485,7 @@ case class AlterTableAddPartitionCommand(
       catalog.createPartitions(table.identifier, batch, ignoreIfExists = ifNotExists)
     }

+    sparkSession.catalog.refreshTable(table.identifier.quotedString)
```
Just curious — does it matter whether we refresh the cache before or after the stats are updated?
In this particular case it doesn't matter, because the table size is re-calculated by getting file statuses directly from the filesystem, so the cached data is not used when updating table stats.
I think we should review other places, though.
Just in case, I will update the test and check that the table size is updated after adding the partition.
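To illustrate why the cache is irrelevant here: the table size can be computed from file statuses alone. A plain-Scala sketch of the idea (hypothetical — Spark's actual stats code goes through the Hadoop `FileSystem` API, not `java.nio`):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical sketch: sum the sizes of all regular files under a
// table/partition directory, i.e. derive the size from filesystem
// file statuses rather than from any cached data.
def directorySize(dir: Path): Long = {
  val stream = Files.walk(dir)
  try stream.iterator().asScala
    .filter(Files.isRegularFile(_))
    .map(Files.size)
    .sum
  finally stream.close()
}
```

Since nothing in this computation touches the cached relation, refreshing the cache before or after the stats update yields the same table size.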
> Just in case, I will update the test and check that the table size is updated after adding the partition.

Let me add such a check later, independently of this PR. I think I have found one more issue, related to looking up HiveTableRelation in the cache of the Cache Manager. It seems the cleaning of table stats done in #30995 is not enough.
+1 for @MaxGekk's decision.
Yup, SGTM :)
Here is the bug fix: #31112. Updating the table stats triggers the bug.
> Let me add such a check later, independently of this PR.

I added the check for partition adding as well. Please review this PR: #31131
...e/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableAddPartitionSuite.scala
Makes sense. BTW @MaxGekk, how much work is left to fix the other places? I wonder if I should wait for the fixes before starting another RC.
Merged to master. @MaxGekk, can you make a backporting PR for branch-3.1? It has a conflict.
…alls to Hive external catalog in partition adding

### What changes were proposed in this pull request?

Increase the number of calls to Hive external catalog in the test for `ALTER TABLE .. ADD PARTITION`.

### Why are the changes needed?

There is a logical conflict between #31101 and #31092. The first one fixes a caching issue and increases the number of calls to Hive external catalog.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

By running the modified test:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31111 from MaxGekk/add-partition-refresh-cache-2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. ADD PARTITION`. This fixes the issues portrayed by the example:

```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0	0
spark-sql> show table extended like 'tbl' partition(part=0);
default	tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
...
```

Create new partition by copying the existing one:

```
$ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
```

```sql
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
```

The last query must return `0 1` since it has been added by `ALTER TABLE .. ADD PARTITION`.

Yes. After the changes for the example above:

```sql
...
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
0	1
```

By running the affected test suite:

```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes apache#31101 from MaxGekk/add-partition-refresh-cache-2.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit e0e06c1)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@HyukjinKwon Here are the backports:
BTW, while backporting this PR to 3.1/3.0, I realised that the test from this PR runs twice against the v1 In-Memory catalog. The PR #31117 runs the test for the v1 Hive external catalog too. Just in case, I could make a similar fix in v2.
…Hive table

### What changes were proposed in this pull request?

Replace `USING parquet` by `$defaultUsing`, which is `USING parquet` for v1 In-Memory catalog and `USING hive` for v1 Hive external catalog.

### Why are the changes needed?

The PR #31101 added a UT but it checks only v1 In-Memory catalog. This PR runs the test for Hive external catalog as well to improve test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31117 from MaxGekk/add-partition-refresh-cache-2-followup-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
I moved my JIRA tickets to the umbrella: https://issues.apache.org/jira/browse/SPARK-33507
### What changes were proposed in this pull request?

Invoke `refreshTable()` from `CatalogImpl`, which refreshes the cache, in v1 `ALTER TABLE .. ADD PARTITION`.

### Why are the changes needed?

This fixes the issues portrayed by the example:

```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0	0
spark-sql> show table extended like 'tbl' partition(part=0);
default	tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
...
```

Create new partition by copying the existing one:

```
$ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
```

```sql
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
```

The last query must return `0 1` since it has been added by `ALTER TABLE .. ADD PARTITION`.

### Does this PR introduce _any_ user-facing change?

Yes. After the changes for the example above:

```sql
...
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
0	1
```

### How was this patch tested?

By running the affected test suite:

```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```