[SPARK-34055][SQL] Refresh cache in `ALTER TABLE .. ADD PARTITION` #31101

MaxGekk · 2021-01-09T08:01:42Z

What changes were proposed in this pull request?

Invoke `refreshTable()` from `CatalogImpl` in the v1 `ALTER TABLE .. ADD PARTITION` command so that the table cache is refreshed after partitions are added.

Why are the changes needed?

This fixes the issue shown in the following example:

spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0	0
spark-sql> show table extended like 'tbl' partition(part=0);
default	tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
...

Create a new partition by copying the existing one:

$ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1

spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0

The last query must also return the row `0 1`, since the partition `part=1` has been added by `ALTER TABLE .. ADD PARTITION`.
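The stale result can be illustrated with a toy model. The following is a minimal sketch in plain Scala, not Spark code: `ToyCatalog`, `cacheTable`, `addPartition`, and the `refresh` flag are hypothetical names introduced only for illustration, and the cache is assumed to be a simple snapshot taken at caching time, which simplifies what Spark actually does.

```scala
import scala.collection.mutable

// Toy model (hypothetical, not Spark's classes): a partitioned table with an
// optional cached snapshot of its rows.
final class ToyCatalog {
  // partition value -> values of `col` stored in that partition
  private val partitions = mutable.LinkedHashMap.empty[Int, Seq[Int]]
  // Snapshot of (col, part) rows, present only while the table is cached.
  private var cached: Option[Seq[(Int, Int)]] = None

  // Read all rows from the current set of partitions.
  private def scan(): Seq[(Int, Int)] =
    partitions.toSeq.flatMap { case (part, cols) => cols.map(c => (c, part)) }

  def cacheTable(): Unit = { cached = Some(scan()) }

  // Rebuild the snapshot, but only if the table is currently cached.
  def refreshTable(): Unit = { if (cached.isDefined) cached = Some(scan()) }

  // Queries are served from the cache when one exists.
  def select(): Seq[(Int, Int)] = cached.getOrElse(scan())

  // `refresh = false` models the behaviour before the fix, `true` after it.
  def addPartition(part: Int, cols: Seq[Int], refresh: Boolean): Unit = {
    partitions(part) = cols
    if (refresh) refreshTable()
  }
}

object ToyCatalogDemo {
  def main(args: Array[String]): Unit = {
    val stale = new ToyCatalog
    stale.addPartition(0, Seq(0), refresh = false)
    stale.cacheTable()
    stale.addPartition(1, Seq(0), refresh = false)
    // Without a refresh, the cached snapshot hides the new partition.
    assert(stale.select() == Seq((0, 0)))

    val fixed = new ToyCatalog
    fixed.addPartition(0, Seq(0), refresh = false)
    fixed.cacheTable()
    fixed.addPartition(1, Seq(0), refresh = true)
    // With the refresh, both rows are visible.
    assert(fixed.select() == Seq((0, 0), (0, 1)))
  }
}
```

In this model, the change in this PR corresponds to always passing `refresh = true` when a partition is added.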

Does this PR introduce any user-facing change?

Yes. After the changes, the last query in the example above returns both rows:

...
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
0	1

How was this patch tested?

By running the affected test suite:

$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"

SparkQA · 2021-01-09T09:00:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38457/

SparkQA · 2021-01-09T09:04:36Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38457/

SparkQA · 2021-01-09T13:12:40Z

Test build #133868 has finished for PR 31101 at commit 10a0b96.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2021-01-09T17:51:30Z

@dongjoon-hyun @sunchao @imback82 @cloud-fan May I ask you to review this PR?

dongjoon-hyun · 2021-01-09T17:53:04Z

Sure, I'll do it today, @MaxGekk.

sunchao · 2021-01-09T18:05:26Z

...e/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableAddPartitionSuite.scala

+      checkAnswer(sql("SELECT * FROM t"), Seq(Row(0, 0)))
+
+      // Create new partition (part = 1) in the filesystem
+      val information = sql("SHOW TABLE EXTENDED LIKE 't' PARTITION (part = 0)")


nit: it may be worth pulling this into a util method since it seems useful in multiple places, e.g., the other PR on "ALTER TABLE .. RECOVER PARTITIONS".

Sure, as soon as this can be used in one more place, we will move this to the common trait.

sunchao · 2021-01-09T18:11:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala

@@ -485,6 +485,7 @@ case class AlterTableAddPartitionCommand(
      catalog.createPartitions(table.identifier, batch, ignoreIfExists = ifNotExists)
    }

+    sparkSession.catalog.refreshTable(table.identifier.quotedString)


Just curious, does it matter whether we refresh the cache before or after the stats are updated?

In this particular case, it doesn't matter because the table size is re-calculated by getting file statuses directly from the filesystem. So, the cached data is not used when updating the table stats.
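The point about the table size can be sketched as follows. This is only an illustration under stated assumptions, not Spark's implementation: it uses `java.nio.file` rather than the Hadoop `FileSystem` API, and `totalSize` and `TableSizeSketch` are hypothetical names. It shows that a size computed by walking the table directory reflects a newly added partition directory regardless of whether any cached data has been refreshed.

```scala
import java.nio.file.{Files, Path}

object TableSizeSketch {
  // Sum the sizes of all regular files under `root` by asking the filesystem
  // directly; no cached table data is consulted.
  def totalSize(root: Path): Long = {
    val stream = Files.walk(root)
    try {
      stream
        .filter((p: Path) => Files.isRegularFile(p))
        .mapToLong((p: Path) => Files.size(p))
        .sum()
    } finally {
      stream.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical table layout: one directory per partition.
    val table = Files.createTempDirectory("tbl")
    val part0 = Files.createDirectory(table.resolve("part=0"))
    Files.write(part0.resolve("data"), Array[Byte](1, 2, 3))
    val before = totalSize(table)

    // Adding a partition directory changes the computed size immediately.
    val part1 = Files.createDirectory(table.resolve("part=1"))
    Files.write(part1.resolve("data"), Array[Byte](4, 5, 6))
    val after = totalSize(table)

    assert(before == 3L && after == 6L)
  }
}
```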

I think we should review other places.

Just in case, I will update the test and check that the table size is updated after adding the partition.

Just in case, I will update the test and check that the table size is updated after adding the partition.

Let me add such a check later, independently of this PR. I think I have found one more issue, related to looking up `HiveTableRelation` in the Cache Manager's cache.

It seems that clearing the table stats, as done in #30995, is not enough.

+1 for @MaxGekk 's decision.

Yup, SGTM :)

Here is the bug fix: #31112. Updating the table stats triggers the bug.

Let me add such checking later independently from this PR.

I added the check for adding partitions as well. Please review this PR: #31131

HyukjinKwon

Looks reasonable. BTW @MaxGekk, how much work is left to fix the other places? I wonder if I should wait for the fixes before starting another RC.

HyukjinKwon · 2021-01-10T05:07:07Z

Merged to master.

@MaxGekk can you make a backporting PR for branch-3.1? It has a conflict.

MaxGekk · 2021-01-10T07:03:46Z

Unfortunately, tests on master started failing after this PR due to a logical conflict between this PR and #31092. The follow-up PR #31111 fixes the issue.

…alls to Hive external catalog in partition adding

### What changes were proposed in this pull request?
Increase the number of calls to Hive external catalog in the test for `ALTER TABLE .. ADD PARTITION`.

### Why are the changes needed?
There is a logical conflict between #31101 and #31092. The first one fixes a caching issue and increases the number of calls to Hive external catalog.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31111 from MaxGekk/add-partition-refresh-cache-2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. ADD PARTITION`.

This fixes the issues portrayed by the example:
```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0	0
spark-sql> show table extended like 'tbl' partition(part=0);
default	tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
...
```
Create new partition by copying the existing one:
```
$ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
```
```sql
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
```

The last query must return `0 1` since it has been added by `ALTER TABLE .. ADD PARTITION`.

Yes. After the changes for the example above:
```sql
...
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
0	1
```

By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes apache#31101 from MaxGekk/add-partition-refresh-cache-2.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit e0e06c1)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

MaxGekk · 2021-01-10T17:25:37Z

can you make a backporting PR for branch-3.1? It has a conflict.

@HyukjinKwon Here are the backports:

branch-3.0: [SPARK-34055][SQL][3.0] Refresh cache in ALTER TABLE .. ADD PARTITION #31116
branch-3.1: [SPARK-34055][SQL][3.1] Refresh cache in ALTER TABLE .. ADD PARTITION #31115

BTW, while backporting this PR to 3.1/3.0, I realised that the test from this PR runs twice against the v1 In-Memory catalog. PR #31117 runs the test against the v1 Hive external catalog too.

Just in case, I could make a similar fix in v2 ALTER TABLE .. ADD PARTITION, but I cannot write a test for it because the existing V2 In-Memory catalog keeps partition data in memory, and partition locations are not taken into account. So, I cannot add a new partition with data via the SQL API. Is it OK if I make such a fix without any test? cc @cloud-fan

HyukjinKwon · 2021-01-11T00:41:12Z

@MaxGekk and @sunchao, can we have an umbrella ticket or epic ticket (if an umbrella ticket is not possible) to group these caching / uncaching issues? Let's block the RC until we are sure these issues are fixed.

…Hive table

### What changes were proposed in this pull request?
Replace `USING parquet` by `$defaultUsing` which is `USING parquet` for v1 In-Memory catalog and `USING hive` for v1 Hive external catalog.

### Why are the changes needed?
The PR #31101 added UT test but it checks only v1 In-Memory catalog. This PR runs this test for Hive external catalog as well to improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31117 from MaxGekk/add-partition-refresh-cache-2-followup-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

MaxGekk · 2021-01-11T13:33:49Z

can we have an umbrella ticket or epic ticket (if an umbrella ticket is not possible) to group these caching / uncaching issues?

I moved my JIRA tickets to the umbrella: https://issues.apache.org/jira/browse/SPARK-33507

MaxGekk added 2 commits January 9, 2021 10:54

Add a test

e43996f

Call refreshTable in AlterTableAddPartitionCommand

10a0b96

github-actions bot added the SQL label Jan 9, 2021

sunchao reviewed Jan 9, 2021

dongjoon-hyun reviewed Jan 9, 2021

HyukjinKwon approved these changes Jan 10, 2021

Update sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableAddPartitionSuite.scala

8aa432c

HyukjinKwon closed this in e0e06c1 Jan 10, 2021

MaxGekk mentioned this pull request Jan 10, 2021

[SPARK-34055][SQL][TESTS][FOLLOWUP] Increase the expected number of calls to Hive external catalog in partition adding #31111

Closed

MaxGekk mentioned this pull request Jan 10, 2021

[SPARK-34055][SQL][TESTS][FOLLOWUP] Check partition adding to cached Hive table #31117

Closed
