Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34055][SQL] Refresh cache in ALTER TABLE .. ADD PARTITION #31101

Closed
wants to merge 3 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Jan 9, 2021

What changes were proposed in this pull request?

Invoke refreshTable() from CatalogImpl which refreshes the cache in v1 ALTER TABLE .. ADD PARTITION.

Why are the changes needed?

This fixes the issues portrayed by the example:

spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0	0
spark-sql> show table extended like 'tbl' partition(part=0);
default	tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
...

Create new partition by copying the existing one:

$ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0

The last query must return 0 1 since it has been added by ALTER TABLE .. ADD PARTITION.

Does this PR introduce any user-facing change?

Yes. After the changes for the example above:

...
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
0	1

How was this patch tested?

By running the affected test suite:

$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"

@github-actions github-actions bot added the SQL label Jan 9, 2021
@SparkQA
Copy link

SparkQA commented Jan 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38457/

@SparkQA
Copy link

SparkQA commented Jan 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38457/

@SparkQA
Copy link

SparkQA commented Jan 9, 2021

Test build #133868 has finished for PR 31101 at commit 10a0b96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 9, 2021

@dongjoon-hyun @sunchao @imback82 @cloud-fan May I ask you to review this PR.

@dongjoon-hyun
Copy link
Member

Sure, I'll do today, @MaxGekk

checkAnswer(sql("SELECT * FROM t"), Seq(Row(0, 0)))

// Create new partition (part = 1) in the filesystem
val information = sql("SHOW TABLE EXTENDED LIKE 't' PARTITION (part = 0)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it may worth to pull this into an util method since it's useful in multiple places seems, e.g., the other PR on "ALTER TABLE .. RECOVER PARTITIONS"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, as soon as this can be used in one more place, we will move this to the common trait.

@@ -485,6 +485,7 @@ case class AlterTableAddPartitionCommand(
catalog.createPartitions(table.identifier, batch, ignoreIfExists = ifNotExists)
}

sparkSession.catalog.refreshTable(table.identifier.quotedString)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, does it matter whether we refresh cache before or after the stats are updated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case, it doesn't matter because table size is re-calculated by getting file statuses directly from the filesystem. So, the cached data is not used in updating table stats.

I think we should review other places.

Just in case, I will update the test and check that table size is updated after adding of the partition.

Copy link
Member Author

@MaxGekk MaxGekk Jan 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case, I will update the test and check that table size is updated after adding of the partition.

Let me add such checking later independently from this PR. I think I have found one more issue relating to looking for HiveTableRelation in the cache of Cache Manager.

It seems cleaning the table stats made in #30995 is not enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @MaxGekk 's decision.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, SGTM :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the bug fix: #31112 . Updating of table stats triggers the bug.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add such checking later independently from this PR.

I added the check for partition adding as well. Please, review this PR: #31131

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks making sense. BTW @MaxGekk, how much works left to fix other places? I wonder if I should wait for the fixes before starting another RC.

@HyukjinKwon
Copy link
Member

Merged to master.

@MaxGekk can you make a backporting PR for branch-3.1? It has a conflict.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 10, 2021

Unfortunately, tests on master start failing after this PR due to logical conflict between this PR and #31092. This PR fixes the issue #31111

HyukjinKwon pushed a commit that referenced this pull request Jan 10, 2021
…alls to Hive external catalog in partition adding

### What changes were proposed in this pull request?
Increase the number of calls to Hive external catalog in the test for `ALTER TABLE .. ADD PARTITION`.

### Why are the changes needed?
There is a logical conflict between #31101 and #31092. The first one fixes a caching issue and increases the number of calls to Hive external catalog.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31111 from MaxGekk/add-partition-refresh-cache-2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Jan 10, 2021
Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. ADD PARTITION`.

This fixes the issues portrayed by the example:
```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0	0
spark-sql> show table extended like 'tbl' partition(part=0);
default	tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
...
```
Create new partition by copying the existing one:
```
$ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
```
```sql
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
```

The last query must return `0	1` since it has been added by `ALTER TABLE .. ADD PARTITION`.

Yes. After the changes for the example above:
```sql
...
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
0	1
```

By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes apache#31101 from MaxGekk/add-partition-refresh-cache-2.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit e0e06c1)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Jan 10, 2021
Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. ADD PARTITION`.

This fixes the issues portrayed by the example:
```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0	0
spark-sql> show table extended like 'tbl' partition(part=0);
default	tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
...
```
Create new partition by copying the existing one:
```
$ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
```
```sql
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
```

The last query must return `0	1` since it has been added by `ALTER TABLE .. ADD PARTITION`.

Yes. After the changes for the example above:
```sql
...
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
0	1
```

By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes apache#31101 from MaxGekk/add-partition-refresh-cache-2.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit e0e06c1)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 10, 2021

can you make a backporting PR for branch-3.1? It has a conflict.

@HyukjinKwon Here are the backports:

BTW, while backporting this PR to 3.1/3.0, I have realised that the test from this PR runs twice v1 In-Memory catalog. The PR #31117 runs the test for v1 Hive External catalog too.

Just in case, I could make similar fix in v2 ALTER TABLE .. ADD PARTITION but I cannot write a test for that because existing V2 In-Memory catalog keeps partition data in memory, and partition locations are not taken into account. So, I cannot add new partition with data via SQL API. Is it ok if I make such fix without any test? cc @cloud-fan

@HyukjinKwon
Copy link
Member

@MaxGekk and @sunchao, can we have an umbrella ticket or epic ticket (if an umbrella ticket is not possible) to group these caching / uncaching issues? Let's block RC until we feel sure these issues are fixed then.

cloud-fan pushed a commit that referenced this pull request Jan 11, 2021
…Hive table

### What changes were proposed in this pull request?
Replace `USING parquet` by `$defaultUsing` which is `USING parquet` for v1 In-Memory catalog and `USING hive` for v1 Hive external catalog.

### Why are the changes needed?
The PR #31101 added UT test but it checks only v1 In-Memory catalog. This PR runs this test for Hive external catalog as well to improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31117 from MaxGekk/add-partition-refresh-cache-2-followup-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 11, 2021

can we have an umbrella ticket or epic ticket (if an umbrella ticket is not possible) to group these caching / uncaching issues?

I moved my JIRA tickets to the umbrella: https://issues.apache.org/jira/browse/SPARK-33507

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants