[SPARK-48759][SQL] Add migration doc for CREATE TABLE AS SELECT behavior change behavior change since Spark 3.4 #47152

asl3 · 2024-06-30T14:41:33Z

What changes were proposed in this pull request?

Add a migration guide entry for the CREATE TABLE AS SELECT ... behavior change.

SPARK-41859 changes the behaviour for CREATE TABLE AS SELECT ... from OVERWRITE to APPEND when spark.sql.legacy.allowNonEmptyLocationInCTAS is set to true:

drop table if exists test_table;
create table test_table location '/tmp/test_table' stored as parquet as select 1 as col union all select 2 as col;
drop table if exists test_table;
create table test_table location '/tmp/test_table' stored as parquet as select 3 as col union all select 4 as col;
select * from test_table;

This produces {3, 4} in Spark versions before 3.4.0 and {1, 2, 3, 4} in Spark 3.4.0 and later. The change in spark.sql.legacy.allowNonEmptyLocationInCTAS behaviour is silent, so it can introduce wrong results in user applications.
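To make the difference concrete, here is a toy model of the two behaviours in plain Python. It is only a sketch, not Spark code: the table location is modelled as a list of data files, the function name `run_ctas_twice` and the `mode` values are made up for illustration, and it assumes (as the repro implies) that the files under the location survive `drop table`.

```python
def run_ctas_twice(mode):
    """Toy model of the repro: a table location is a list of data files."""
    location = []  # stands in for '/tmp/test_table'; assumed to survive DROP TABLE

    def ctas(rows):
        if mode == "overwrite":
            location.clear()  # modelled pre-3.4 behaviour: old files are replaced
        location.append(list(rows))  # a new data file is written to the location

    ctas([1, 2])
    # DROP TABLE: in this model only the catalog entry is removed, files remain.
    ctas([3, 4])
    # SELECT * reads every file found under the location.
    return sorted(row for data_file in location for row in data_file)


print(run_ctas_twice("overwrite"))  # [3, 4]
print(run_ctas_twice("append"))     # [1, 2, 3, 4]
```

The second call sees the leftover file from the first CTAS only in the append model, which is where the extra rows come from.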

Why are the changes needed?

This documents a behavior change starting in Spark 3.4 for CREATE TABLE AS SELECT.

Does this PR introduce any user-facing change?

No

How was this patch tested?

doc build

Was this patch authored or co-authored using generative AI tooling?

No.

asl3 · 2024-06-30T14:42:13Z

cc @cloud-fan

cloud-fan · 2024-07-01T08:10:09Z

docs/sql-migration-guide.md

@@ -97,6 +97,7 @@ license: |
  - Since Spark 3.4, `BinaryType` is not supported in CSV datasource. In Spark 3.3 or earlier, users can write binary columns in CSV datasource, but the output content in CSV files is `Object.toString()` which is meaningless; meanwhile, if users read CSV tables with binary columns, Spark will throw an `Unsupported type: binary` exception.
  - Since Spark 3.4, bloom filter joins are enabled by default. To restore the legacy behavior, set `spark.sql.optimizer.runtime.bloomFilter.enabled` to `false`.
  - Since Spark 3.4, when schema inference on external Parquet files, INT64 timestamps with annotation `isAdjustedToUTC=false` will be inferred as TimestampNTZ type instead of Timestamp type. To restore the legacy behavior, set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`.
+  - Since Spark 3.4, the behaviour for `CREATE TABLE AS SELECT ...` is changed from OVERWRITE to APPEND when `spark.sql.legacy.allowNonEmptyLocationInCTAS` is set to `true`. To restore the legacy behavior, set `spark.sql.legacy.allowNonEmptyLocationInCTAS` to `false`.


There is no way to restore the old behavior... I think we should ask users to move away from the legacy behavior that allows non-empty table location for CTAS.
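One way to act on this suggestion is to check that the target location is empty before running a CTAS, rather than relying on the legacy flag. The helper below is a minimal sketch with a hypothetical name (`location_is_empty`); it only handles local filesystem paths, so a location on HDFS or object storage would need the corresponding filesystem client instead.

```python
import os


def location_is_empty(path):
    """Return True if `path` is missing or is a directory with no entries."""
    if not os.path.isdir(path):
        # A missing path counts as empty; an existing plain file does not.
        return not os.path.exists(path)
    with os.scandir(path) as entries:
        return next(entries, None) is None


# Hypothetical pre-flight check before issuing the CTAS statement.
target = "/tmp/test_table"
if not location_is_empty(target):
    print(f"{target} already contains files; clean it up or pick another location")
```

Refusing to run when the check fails keeps the result independent of the Spark version.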

cloud-fan · 2024-07-02T08:09:51Z

thanks, merging to master!

cloud-fan · 2024-07-02T08:19:48Z

@asl3 can you help to create a PR against branch 3.5?

…ior change behavior change since Spark 3.4 (branch-3.5)

### What changes were proposed in this pull request?

This PR is a follow-up to #47152 against `branch-3.5`.

Add migration guide for `CREATE TABLE AS SELECT...` behavior change.

SPARK-41859 changes the behaviour for `CREATE TABLE AS SELECT ...` from OVERWRITE to APPEND when `spark.sql.legacy.allowNonEmptyLocationInCTAS` is set to `true`:

```
drop table if exists test_table;
create table test_table location '/tmp/test_table' stored as parquet as select 1 as col union all select 2 as col;
drop table if exists test_table;
create table test_table location '/tmp/test_table' stored as parquet as select 3 as col union all select 4 as col;
select * from test_table;
```

This produces {3, 4} in Spark <3.4.0 and {1, 2, 3, 4} in Spark 3.4.0 and later. This is a silent change in `spark.sql.legacy.allowNonEmptyLocationInCTAS` behaviour which introduces wrong results in the user application.

### Why are the changes needed?

This documents a behavior change starting in Spark 3.4 for `CREATE TABLE AS SELECT`

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`doc build`

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47178 from asl3/allowNonEmptyLocationInCTAS-3.5.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ef4e456)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…ior change behavior change since Spark 3.4

### What changes were proposed in this pull request?

Add migration guide for `CREATE TABLE AS SELECT...` behavior change.

SPARK-41859 changes the behaviour for `CREATE TABLE AS SELECT ...` from OVERWRITE to APPEND when `spark.sql.legacy.allowNonEmptyLocationInCTAS` is set to `true`:

```
drop table if exists test_table;
create table test_table location '/tmp/test_table' stored as parquet as select 1 as col union all select 2 as col;
drop table if exists test_table;
create table test_table location '/tmp/test_table' stored as parquet as select 3 as col union all select 4 as col;
select * from test_table;
```

This produces {3, 4} in Spark <3.4.0 and {1, 2, 3, 4} in Spark 3.4.0 and later. This is a silent change in `spark.sql.legacy.allowNonEmptyLocationInCTAS` behaviour which introduces wrong results in the user application.

### Why are the changes needed?

This documents a behavior change starting in Spark 3.4 for `CREATE TABLE AS SELECT`

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`doc build`

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47152 from asl3/allowNonEmptyLocationInCTAS.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

add create table behavior change to SQL migration docs

91f1882

github-actions bot added the DOCS label Jun 30, 2024

cloud-fan reviewed Jul 1, 2024

ask users to avoid CTAS with nonempty table location

f023b1e

asl3 requested a review from cloud-fan July 1, 2024 15:09

cloud-fan approved these changes Jul 2, 2024

cloud-fan closed this in 8a5f4e0 Jul 2, 2024

asl3 mentioned this pull request Jul 2, 2024

[SPARK-48759][SQL] Add migration doc for CREATE TABLE AS SELECT behavior change behavior change since Spark 3.4 (branch-3.5) #47178

Closed
