
[SPARK-32270][SQL] Use TextFileFormat in CSV's schema inference with a different encoding #29063

Closed
wants to merge 1 commit into master from HyukjinKwon/SPARK-32270

Conversation

HyukjinKwon
Member

@HyukjinKwon commented Jul 10, 2020

What changes were proposed in this pull request?

This PR proposes to use the text datasource in CSV's schema inference. This shares the same reasoning as SPARK-18362, SPARK-19885 and SPARK-19918 - we're currently using a Hadoop RDD when the encoding is different, which is unnecessary. This PR completes SPARK-18362 and addresses the comment at #15813 (comment).
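
For illustration, the change means the inference path reads raw lines through the text datasource rather than building a Hadoop RDD by hand. A minimal sketch of the idea (not the actual Spark internals; the helper name is illustrative):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Minimal sketch (not Spark's actual internals): read the raw CSV lines
// through the text datasource, then hand them to schema inference. Because
// this goes through TextFileFormat, SQL options and session configurations
// apply, unlike with a hand-built Hadoop RDD.
def readLinesForInference(spark: SparkSession, path: String): Dataset[String] = {
  import spark.implicits._
  spark.read
    .option("lineSep", "\n") // the real code derives the separator from the CSV encoding
    .text(path)
    .as[String]
}
```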

We should also keep the code paths consistent with the existing CSV and JSON datasources; CSV schema inference with an encoding different from UTF-8 is currently the only path that diverges.

This PR might also amount to a bug fix: Spark session configurations, such as Hadoop configurations, are not respected during CSV schema inference when a different encoding is specified; they have to be set on the Spark context instead.
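
For reference, the affected path is exercised when schema inference runs with a non-default encoding, e.g. (a usage sketch; the path and charset are illustrative):

```scala
// Usage sketch (path and charset illustrative): with a non-UTF-8 encoding,
// schema inference previously went through a hand-built Hadoop RDD; after this
// change it goes through the text datasource, so session-level configurations
// are respected.
val df = spark.read
  .option("encoding", "ISO-8859-1")
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("/path/to/latin1.csv")
df.printSchema()
```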

Why are the changes needed?

For consistency, potentially better performance, and fixing a potential corner-case bug.

Does this PR introduce any user-facing change?

Virtually no.

How was this patch tested?

Existing tests should cover this.

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125594 has started for PR 29063 at commit 3250f33.

@HyukjinKwon changed the title [SPARK-32270][SQL] Use text file sources in CSV's schema inference in different encoding as well → [SPARK-32270][SQL] Use TextFileFormat in CSV's schema inference with a different encoding Jul 10, 2020
@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jul 11, 2020

Test build #125658 has finished for PR 29063 at commit 3250f33.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

Member

@srowen left a comment

PS: does this suggest a change to the way spark-xml reads things? Separate question, of course.

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 11, 2020

Ah, it does. We should ideally change spark-xml to use a SQL source instead of RDD APIs during schema inference, but I think that would be difficult: the record separator is a newline for CSV and JSON, whereas spark-xml depends on a custom Hadoop input format ...

In most cases it wouldn't be a big deal, so I guess it's fine not to change it at this moment unless a major issue is found.
I think we could solve this issue once we migrate spark-xml from DSv1 to DSv2 .. but I guess that's a bit far in the future ..
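
To illustrate the contrast (a hedged sketch; paths are illustrative):

```scala
import spark.implicits._

// CSV and JSON records end at a newline, so the text datasource can read one
// record per row of a Dataset[String]:
val csvRecords = spark.read.text("/data/people.csv").as[String]

// An XML record such as <person>...</person> typically spans several lines, so
// a line-based read would split records apart; spark-xml instead relies on a
// custom Hadoop input format that scans for the configured start/end tags.
```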

@SparkQA

SparkQA commented Jul 11, 2020

Test build #125683 has finished for PR 29063 at commit 3250f33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun left a comment

+1, LGTM. Thank you, @HyukjinKwon. Yes, the SQL code path is better.
Merged to master.

@HyukjinKwon
Member Author

Thanks @srowen and @dongjoon-hyun

@HyukjinKwon deleted the SPARK-32270 branch July 27, 2020 07:43
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…a different encoding

Closes apache#29063 from HyukjinKwon/SPARK-32270.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
HyukjinKwon added a commit that referenced this pull request Jul 16, 2024
…aframe read / write API

### What changes were proposed in this pull request?

This PR is a retry of #47328, which replaces the RDD with a Dataset for writing SparkR metadata; in addition, this PR removes `repartition(1)`. We don't actually need it when the input is a single row, as that creates only a single partition:

https://github.com/apache/spark/blob/e5e751b98f9ef5b8640079c07a9a342ef471d75d/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala#L49-L57

### Why are the changes needed?

In order to leverage the Catalyst optimizer and SQL engine. For example, we now leverage UTF-8 encoding instead of plain JDK ser/de for strings. We have made similar changes in the past, e.g., #29063, #15813, #17255 and SPARK-19918.

Also, we remove `repartition(1)` to avoid an unnecessary shuffle.

With `repartition(1)`:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=6]
   +- LocalTableScan [_1#0]
```

Without `repartition(1)`:

```
== Physical Plan ==
LocalTableScan [_1#2]
```
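
The plans above can be reproduced with something like the following (a sketch for spark-shell; the single-row DataFrame stands in for the SparkR metadata):

```scala
import spark.implicits._

// A single-row input yields a one-partition LocalTableScan, so repartition(1)
// only adds an Exchange without changing the result.
val df = Seq("sparkr-metadata").toDF("_1")
df.explain()                 // LocalTableScan [_1#...]
df.repartition(1).explain()  // AdaptiveSparkPlan with Exchange SinglePartition
```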

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI in this PR should verify the change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47341 from HyukjinKwon/SPARK-48883-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024