[SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text datasource on schema inferring #21292

Closed
MaxGekk wants to merge 5 commits into apache:branch-2.3 from MaxGekk:text-options-backport-v2.3

Conversation

@MaxGekk (Member) commented May 10, 2018

What changes were proposed in this pull request?

While reading CSV or JSON files, DataFrameReader's options are converted to Hadoop parameters, for example here:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L302

but the options are not propagated to the Text datasource during schema inference, for instance:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L184-L188

This PR propagates the user's options to the Text datasource during schema inference, in the same way that the options are converted to Hadoop parameters when a schema is specified.
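
To sketch the idea (a minimal illustration, not the actual Spark internals; the helper `hadoopConfWithOptions` is a hypothetical name): the reader options are overlaid on the Hadoop configuration that the Text datasource uses while reading files for schema inference, so Hadoop-level settings such as `io.compression.codecs` take effect during inference as well.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper illustrating the propagation: copy the session's Hadoop
// configuration and overlay every DataFrameReader option, so settings such as
// "io.compression.codecs" are visible when the text files are read for inference.
def hadoopConfWithOptions(base: Configuration, options: Map[String, String]): Configuration = {
  val conf = new Configuration(base) // copy, so the shared configuration is not mutated
  options.foreach { case (key, value) => conf.set(key, value) }
  conf
}
```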

How was this patch tested?

The changes were tested manually by using https://github.com/twitter/hadoop-lzo:

```
hadoop-lzo> mvn clean package
hadoop-lzo> ln -s ./target/hadoop-lzo-0.4.21-SNAPSHOT.jar ./hadoop-lzo.jar
```

Create two test files, in CSV and JSON format, and compress them:

```shell
$ cat test.csv
col1|col2
a|1
$ lzop test.csv
$ cat test.json
{"col1":"a","col2":1}
$ lzop test.json
```

Run `spark-shell` with hadoop-lzo:

```
bin/spark-shell --jars ~/hadoop-lzo/hadoop-lzo.jar
```

Reading compressed CSV and JSON without a schema:

spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("inferSchema",true).option("header",true).option("sep","|").csv("test.csv.lzo").show()
+----+----+
|col1|col2|
+----+----+
|   a|   1|
+----+----+
spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("multiLine", true).json("test.json.lzo").printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: long (nullable = true)
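
For contrast, a hedged sketch of the case that already worked before this change: when the schema is supplied explicitly, no inference pass over the text files is needed and the reader options are converted to Hadoop parameters on the regular read path, so the codec option is honored. The schema below simply mirrors `test.csv.lzo` from above.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Explicit schema matching test.csv.lzo, so no schema inference is performed.
val csvSchema = new StructType()
  .add("col1", StringType)
  .add("col2", IntegerType)

spark.read
  .schema(csvSchema)
  .option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
  .option("header", true)
  .option("sep", "|")
  .csv("test.csv.lzo")
  .show()
```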

@HyukjinKwon (Member)

add to whitelist

@gatorsmile (Member)

LGTM

@SparkQA commented May 10, 2018

Test build #90458 has finished for PR 21292 at commit f6ab928.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to branch-2.3.

asfgit pushed a commit that referenced this pull request May 10, 2018
[SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text datasource on schema inferring

Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>

Closes #21292 from MaxGekk/text-options-backport-v2.3.
@HyukjinKwon (Member)

@MaxGekk, BTW, it is not automatically closed when a backport PR is merged into another branch. Mind manually closing this, please?

@MaxGekk MaxGekk closed this May 10, 2018
@MaxGekk MaxGekk deleted the text-options-backport-v2.3 branch August 17, 2019 13:34