
[SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFileFormat #21389

Closed
wants to merge 12 commits

Conversation

maropu
Member

@maropu maropu commented May 22, 2018

What changes were proposed in this pull request?

This PR adds code to verify a schema in Json/Orc/ParquetFileFormat, in line with CSVFileFormat.

How was this patch tested?

Added verification tests in FileBasedDataSourceSuite and HiveOrcSourceSuite.

@SparkQA

SparkQA commented May 22, 2018

Test build #90934 has finished for PR 21389 at commit 0d88bcb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2018

Test build #90937 has finished for PR 21389 at commit 00208be.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented May 22, 2018

retest this please

@SparkQA

SparkQA commented May 22, 2018

Test build #90944 has finished for PR 21389 at commit 00208be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Verify if the schema is supported in JSON datasource.
*/
def verifySchema(schema: StructType): Unit = {
Member

The function verifySchema is very similar to the ones in Orc/Parquet, except for the exception message. Should we put it into a util object?

Member

Hmm... but wouldn't the supported types be very specific to each data source?

Member Author

Since supported types are specific to data sources, I think we need to verify a schema in each file format implementation. But, yes, these built-in formats (ORC and Parquet) have the same supported types, so it might be better to move the verifySchema code somewhere shared (e.g., DataSourceUtils) to avoid code duplication.
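A minimal sketch of the shared helper being discussed (the object and method names follow the thread; the exact set of accepted types and the exception message are assumptions, not the merged code):

```scala
import org.apache.spark.sql.types._

// Sketch only: which types each format supports is an assumption here.
object DataSourceUtils {

  /** Throws if `schema` contains a type that `format` does not support. */
  def verifySchema(format: String, schema: StructType): Unit = {
    def verifyType(dataType: DataType): Unit = dataType match {
      case _: AtomicType =>                              // Int, String, Decimal, ...: OK
      case st: StructType => st.fields.foreach(f => verifyType(f.dataType))
      case ArrayType(elementType, _) => verifyType(elementType)
      case MapType(keyType, valueType, _) =>
        verifyType(keyType)
        verifyType(valueType)
      // Resolve UDTs to their underlying SQL type, as discussed in this thread.
      case udt: UserDefinedType[_] => verifyType(udt.sqlType)
      case _ =>
        throw new UnsupportedOperationException(
          s"$format data source does not support ${dataType.simpleString} data type.")
    }
    schema.fields.foreach(field => verifyType(field.dataType))
  }
}
```

Each file format would then delegate to this single helper instead of duplicating the match.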

@maropu
Member Author

maropu commented May 22, 2018

cc: @dongjoon-hyun

@@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
Row(badJson))
}

test("SPARK-24204 error handling for unsupported data types") {
Member

Thank you for pinging me, @maropu .
Since this is all about file-based data sources, can we have all these test cases in FileBasedDataSourceSuite?

Member Author

(Probably this suggestion is related to the one above.) Essentially, the supported types are specific to data source implementations, so I'm not 100% sure that it's best to put these tests in FileBasedDataSourceSuite.

Member

The suite doesn't assume that all file-based data sources have the same capabilities. In this PR, the test code is almost the same and the only differences are the mapping tables. For example,

  • json -> Interval
  • orc -> Interval, Null
  • parquet -> Interval, Null
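The shared test shape being suggested could look roughly like this (a sketch against Spark's test utilities such as `withTempPath`; helper names and messages are assumptions):

```scala
// Sketch: one test body, parameterized by the per-format mapping listed above.
Seq(
  "json" -> Seq(CalendarIntervalType),
  "orc" -> Seq(CalendarIntervalType, NullType),
  "parquet" -> Seq(CalendarIntervalType, NullType)
).foreach { case (format, unsupportedTypes) =>
  unsupportedTypes.foreach { dt =>
    test(s"SPARK-24204 $format - unsupported type: ${dt.simpleString}") {
      withTempPath { dir =>
        val schema = StructType(StructField("a", dt, nullable = true) :: Nil)
        val e = intercept[UnsupportedOperationException] {
          spark.read.schema(schema).format(format).load(dir.getCanonicalPath)
        }
        assert(e.getMessage.contains("does not support"))
      }
    }
  }
}
```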

Member Author

@maropu maropu May 22, 2018

ok, I'll try to brush up the tests.

@gengliangwang
Member

retest this please.

@SparkQA

SparkQA commented May 23, 2018

Test build #91010 has finished for PR 21389 at commit c7fe24f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -89,6 +89,8 @@ class OrcFileFormat
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory = {
DataSourceUtils.verifySchema("ORC", dataSchema)
Member

Thank you for refactoring the PR, @maropu! What about using shortName instead of the string literal "ORC" here? Then we can have the same line everywhere, like the following.

DataSourceUtils.verifySchema(shortName, dataSchema)
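With `shortName`, every built-in format's `prepareWrite` can carry an identical call; a sketch (signatures abbreviated from the diff above, format-specific body elided with `???`):

```scala
class OrcFileFormat extends FileFormat with DataSourceRegister {

  override def shortName(): String = "orc"

  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = {
    // Same line in OrcFileFormat, ParquetFileFormat, JsonFileFormat, ...
    DataSourceUtils.verifySchema(shortName(), dataSchema)
    ??? // format-specific writer setup elided
  }
}
```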

Member Author

yea, that's smart. I will update.

@SparkQA

SparkQA commented May 23, 2018

Test build #91011 has finished for PR 21389 at commit e4ebac2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 23, 2018

Test build #91016 has finished for PR 21389 at commit e4ebac2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented May 23, 2018

retest this please

@SparkQA

SparkQA commented May 23, 2018

Test build #91035 has finished for PR 21389 at commit e4ebac2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented May 23, 2018

retest this please

@SparkQA

SparkQA commented May 23, 2018

Test build #91048 has finished for PR 21389 at commit e4ebac2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented May 23, 2018

retest this please

@SparkQA

SparkQA commented May 23, 2018

Test build #91059 has finished for PR 21389 at commit e4ebac2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


case udt: UserDefinedType[_] => verifyType(udt.sqlType)

// For backward-compatibility
Member

Do we have any test case for this?

Member Author

ok, I will.
Also, should we merge this function with CSVUtils.verifySchema in this PR?

Member

Yes, as long as it does not break anything.

case NullType if format == "JSON" =>

case _ =>
throw new UnsupportedOperationException(
Member

Basically, for such a PR, we need to check all the data types that we block and ensure no behavior change is introduced by this PR.

Member Author

ok

@SparkQA

SparkQA commented May 28, 2018

Test build #91208 has finished for PR 21389 at commit 04f4028.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jun 14, 2018

retest this please

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91806 has finished for PR 21389 at commit 04f4028.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

object DataSourceUtils {

/**
* Verify if the schema is supported in datasource.
Member

Please improve the description and document which built-in file formats are covered by this function. Also, document which data types are not supported for each data source.

Member Author

ok

@gatorsmile
Member

@maropu Just want to double check: were all these data types already unsupported before this PR? Have you run these test cases without the code changes? After this PR, are the error messages more readable and raised earlier?

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91895 has finished for PR 21389 at commit 92d3553.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jun 15, 2018

retest this please

@SparkQA

SparkQA commented Jun 27, 2018

Test build #92359 has finished for PR 21389 at commit a3c400d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 27, 2018

Test build #92363 has finished for PR 21389 at commit c306953.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 27, 2018

Test build #92368 has finished for PR 21389 at commit 1cfc7b0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jun 27, 2018

retest this please

@SparkQA

SparkQA commented Jun 27, 2018

Test build #92382 has finished for PR 21389 at commit 1cfc7b0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@gatorsmile
Member

LGTM

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jun 27, 2018

Test build #92390 has finished for PR 21389 at commit 1cfc7b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 893ea22 Jun 27, 2018
dongjoon-hyun pushed a commit that referenced this pull request Jan 27, 2019
…ld be consistent

## What changes were proposed in this pull request?

1. Remove parameter `isReadPath`. The supported types of read/write should be the same.

2. Disallow reading `NullType` for ORC data source. In #21667 and #21389, it was supposed that ORC supports reading `NullType`, but can't write it. This doesn't make sense. I read docs and did some tests. ORC doesn't support `NullType`.

## How was this patch tested?

Unit test

Closes #23639 from gengliangwang/supportDataType.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ld be consistent
