
[SPARK-26376][SQL] Skip inputs without tokens by JSON datasource #23325

Closed
wants to merge 3 commits into from

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Dec 15, 2018

What changes were proposed in this pull request?

Added a new internal flag for JacksonParser - skipInputWithoutTokens - to control the parser's behavior when its input does not contain any valid JSON tokens. The flag is set to true for the JSON datasource, preserving the behavior the datasource had in Spark 2.4 and earlier. The flag is set to false for JSON functions like from_json.

The flag affects only the handling of bad JSON records without valid JSON tokens at the root level:

  • JSON datasource skips such records in the PERMISSIVE, FAILFAST and DROPMALFORMED modes. JSON datasource does not support ArrayType and MapType as the root type. For StructType, bad records are skipped and excluded from results.
  • from_json:
    • in the FAILFAST mode, throws SparkException for all supported root types - StructType, ArrayType and MapType
    • in the PERMISSIVE mode, returns a Row with nulls for all fields if the specified root type is StructType (the bad record string is placed in the corrupt column if one is specified), and null for ArrayType and MapType.
    • the DROPMALFORMED mode is not supported.

Summary: the PR changes the behavior of the JSON datasource in the PERMISSIVE and FAILFAST modes by skipping bad records without valid JSON tokens.
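The flag's effect can be sketched with a toy Python model (the function and parameter names below are invented for illustration and this is not Spark's actual JacksonParser; it only mimics the behavior described above):

```python
import json

def parse_record(text, skip_input_without_tokens, mode="PERMISSIVE"):
    # Toy model of the proposed handling of inputs with no JSON tokens.
    if text.strip() == "":
        # No valid JSON token at the root level: a "bad" record.
        if skip_input_without_tokens:
            return []            # JSON datasource: the record is skipped
        if mode == "FAILFAST":
            raise ValueError(f"Malformed record: {text!r}")
        return [None]            # from_json, PERMISSIVE: a null result
    return [json.loads(text)]

# Datasource-like behavior (flag on): tokenless input yields no rows.
print(parse_record("   ", skip_input_without_tokens=True))   # []
# from_json-like behavior (flag off, PERMISSIVE): a null result instead.
print(parse_record("   ", skip_input_without_tokens=False))  # [None]
```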

How was this patch tested?

It was tested by JsonSuite.

@MaxGekk
Member Author

MaxGekk commented Dec 15, 2018

@cloud-fan Please, review the PR.

@SparkQA

SparkQA commented Dec 15, 2018

Test build #100193 has finished for PR 23325 at commit 32ec9ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dummySchema,
dummyOption,
allowArrayAsStructs = true,
skipInputWithoutTokens = true)
Member


Shall we have test coverage for both true and false? In the case of false, we should not skip the rows, right?

Contributor


I guess it's handled by from_json tests. If not, let's add a new UT here.

Member Author


Both cases are covered already, for example:

Member


Thank you for clarifying that, @MaxGekk and @cloud-fan .

@@ -17,7 +17,7 @@ displayTitle: Spark SQL Upgrading Guide

- Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`.

- In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings and JSON datasource skips the same independetly of its mode if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
- In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings without valid root JSON tokens (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
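The migration-guide example can be simulated with plain Python (a rough stand-in for `from_json` with the schema `key STRING, value INT`; the helper name and field list are made up for illustration):

```python
import json

SCHEMA_FIELDS = ["key", "value"]  # stands in for "key STRING, value INT"

def from_json_permissive(text):
    # Since Spark 3.0, a tokenless input is a bad record: in PERMISSIVE
    # mode it becomes a row of nulls rather than a bare null.
    try:
        obj = json.loads(text)
    except ValueError:
        obj = {}
    return tuple(obj.get(f) for f in SCHEMA_FIELDS)

print(from_json_permissive(" "))                         # (None, None)
print(from_json_permissive('{"key": "a", "value": 1}'))  # ('a', 1)
```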
Member


Skipping seems to be unclear here. Could you elaborate the difference?

Member Author


Do you want to compare JSON datasource and json functions in this note?

@cloud-fan
Contributor

+1 for fixing this behavior change, thanks!

@HyukjinKwon
Member

How about we just revert 38628dd?

@cloud-fan
Contributor

revert it and reopen a PR with the original commit and this fix? I'm fine with it, but I'm not sure if it's a clean revert. IIRC there are multiple JSON related PRs merged recently.

@HyukjinKwon
Member

HyukjinKwon commented Dec 16, 2018

Hm, I wasn't able to follow why we keep the behaviour in JSON datasource but not in from_json. I was wondering if we better revert it and make a fix again (considering 38628dd already caused a couple of problems that had to be fixed).

@cloud-fan
Contributor

IIUC #22938 tried to change the behavior of from_json intentionally, but changed the behavior of json data source unexpectedly.

This PR is to fix this problem.

@HyukjinKwon
Member

HyukjinKwon commented Dec 16, 2018

Yea, correct. One of the behaviour changes was,

After the change, an empty string (no JSON token) produces a row of nulls (Row(null, null ...)) in both JSON datasource and JSON functions.

Before the change:

  1. JSON datasource previously ignored this (no row) instead of Row(null, null ...).
  2. JSON functions previously produced null instead of Row(null, null ...).

However, we're reverting one case of both now.
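The before/after behavior above can be written down as a small lookup (toy Python; here `[]` encodes a skipped row, `None` encodes SQL null, and a tuple of `None`s encodes `Row(null, ...)` - these encodings are ours, not Spark's):

```python
def parse_empty_input(component, after_change, n_fields=2):
    # What a tokenless input ("" or "  ") yields, per the summary above.
    if after_change:
        return (None,) * n_fields   # both now produce Row(null, null, ...)
    if component == "datasource":
        return []                   # before: the record was ignored (no row)
    return None                     # before: from_json returned null

print(parse_empty_input("datasource", after_change=False))  # []
print(parse_empty_input("from_json", after_change=False))   # None
print(parse_empty_input("from_json", after_change=True))    # (None, None)
```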

@cloud-fan
Contributor

These 2 cases can't be consistent. from_json can't skip input. If it returns null, it's still not consistent with the json data source, which returns Nil.

It's arguable if json data source should follow from_json and treat empty token as invalid input. But I think it's safer to not introduce behavior change if we are not sure one way is better than another.

while for from_json, I do think treating empty token as invalid is better than returning null.

@HyukjinKwon
Member

Hm, the point of view was different. Yea, now I see they can't be consistent. To me, I didn't stand against #22938 because the changes at least looked coherent and consistent at that time.

while for from_json, I do think treating empty token as invalid is better than returning null.

I think this one is also kind of arguable as well ..

Thing is, we currently return null for array types or map types. In that way, we should also produce empty array or empty map as @MaxGekk said at #22938 (comment).

If this is the change alone at #22938, I actually would have been hesitant about going ahead ..

@cloud-fan
Contributor

ah this is a good point. I think PERMISSIVE mode doesn't make sense for array/map as we can't have a special column to put the original token.

Now we have several things to consider together to decide the behavior:

  1. it's from_json or json data source
  2. the result is row or array/map (when it's from_json)
  3. the parse mode
  4. the token is valid or not

@MaxGekk can you describe the behavior you proposed?

@MaxGekk
Member Author

MaxGekk commented Dec 18, 2018

can you describe the behavior you proposed?

@cloud-fan I updated PR's description.

@cloud-fan
Contributor

in the PERMISSIVE mode return a Row with nulls for all fields if specified root type is StructType (bad record string is placed to the corrupt column if it is specified), and null for ArrayType and MapType.

I think this is the most arguable part. The current behavior doesn't look reasonable. I can think of 2 options:

  1. always return null if the token is empty, no matter whether it's a row, array, or map.
  2. never return null if the token is empty. For struct type, return a row with all null fields. For array/map, return an empty array/map.

@HyukjinKwon @MaxGekk any preference?
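The two options can be contrasted with a toy Python sketch (all names invented; `(None, None)` stands for a row with all null fields, and the two-field struct is an arbitrary example):

```python
def parse_empty(root_type, option):
    # Toy comparison of the two proposed behaviors for an empty token.
    if option == 1:
        return None                 # option 1: always null
    # Option 2: never null.
    if root_type == "struct":
        return (None, None)         # a row with all null fields
    return [] if root_type == "array" else {}

print(parse_empty("struct", option=1))  # None
print(parse_empty("array", option=2))   # []
print(parse_empty("map", option=2))     # {}
```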

@MaxGekk
Member Author

MaxGekk commented Dec 19, 2018

always return null if the token is empty, no matter it's row or array or map.

In general, returning nulls for malformed rows is fine. The problem is "embedding" the corrupt column into the user's schema if the root type is StructType, while ignoring the corrupt column for other types. I think extending the user's schema by an additional column wasn't the right design decision. I would wrap the user's root type in a struct with 2 columns - a parsed column of the user-specified type (struct, array, or map) and a column of string type with the unparsed text. If the input wasn't parsed, just put null in the first column, independently of its type.

never return null if the token is empty. For struct type, return a row with all null fields. For array/map, return empty array/map.

This looks like a more consistent approach across supported types, but even it raises some questions:

  • how do we distinguish a row of all nulls, or an empty array/map present in the input, from unparsed input?
  • performance penalty. Probably it could be avoided.
  • if the schema is not flat, let's say `a array<...>, b map<...>`, why do we return a row with nulls instead of a row with an empty array and an empty map?
  • the corrupt column feature is still not supported for arrays and maps

If I need to choose between the 2 approaches above, I would probably prefer the first one. And I would re-implement the columnNameOfCorruptRecord feature to support other types.

@MaxGekk
Member Author

MaxGekk commented Dec 22, 2018

@cloud-fan Can we move forward with the particular changes in the PR?

@cloud-fan
Contributor

It seems also reasonable to accept @HyukjinKwon 's proposal: just revert 38628dd

@MaxGekk
Member Author

MaxGekk commented Dec 24, 2018

It seems also reasonable to accept @HyukjinKwon 's proposal: just revert 38628dd

OK. Let me try to revert the commit locally. I guess it won't revert smoothly; if it doesn't, I can create a PR for that.

@MaxGekk
Member Author

MaxGekk commented Dec 24, 2018

Let me try to revert the commit locally.

@cloud-fan The revert causes conflicts in the migration guide. Let me know if you need a PR.

@cloud-fan
Contributor

@HyukjinKwon what do you think?

@HyukjinKwon
Member

Yea, reverting sounds good to me .. if you guys don't strongly feel about a specific way.

@MaxGekk
Member Author

MaxGekk commented Dec 28, 2018

Let's imagine a situation where a user uses from_json to parse a column of JSON records, and for some of the records the JSON parser wasn't able to detect any JSON tokens. It could be an empty string, a string with a few spaces, or something else. Would it be useful to find out which particular strings the parser cannot parse? If we return such strings in the corrupt column, it could help the user with troubleshooting. Just returning nulls won't give the user any opportunity to debug or to put such strings into a separate dataset. I think it makes sense to keep the current behavior of from_json and revert only the behavior of the JSON datasource, as the PR proposes.

@cloud-fan
Contributor

if it's only for troubleshooting, I guess users can do if(IsNull(FromJson(col)), col, FromJson(col)).

My major concern is that PERMISSIVE mode doesn't work well in from_json, because we can return a map/array. Maybe another choice is: if from_json returns an array/map, we forbid PERMISSIVE mode.
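The troubleshooting trick mentioned above - keeping the raw string whenever parsing yields null - can be sketched in plain Python (a toy stand-in, not Spark's API; note that this toy conflates a literal JSON `null` with a parse failure, which is exactly the ambiguity the expression has in Spark too):

```python
import json

def from_json(text):
    # Toy from_json: None when the input can't be parsed (PERMISSIVE-like).
    try:
        return json.loads(text)
    except ValueError:
        return None

def parse_or_keep_raw(col):
    # Equivalent of if(isnull(from_json(col)), col, from_json(col)).
    parsed = from_json(col)
    return col if parsed is None else parsed

print(parse_or_keep_raw('{"a": 1}'))  # {'a': 1}
print(parse_or_keep_raw("  "))        # the raw '  ' is kept for debugging
```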

@MaxGekk
Member Author

MaxGekk commented Dec 30, 2018

I still think the PERMISSIVE mode could be useful even for maps/arrays if a user wants to bypass errors and parse all column values.

@cloud-fan
Contributor

then how about from_json always returns null for corrupted records if the mode is PERMISSIVE?

@MaxGekk
Member Author

MaxGekk commented Jan 2, 2019

then how about from_json always return null for corrupted record if mode is PERMISSIVE?

Does always mean we should return nulls for structs too, right? So it actually means we need to revert the commits related to partial results #23253 & #23120, and for sure #22938.

Frankly speaking, I would avoid this. Just to be clear: the main motivation for rejecting this PR and reverting the 3 above is that the columnNameOfCorruptRecord option cannot be supported for maps/arrays in the PERMISSIVE mode for from_json, am I right?

@cloud-fan
Contributor

yes. I think it's more important to make the behavior of returning struct/array/map in from_json consistent, than making from_json and json data source consistent.

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100881 has finished for PR 23325 at commit 4ea4f31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jan 11, 2019

I am going to close this PR since it is stuck.

@HyukjinKwon
Member

Can we just revert 38628dd and try to change it again? Yea, it's somehow stuck ..

@cloud-fan cloud-fan closed this in 33b5039 Jan 15, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

This PR reverts apache#22938 per discussion in apache#23325

Closes apache#23325

Closes apache#23543 from MaxGekk/return-nulls-from-json-parser.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@MaxGekk MaxGekk deleted the json-ignore-no-tokens branch August 17, 2019 13:36