
[SPARK-26376][SQL] Skip inputs without tokens by JSON datasource #23325

Closed
wants to merge 3 commits into from

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Dec 15, 2018

What changes were proposed in this pull request?

Added a new internal flag for JacksonParser - skipInputWithoutTokens - to control the parser's behavior when its input does not contain any valid JSON tokens. The flag is set to true for the JSON datasource, preserving the behavior the datasource had in Spark 2.4 and earlier. The flag is set to false for JSON functions like from_json.

The flag affects only the handling of bad JSON records without valid JSON tokens at the root level:

  • JSON datasource skips such records in the PERMISSIVE, FAILFAST and DROPMALFORMED modes. JSON datasource does not support ArrayType and MapType as the root type. For StructType, bad records are skipped and excluded from results.
  • from_json:
    • in the FAILFAST mode, throws SparkException for all supported root types - StructType, ArrayType and MapType
    • in the PERMISSIVE mode, returns a Row with nulls for all fields if the specified root type is StructType (the bad record string is placed in the corrupt column if one is specified), and null for ArrayType and MapType.
    • the DROPMALFORMED mode is not supported.

Summary: the PR changes the behavior of the JSON datasource in the PERMISSIVE and FAILFAST modes by skipping bad records without valid JSON tokens.
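The flag's effect can be sketched with a toy Python model (the function and parameter names below are invented for illustration and this is not Spark's actual JacksonParser; it only mimics the behavior described above):

```python
import json

def parse_record(text, skip_input_without_tokens, mode="PERMISSIVE"):
    # Toy model of the proposed handling of inputs with no JSON tokens.
    if text.strip() == "":
        # No valid JSON token at the root level: a "bad" record.
        if skip_input_without_tokens:
            return []            # JSON datasource: the record is skipped
        if mode == "FAILFAST":
            raise ValueError(f"Malformed record: {text!r}")
        return [None]            # from_json, PERMISSIVE: a null result
    return [json.loads(text)]

# Datasource-like behavior (flag on): tokenless input yields no rows.
print(parse_record("   ", skip_input_without_tokens=True))   # []
# from_json-like behavior (flag off, PERMISSIVE): a null result instead.
print(parse_record("   ", skip_input_without_tokens=False))  # [None]
```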

How was this patch tested?

It was tested by JsonSuite.

@MaxGekk
Member Author

MaxGekk commented Dec 15, 2018

@cloud-fan Please, review the PR.

@SparkQA

SparkQA commented Dec 15, 2018

Test build #100193 has finished for PR 23325 at commit 32ec9ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dummySchema,
dummyOption,
allowArrayAsStructs = true,
skipInputWithoutTokens = true)
Member


Shall we have test coverage for both true and false? In the case of false, we should not skip the rows, right?

Contributor


I guess it's handled by from_json tests. If not, let's add a new UT here.

Member Author


Both cases are covered already, for example:

Member


Thank you for clarifying that, @MaxGekk and @cloud-fan .

@@ -17,7 +17,7 @@ displayTitle: Spark SQL Upgrading Guide

- Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`.

- In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings and JSON datasource skips the same independetly of its mode if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
- In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings without valid root JSON tokens (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
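The migration-guide example can be simulated with plain Python (a rough stand-in for `from_json` with the schema `key STRING, value INT`; the helper name and field list are made up for illustration):

```python
import json

SCHEMA_FIELDS = ["key", "value"]  # stands in for "key STRING, value INT"

def from_json_permissive(text):
    # Since Spark 3.0, a tokenless input is a bad record: in PERMISSIVE
    # mode it becomes a row of nulls rather than a bare null.
    try:
        obj = json.loads(text)
    except ValueError:
        obj = {}
    return tuple(obj.get(f) for f in SCHEMA_FIELDS)

print(from_json_permissive(" "))                         # (None, None)
print(from_json_permissive('{"key": "a", "value": 1}'))  # ('a', 1)
```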
Member


Skipping seems to be unclear here. Could you elaborate the difference?

Member Author


Do you want to compare JSON datasource and json functions in this note?

@cloud-fan
Contributor

+1 for fixing this behavior change, thanks!

@HyukjinKwon
Member

How about we just revert 38628dd?

@cloud-fan
Contributor

revert it and reopen a PR with the original commit and this fix? I'm fine with it, but I'm not sure if it's a clean revert. IIRC there are multiple JSON related PRs merged recently.

@HyukjinKwon
Member

HyukjinKwon commented Dec 16, 2018

Hm, I wasn't able to follow why we keep the behaviour in JSON datasource but not in from_json. I was wondering if we better revert it and make a fix again (considering 38628dd already caused a couple of problems that had to be fixed).

@cloud-fan
Contributor

IIUC #22938 tried to change the behavior of from_json intentionally, but changed the behavior of json data source unexpectedly.

This PR is to fix this problem.

@HyukjinKwon
Member

HyukjinKwon commented Dec 16, 2018

Yea, correct. One of the behaviour changes was,

After the change, an empty string (no JSON token) produces a row of nulls (Row(null, null ...)) in both JSON datasource and JSON functions.

Before the change:

  1. JSON datasource previously ignored this (no row) instead of Row(null, null ...).
  2. JSON functions previously produced null instead of Row(null, null ...).

However, we're reverting one case of both now.
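The before/after behavior above can be written down as a small lookup (toy Python; here `[]` encodes a skipped row, `None` encodes SQL null, and a tuple of `None`s encodes `Row(null, ...)` - these encodings are ours, not Spark's):

```python
def parse_empty_input(component, after_change, n_fields=2):
    # What a tokenless input ("" or "  ") yields, per the summary above.
    if after_change:
        return (None,) * n_fields   # both now produce Row(null, null, ...)
    if component == "datasource":
        return []                   # before: the record was ignored (no row)
    return None                     # before: from_json returned null

print(parse_empty_input("datasource", after_change=False))  # []
print(parse_empty_input("from_json", after_change=False))   # None
print(parse_empty_input("from_json", after_change=True))    # (None, None)
```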

@cloud-fan
Contributor

These 2 cases can't be consistent. from_json can't skip input. If it returns null, it's still not consistent with the json data source, which returns Nil.

It's arguable if json data source should follow from_json and treat empty token as invalid input. But I think it's safer to not introduce behavior change if we are not sure one way is better than another.

while for from_json, I do think treating empty token as invalid is better than returning null.

@HyukjinKwon
Member

Hm, the point of view was different. Yea, now I see they can't be consistent. To me, I didn't stand against #22938 because the changes at least looked coherent and consistent at that time.

while for from_json, I do think treating empty token as invalid is better than returning null.

I think this one is also kind of arguable as well ..

Thing is, we currently return null for array types or map types. In that way, we should also produce empty array or empty map as @MaxGekk said at #22938 (comment).

If this is the change alone at #22938, I actually would have been hesitant about going ahead ..

@cloud-fan
Contributor

ah this is a good point. I think PERMISSIVE mode doesn't make sense for array/map as we can't have a special column to put the original token.

Now we have several things to consider together to decide the behavior:

  1. it's from_json or json data source
  2. the result is row or array/map (when it's from_json)
  3. the parse mode
  4. the token is valid or not

@MaxGekk can you describe the behavior you proposed?

@MaxGekk
Member Author

MaxGekk commented Dec 18, 2018

can you describe the behavior you proposed?

@cloud-fan I updated PR's description.

@cloud-fan
Contributor

in the PERMISSIVE mode return a Row with nulls for all fields if specified root type is StructType (bad record string is placed to the corrupt column if it is specified), and null for ArrayType and MapType.

I think this is the most arguable part. The current behavior doesn't look reasonable. I can think of 2 options:

  1. always return null if the token is empty, no matter whether it's a row, array, or map.
  2. never return null if the token is empty. For struct type, return a row with all null fields. For array/map, return an empty array/map.

@HyukjinKwon @MaxGekk any preference?
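The two options can be contrasted with a toy Python sketch (all names invented; `(None, None)` stands for a row with all null fields, and the two-field struct is an arbitrary example):

```python
def parse_empty(root_type, option):
    # Toy comparison of the two proposed behaviors for an empty token.
    if option == 1:
        return None                 # option 1: always null
    # Option 2: never null.
    if root_type == "struct":
        return (None, None)         # a row with all null fields
    return [] if root_type == "array" else {}

print(parse_empty("struct", option=1))  # None
print(parse_empty("array", option=2))   # []
print(parse_empty("map", option=2))     # {}
```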

@MaxGekk
Member Author

MaxGekk commented Dec 19, 2018

always return null if the token is empty, no matter it's row or array or map.

In general, returning nulls for malformed rows is fine. The problem is "embedding" the corrupt column into the user's schema if the root type is StructType, while ignoring the corrupt column for other types. I think extending the user's schema by an additional column wasn't the right design decision. I would wrap the user's root type in a struct with 2 columns - a parsed column of the user-specified type (struct, array, or map) and a column of string type with the unparsed text. If the input wasn't parsed, just put null in the first column, independently of its type.

never return null if the token is empty. For struct type, return a row with all null fields. For array/map, return empty array/map.

This looks like a more consistent approach across supported types, but even it raises some questions:

  • how do we distinguish a row of all nulls, or an empty array/map present in the input, from unparsed input?
  • performance penalty. Probably it could be avoided.
  • if the schema is not flat, let's say `a array<...>, b map<...>`, why do we return a row with nulls instead of a row with an empty array and an empty map?
  • the corrupt column feature is still not supported for arrays and maps

If I need to choose between the 2 approaches above, I would probably prefer the first one. And I would re-implement the columnNameOfCorruptRecord feature to support other types.

@MaxGekk
Member Author

MaxGekk commented Dec 22, 2018

@cloud-fan Can we move forward with the particular changes in the PR?

@cloud-fan
Contributor

It seems also reasonable to accept @HyukjinKwon 's proposal: just revert 38628dd

@MaxGekk
Member Author

MaxGekk commented Dec 24, 2018

It seems also reasonable to accept @HyukjinKwon 's proposal: just revert 38628dd

OK. Let me try to revert the commit locally. I guess it won't revert smoothly; if it doesn't, I can create a PR for that.

@MaxGekk
Member Author

MaxGekk commented Dec 24, 2018

Let me try to revert the commit locally.

@cloud-fan The revert causes conflicts in the migration guide. Let me know if you need a PR.

@cloud-fan
Contributor

@HyukjinKwon what do you think?

@HyukjinKwon
Member

Yea, reverting sounds good to me .. if you guys don't strongly feel about a specific way.

@MaxGekk
Member Author

MaxGekk commented Dec 28, 2018

Let's imagine a situation where a user uses from_json to parse a column of JSON records, and for some of the records the JSON parser wasn't able to detect any JSON tokens. It could be an empty string, a string with a few spaces, or something else. Would it be useful to find out which particular strings the parser cannot parse? If we return such strings in the corrupt column, it could help the user with troubleshooting. Just returning nulls won't give the user any opportunity to debug or to put such strings into a separate dataset. I think it makes sense to keep the current behavior of from_json and revert only the behavior of the JSON datasource, as the PR proposes.

@cloud-fan
Contributor

if it's only for troubleshooting, I guess users can do if(IsNull(FromJson(col)), col, FromJson(col)).

My major concern is that PERMISSIVE mode doesn't work well in from_json, because we can return a map/array. Maybe another choice is: if from_json returns an array/map, we forbid PERMISSIVE mode.
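The troubleshooting trick mentioned above - keeping the raw string whenever parsing yields null - can be sketched in plain Python (a toy stand-in, not Spark's API; note that this toy conflates a literal JSON `null` with a parse failure, which is exactly the ambiguity the expression has in Spark too):

```python
import json

def from_json(text):
    # Toy from_json: None when the input can't be parsed (PERMISSIVE-like).
    try:
        return json.loads(text)
    except ValueError:
        return None

def parse_or_keep_raw(col):
    # Equivalent of if(isnull(from_json(col)), col, from_json(col)).
    parsed = from_json(col)
    return col if parsed is None else parsed

print(parse_or_keep_raw('{"a": 1}'))  # {'a': 1}
print(parse_or_keep_raw("  "))        # the raw '  ' is kept for debugging
```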

@MaxGekk
Member Author

MaxGekk commented Dec 30, 2018

I still think the PERMISSIVE mode could be useful even for maps/arrays if a user wants to bypass errors and parse all column values.

@cloud-fan
Contributor

then how about from_json always returns null for corrupted records if the mode is PERMISSIVE?

@MaxGekk
Member Author

MaxGekk commented Jan 2, 2019

then how about from_json always return null for corrupted record if mode is PERMISSIVE?

Does always mean we should return nulls for structs too, right? So it actually means we need to revert the commits related to partial results #23253 & #23120, and for sure #22938.

Frankly speaking, I would avoid this. Just to be clear: the main motivation for rejecting this PR and reverting the 3 above is that the columnNameOfCorruptRecord option cannot be supported for maps/arrays in the PERMISSIVE mode for from_json, am I right?

@cloud-fan
Contributor

yes. I think it's more important to make the behavior of returning struct/array/map in from_json consistent, than making from_json and json data source consistent.

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100881 has finished for PR 23325 at commit 4ea4f31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jan 11, 2019

I am going to close this PR since it is stuck.

@HyukjinKwon
Member

Can we just revert 38628dd and try to change it again? Yea, it's somehow stuck ..

@cloud-fan cloud-fan closed this in 33b5039 Jan 15, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

This PR reverts apache#22938 per discussion in apache#23325

Closes apache#23325

Closes apache#23543 from MaxGekk/return-nulls-from-json-parser.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@MaxGekk MaxGekk deleted the json-ignore-no-tokens branch August 17, 2019 13:36