[SPARK-26376][SQL] Skip inputs without tokens by JSON datasource #23325
Conversation
@cloud-fan Please review the PR.
Test build #100193 has finished for PR 23325 at commit
```scala
dummySchema,
dummyOption,
allowArrayAsStructs = true,
skipInputWithoutTokens = true)
```
Shall we have test coverage for both `true` and `false`? In the case of `false`, we should not skip the rows, right?
I guess it's handled by `from_json` tests. If not, let's add a new UT here.
Both cases are covered already, for example:
`skipInputWithoutTokens == false` - covered by Lines 547 to 553 in 3238e3d:

```scala
test("SPARK-19543: from_json empty input column") {
  val schema = StructType(StructField("a", IntegerType) :: Nil)
  checkEvaluation(
    JsonToStructs(schema, Map.empty, Literal.create(" ", StringType), gmtId),
    InternalRow(null)
  )
}
```

`skipInputWithoutTokens == true` - covered by https://github.com/apache/spark/pull/23325/files#diff-fde14032b0e6ef8086461edf79a27c5dR2520
Thank you for clarifying that, @MaxGekk and @cloud-fan.
```diff
@@ -17,7 +17,7 @@ displayTitle: Spark SQL Upgrading Guide
   - Since Spark 3.0, the `from_json` function supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` or `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`.
-  - In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings and the JSON datasource skips the same, independently of its mode, if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to the specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if the specified schema is `key STRING, value INT`.
+  - In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings without valid root JSON tokens (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to the specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if the specified schema is `key STRING, value INT`.
```
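The before/after behavior described in that migration-guide note can be sketched outside Spark with a small Python mimic. This is only an illustration of the semantics discussed here, not PySpark; the helper names `from_json_2_4` and `from_json_3_0_permissive` are hypothetical.

```python
import json

def from_json_2_4(text, fields):
    """Spark 2.4 and earlier: input with no valid root JSON token -> plain null."""
    try:
        obj = json.loads(text)
    except ValueError:
        return None  # no row of nulls, just null
    return tuple(obj.get(f) for f in fields)

def from_json_3_0_permissive(text, fields):
    """Spark 3.0 PERMISSIVE: the same input is a bad record -> row of nulls."""
    try:
        obj = json.loads(text)
    except ValueError:
        return tuple(None for _ in fields)  # i.e. Row(null, null) for 2 fields
    return tuple(obj.get(f) for f in fields)

fields = ["key", "value"]  # mimics the schema `key STRING, value INT`
print(from_json_2_4(" ", fields))             # None
print(from_json_3_0_permissive(" ", fields))  # (None, None)
```

With a well-formed input like `{"key": "a", "value": 1}` both versions agree; the divergence is only on token-less input such as `" "`.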
`Skipping` seems to be unclear here. Could you elaborate the difference?
Do you want to compare JSON datasource and json functions in this note?
+1 for fixing this behavior change, thanks!
How about we just revert 38628dd?
> revert it and reopen a PR with the original commit and this fix?

I'm fine with it, but I'm not sure if it's a clean revert. IIRC there are multiple JSON related PRs merged recently.
Hm, I wasn't able to follow why we keep the behaviour in the JSON datasource but not in `from_json`.
IIUC #22938 tried to change the behavior of `from_json`, but changed the JSON datasource's behavior as well. This PR is to fix this problem.
Yea, correct. One of the behaviour changes was: after the change, an empty string (no JSON token) produces a row of nulls; before the change, such input was skipped by the JSON datasource and produced `null` from `from_json`.
However, we're reverting one case of both now.
These 2 cases can't be consistent. It's arguable whether the JSON datasource should follow `from_json`'s behavior, and vice versa.
Hm, the point of view was different. Yea, now I see they can't be consistent. I didn't stand against #22938 because the changes looked at least coherent - they looked consistent at that time.
I think this one is kind of arguable as well. The thing is, we currently return `null` in those cases. If this had been the only change in #22938, I actually would have been hesitant about going ahead.
ah this is a good point. I think PERMISSIVE mode doesn't make sense for array/map as we can't have a special column to put the original token. Now we have several things to consider together to decide the behavior:
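The struct-vs-array/map point above can be sketched in plain Python (not Spark): with a struct root, a malformed record has a named string column where the raw input can be preserved, while an array or map root has no such slot. `_corrupt_record` is Spark's default corrupt-column name; the function names here are illustrative only.

```python
import json

CORRUPT_COL = "_corrupt_record"  # Spark's default corrupt-column name

def parse_struct_permissive(text, fields):
    """Struct root: a bad record keeps its raw text in an extra string column
    alongside the nulled-out data fields."""
    try:
        obj = json.loads(text)
    except ValueError:
        row = {f: None for f in fields}
        row[CORRUPT_COL] = text  # the original token has a place to live
        return row
    row = {f: obj.get(f) for f in fields}
    row[CORRUPT_COL] = None
    return row

def parse_array_permissive(text):
    """Array root: the result is a bare list of values, not a row of named
    columns, so there is nowhere to stash the raw malformed text - the only
    permissive-style answers are null or dropping the record."""
    try:
        return json.loads(text)
    except ValueError:
        return None
```

Calling `parse_struct_permissive(" ", ["key", "value"])` yields null fields plus the raw `" "` in the corrupt column, while `parse_array_permissive(" ")` can only yield `None`.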
@MaxGekk can you describe the behavior you proposed?
@cloud-fan I updated the PR's description.
I think this is the most arguable part. The current behavior looks unreasonable; I can think of 2 options:
@HyukjinKwon @MaxGekk any preference?
In general, returning
This looks like a more consistent approach across supported types, but even it raises some questions:
If I need to make a choice between the 2 approaches above, I would probably prefer the first one. And I would re-implement the
@cloud-fan Can we move forward with the particular changes in this PR?
It also seems reasonable to accept @HyukjinKwon's proposal: just revert 38628dd.
ok. Let me try to revert the commit locally. I guess it won't be reverted smoothly. If not, I could create a PR for that.
@cloud-fan The revert causes conflicts in the migration guide. Let me know if you need a PR.
@HyukjinKwon what do you think?
Yea, reverting sounds good to me, if you guys don't feel strongly about a specific way.
Let's imagine a situation when a user uses
If it's only for troubleshooting, I guess users can do that in other ways. My major concern is,
I still think the
then how about
Frankly speaking, I would avoid this. Just to be clear, the main motivation for rejecting this PR and reverting the 3 above is the
Yes. I think it's more important to make the behavior of returning struct/array/map in
Test build #100881 has finished for PR 23325 at commit
I am going to close this PR since it's stuck.
> Can we just revert 38628dd and try to change it again?

Yea, it's somehow stuck..
## What changes were proposed in this pull request?

This PR reverts apache#22938 per discussion in apache#23325.

Closes apache#23325
Closes apache#23543 from MaxGekk/return-nulls-from-json-parser.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

Added a new internal flag for `JacksonParser` - `skipInputWithoutTokens` - to control the parser's behavior when its input does not contain any valid JSON tokens. The flag is set to `true` for the JSON datasource and preserves the behavior the datasource has in Spark 2.4 and earlier. The flag is set to `false` for JSON functions like `from_json`.

The flag impacts only the handling of bad JSON records without valid JSON tokens at the root level:

JSON datasource: in the `PERMISSIVE`, `FAILFAST` and `DROPMALFORMED` modes. The JSON datasource does not support `ArrayType` and `MapType` as the root type. For `StructType`, bad records are skipped and excluded from results.

`from_json`:
- `FAILFAST` mode throws `SparkException` for all supported root types - `StructType`, `ArrayType` and `MapType`.
- `PERMISSIVE` mode returns a `Row` with `null`s for all fields if the specified root type is `StructType` (the bad record string is placed into the corrupt column if it is specified), and `null` for `ArrayType` and `MapType`.
- `DROPMALFORMED` mode is not supported.

Summary: the PR changes the behavior of the JSON datasource in the `PERMISSIVE` and `FAILFAST` modes by skipping bad records without valid JSON tokens.

How was this patch tested?

It was tested by `JsonSuite`.
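The effect of the flag described above can be sketched with a short Python mimic of the parser's root-token handling. This is not Spark's `JacksonParser`; the function `parse_records` and its shape are hypothetical, and "no valid token" is approximated here by any input `json.loads` rejects.

```python
import json

def parse_records(lines, fields, skip_input_without_tokens):
    """Mimic of the skipInputWithoutTokens flag in PERMISSIVE-style parsing:
    - flag True  (datasource): token-less input is skipped entirely
    - flag False (from_json):  token-less input becomes a row of nulls"""
    out = []
    for line in lines:
        try:
            obj = json.loads(line)
        except ValueError:
            if skip_input_without_tokens:
                continue  # datasource behavior: drop the record
            out.append(tuple(None for _ in fields))  # from_json: bad record
            continue
        out.append(tuple(obj.get(f) for f in fields))
    return out

data = ['{"a": 1}', "  ", '{"a": 2}']
print(parse_records(data, ["a"], True))   # [(1,), (2,)]
print(parse_records(data, ["a"], False))  # [(1,), (None,), (2,)]
```

The two calls show exactly the divergence the flag encodes: the same three-line input yields two rows when skipping is on and three rows (one all-null) when it is off.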