Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47681][FOLLOWUP] Fix schema_of_variant for float inputs. #47058

Closed

Conversation

chenhao-db
Copy link
Contributor

@chenhao-db chenhao-db commented Jun 21, 2024

What changes were proposed in this pull request?

The current schema_of_variant depends on JsonInferSchema.compatibleType to find the common type of two types. This function doesn't handle the case of float x decimal or decimal x float correctly. It doesn't produce the expected result, but consider the two types as incompatible and produces variant. This change doesn't affect the JSON schema inference because it never produces float beforehand.

Why are the changes needed?

It is a bug fix and is required to process floats correctly.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A new unit test that checks all type combinations.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 21, 2024
@chenhao-db
Copy link
Contributor Author

@cloud-fan could you help review? Thanks!

@@ -372,6 +372,11 @@ object JsonInferSchema {
case (DoubleType, _: DecimalType) | (_: DecimalType, DoubleType) =>
DoubleType

// This branch is only used by `SchemaOfVariant.mergeSchema` because `JsonInferSchema` never
// produces `FloatType`.
case (FloatType, _: DecimalType) | (_: DecimalType, FloatType) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a standard for this? These two are not compatible, because float is approximate but decimal is not.

Anyway, I'm fine with this as we already did it for double type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of an existing standard, but I think we can follow the type resolution rules in the add operator:
image

Based on this result, I think it may be better to change the result of float x decimal into double.

Changing decimal into float/double may indeed lose precision, but I think it is a reasonable approach in the shcema inference.

@chenhao-db
Copy link
Contributor Author

@cloud-fan could you help merge it? Thanks!

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 4663b84 Jun 24, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?

The current `schema_of_variant` depends on `JsonInferSchema.compatibleType` to find the common type of two types. This function doesn't handle the case of `float x decimal` or `decimal x float` correctly. It doesn't produce the expected result, but consider the two types as incompatible and produces `variant`. This change doesn't affect the JSON schema inference because it never produces `float` beforehand.

### Why are the changes needed?

It is a bug fix and is required to process floats correctly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new unit test that checks all type combinations.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47058 from chenhao-db/fix_schema_of_variant_float.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants