-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-44840][SQL] Make array_insert()
1-based for negative indexes
#42564
Conversation
@cloud-fan @peter-toth @srielau @srowen Could you review this PR, please. |
array_insert()
1-based for negative indexes
@@ -28,6 +28,7 @@ license: | | |||
- Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`. | |||
- Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`. | |||
- Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`. | |||
- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can change behavior in a minor release. The default has to be to keep current behavior.
The default could change in 4.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we treat it as a correctness bug, then it's fine to fix the result by default, but providing a legacy config to fallback.
To me, it does seem like a correctness bug, as array_insert
uses 1-based positive index, but 0-based negative index.
There are other examples as well:
Since Spark 3.3, the `histogram_numeric` function in Spark SQL returns an output type of an array of structs (x, y), where the type of the 'x' field in the return value is propagated from the input values consumed in the aggregate function. In Spark 3.2 or earlier, 'x' always had double type. Optionally, use the configuration `spark.sql.legacy.histogramNumericPropagateInputType` since Spark 3.3 to revert back to the previous behavior.
In Spark 3.1, statistical aggregation function includes `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, `corr` will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` applied on a single element set. In Spark version 3.0 and earlier, it will return `Double.NaN` in such case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.
We don't wait for the major version bump to fix the wrong behaviors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an aside - I'm still not convinced this is 'incorrect'. The current behavior matches Python's, for example. Is the only other example Snowflake, where inserting at -1 means 'end of list'? If insert's semantics are "insert before position x" then the current behavior seems correct: the last element is element -1, and it ends up in front of it.
What is the problem, what can't you do, what confusion does the current behavior cause? This has been the behavior for a while and doesn't seem to have been an issue.
We can debate this, but, this is a behavior change, so the bar should be high - especially if suggesting this must happen in a minor version!
The reason I think this is actually more problematic to change is:
- anyone using negative indices and array_insert will be broken by this change; you have to rely on the 'wrong' behavior to use it, so are expecting it, unlike other correctness bugs
- The change in behavior is potentially subtle; your query doesn't fail, it just silently starts returning different results, with numbers or strings in a different place than you intend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior is inconsistent with itself.
IFF array_insert() and (other functions) were 0-based then yes it would make sense that 1 is the second element and -1 the second last element.
So to fix this function we either have to make it 0 based or change the behavior for negative indices.
Against going 0-based speaks:
- 1-based was and remains a conscious decision
- We can agree (I think) that negative indices are less common as opposed to positive indices
- Our own docs actually don't have negative index examples and the instructions in the docs are ambiguous.
Lastly the function is relatively new, the blast radius is limited.
Re the proposed (existing)semantic "It goes 'before' the position" I don't think we have anything like that anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the argument: If I insert at position X, then afterwards, element_at X should be that new element?
Then 0-based vs 1-based indices are not really relevant; as it happens, everyone agrees that -1 is the last element, not -0. If that's the argument then I agree.
If the semantics are meant to be: insert before position X, then the current behavior is correct.
If the semantics are meant to be: insert before X, or after X if X is negative (i.e. 'before' but in reverse) then the change is also correct. But that doesn't sound right.
The docs say 'at position X' which leads me to believe the first interpretation is what was meant, but it doesn't really relate to 0- or 1-based indices.
I agree the blast radius is small, but that cuts both ways.
Can we really say this is a correctness fix? You'd have to argue that everyone has expected the new 'correct' behavior, but were getting the old behavior. But both are plausible, and user code must actually rely on the current behavior now.
Is there some other context that makes this urgent for 3.5 which is already at RC2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: array_insert
with position parameter 0 fails with INVALID_INDEX_OF_ZERO
. It's nonsense that you can put an element at the beginning with position 1, but you can't put an element at the end with any negative position. This inconsistency is confusing and sometimes makes the function hard to use.
The reason to fix it in 3.5 is to reduce the impact. We don't want to have one more release (3.5.0) that has the wrong behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good observation. Personally I buy the argument that the insert position should be where it can be retrieved subsequently. OK I can see the argument to change it in 3.5.0 to limit affected versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen You are ok to merge it to branch-3.5, correct?
Merging to master. The PR causes some conflicts in |
In the PR, I propose to make the `array_insert` function 1-based for negative indexes. So, the maximum index is -1 should point out to the last element, and the function should insert new element at the end of the given array for the index -1. The old behaviour can be restored via the SQL config `spark.sql.legacy.negativeIndexInArrayInsert`. 1. To match the behaviour of functions such as `substr()` and `element_at()`. ```sql spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1); b b ``` 2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based. Yes. Before: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","c","b"] ``` After: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","b","c"] ``` By running the modified test suite: ``` $ build/sbt "test:testOnly *CollectionExpressionsSuite" $ build/sbt "test:testOnly *DataFrameFunctionsSuite" $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` Closes apache#42564 from MaxGekk/fix-array_insert. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit ce50a56) Signed-off-by: Max Gekk <max.gekk@gmail.com>
…dexes ### What changes were proposed in this pull request? In the PR, I propose to make the `array_insert` function 1-based for negative indexes. So, the maximum index is -1 should point out to the last element, and the function should insert new element at the end of the given array for the index -1. The old behaviour can be restored via the SQL config `spark.sql.legacy.negativeIndexInArrayInsert`. This is a backport of #42564 ### Why are the changes needed? 1. To match the behaviour of functions such as `substr()` and `element_at()`. ```sql spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1); b b ``` 2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","c","b"] ``` After: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","b","c"] ``` ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "test:testOnly *CollectionExpressionsSuite" $ build/sbt "test:testOnly *DataFrameFunctionsSuite" $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` Closes #42616 from MaxGekk/fix-array_insert-3.5-2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request? In the PR, I propose to make the `array_insert` function 1-based for negative indexes. So, the maximum index is -1 should point out to the last element, and the function should insert new element at the end of the given array for the index -1. The old behaviour can be restored via the SQL config `spark.sql.legacy.negativeIndexInArrayInsert`. ### Why are the changes needed? 1. To match the behaviour of functions such as `substr()` and `element_at()`. ```sql spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1); b b ``` 2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","c","b"] ``` After: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","b","c"] ``` ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "test:testOnly *CollectionExpressionsSuite" $ build/sbt "test:testOnly *DataFrameFunctionsSuite" $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` Closes apache#42564 from MaxGekk/fix-array_insert. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
In the PR, I propose to make the `array_insert` function 1-based for negative indexes. So, the maximum index is -1 should point out to the last element, and the function should insert new element at the end of the given array for the index -1. The old behaviour can be restored via the SQL config `spark.sql.legacy.negativeIndexInArrayInsert`. 1. To match the behaviour of functions such as `substr()` and `element_at()`. ```sql spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1); b b ``` 2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based. Yes. Before: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","c","b"] ``` After: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","b","c"] ``` By running the modified test suite: ``` $ build/sbt "test:testOnly *CollectionExpressionsSuite" $ build/sbt "test:testOnly *DataFrameFunctionsSuite" $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` Closes apache#42564 from MaxGekk/fix-array_insert. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit ce50a56) Signed-off-by: Max Gekk <max.gekk@gmail.com>
…dexes ### What changes were proposed in this pull request? In the PR, I propose to make the `array_insert` function 1-based for negative indexes. So, the maximum index is -1 should point out to the last element, and the function should insert new element at the end of the given array for the index -1. The old behaviour can be restored via the SQL config `spark.sql.legacy.negativeIndexInArrayInsert`. This is a backport of #42564 ### Why are the changes needed? 1. To match the behaviour of functions such as `substr()` and `element_at()`. ```sql spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1); b b ``` 2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","c","b"] ``` After: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","b","c"] ``` ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "test:testOnly *CollectionExpressionsSuite" $ build/sbt "test:testOnly *DataFrameFunctionsSuite" $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit ce50a56) Closes #42655 from MaxGekk/fix-array_insert-3.4. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…dexes ### What changes were proposed in this pull request? In the PR, I propose to make the `array_insert` function 1-based for negative indexes. So, the maximum index is -1 should point out to the last element, and the function should insert new element at the end of the given array for the index -1. The old behaviour can be restored via the SQL config `spark.sql.legacy.negativeIndexInArrayInsert`. This is a backport of apache#42564 ### Why are the changes needed? 1. To match the behaviour of functions such as `substr()` and `element_at()`. ```sql spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1); b b ``` 2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","c","b"] ``` After: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","b","c"] ``` ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "test:testOnly *CollectionExpressionsSuite" $ build/sbt "test:testOnly *DataFrameFunctionsSuite" $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit ce50a56) Closes apache#42655 from MaxGekk/fix-array_insert-3.4. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request? In the PR, I propose to make the `array_insert` function 1-based for negative indexes. So, the maximum index is -1 should point out to the last element, and the function should insert new element at the end of the given array for the index -1. The old behaviour can be restored via the SQL config `spark.sql.legacy.negativeIndexInArrayInsert`. ### Why are the changes needed? 1. To match the behaviour of functions such as `substr()` and `element_at()`. ```sql spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1); b b ``` 2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","c","b"] ``` After: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","b","c"] ``` ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "test:testOnly *CollectionExpressionsSuite" $ build/sbt "test:testOnly *DataFrameFunctionsSuite" $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` Closes apache#42564 from MaxGekk/fix-array_insert. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
In the PR, I propose to make the
array_insert
function 1-based for negative indexes. So, the maximum index is -1 should point out to the last element, and the function should insert new element at the end of the given array for the index -1.The old behaviour can be restored via the SQL config
spark.sql.legacy.negativeIndexInArrayInsert
.Why are the changes needed?
substr()
andelement_at()
.array_insert
in which positive indexes are 1-based, but negative indexes are 0-based.Does this PR introduce any user-facing change?
Yes.
Before:
After:
How was this patch tested?
By running the modified test suite: