
[SPARK-44840][SQL] Make array_insert() 1-based for negative indexes #42564

Closed · wants to merge 15 commits

Conversation

@MaxGekk (Member) commented Aug 18, 2023

What changes were proposed in this pull request?

In the PR, I propose to make the array_insert function 1-based for negative indexes. So, the maximum negative index, -1, should point to the last element, and the function should insert a new element at the end of the given array for index -1.

The old behaviour can be restored via the SQL config spark.sql.legacy.negativeIndexInArrayInsert.

Why are the changes needed?

  1. To match the behaviour of functions such as `substr()` and `element_at()`:
```sql
spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1);
b	b
```
  2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based (see the sketch after this list).
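A sketch of that mismatch, using the legacy mode (hedged: assumes a build that includes this change, so the `spark.sql.legacy.negativeIndexInArrayInsert` config exists and is settable at runtime):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Opt back into the pre-change behavior to show the old inconsistency.
spark.conf.set("spark.sql.legacy.negativeIndexInArrayInsert", "true")

# Positive indexes are 1-based: position 1 is the first slot.
spark.sql("select array_insert(array('a', 'b'), 1, 'c')").show(truncate=False)
# ["c", "a", "b"]

# Legacy negative indexes are effectively 0-based from the end:
# -1 lands *before* the last element instead of at the end.
spark.sql("select array_insert(array('a', 'b'), -1, 'c')").show(truncate=False)
# ["a", "c", "b"]
```

Unsetting the config (the new default) makes -1 append at the end, restoring symmetry with the 1-based positive side.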

Does this PR introduce any user-facing change?

Yes.

Before:
```sql
spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c');
["a","c","b"]
```

After:
```sql
spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c');
["a","b","c"]
```

How was this patch tested?

By running the modified test suite:
```
$ build/sbt "test:testOnly *CollectionExpressionsSuite"
$ build/sbt "test:testOnly *DataFrameFunctionsSuite"
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```

@github-actions github-actions bot added the SQL label Aug 18, 2023
@MaxGekk MaxGekk changed the title [WIP][SPARK-44840][SQL] Make array_insert 1-based for negative indexes [SPARK-44840][SQL] Make array_insert 1-based for negative indexes Aug 21, 2023
@MaxGekk MaxGekk marked this pull request as ready for review August 21, 2023 10:01
@MaxGekk (Member, Author) commented Aug 21, 2023

@cloud-fan @peter-toth @srielau @srowen Could you review this PR, please.

@MaxGekk MaxGekk changed the title [SPARK-44840][SQL] Make array_insert 1-based for negative indexes [SPARK-44840][SQL] Make array_insert() 1-based for negative indexes Aug 21, 2023
@github-actions github-actions bot added the DOCS label Aug 21, 2023
@@ -28,6 +28,7 @@ license: |
- Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`.
- Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`.
- Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`.
- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`.
Member:
I don't think we can change behavior in a minor release. The default has to be to keep current behavior.
The default could change in 4.0.

Contributor:
If we treat it as a correctness bug, then it's fine to fix the result by default, but providing a legacy config to fallback.

To me, it does seem like a correctness bug, as array_insert uses 1-based positive index, but 0-based negative index.

There are other examples as well:

Since Spark 3.3, the `histogram_numeric` function in Spark SQL returns an output type of an array of structs (x, y), where the type of the 'x' field in the return value is propagated from the input values consumed in the aggregate function. In Spark 3.2 or earlier, 'x' always had double type. Optionally, use the configuration `spark.sql.legacy.histogramNumericPropagateInputType` since Spark 3.3 to revert back to the previous behavior.

In Spark 3.1, statistical aggregation function includes `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, `corr` will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` applied on a single element set. In Spark version 3.0 and earlier, it will return `Double.NaN` in such case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.

We don't wait for the major version bump to fix the wrong behaviors.

Member:
As an aside - I'm still not convinced this is 'incorrect'. The current behavior matches Python's, for example. Is the only other example Snowflake, where inserting at -1 means 'end of list'? If insert's semantics are "insert before position x" then the current behavior seems correct: the last element is element -1, and it ends up in front of it.
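For concreteness, plain Python does match the old behavior here (a quick check, independent of Spark):

```python
# Python's list.insert(-1, x) puts x *before* the last element,
# which is exactly Spark's pre-change array_insert result for -1.
xs = ['a', 'b']
xs.insert(-1, 'c')
print(xs)  # ['a', 'c', 'b'] -- same shape as the old ["a","c","b"]
```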

What is the problem, what can't you do, what confusion does the current behavior cause? This has been the behavior for a while and doesn't seem to have been an issue.

We can debate this, but this is a behavior change, so the bar should be high - especially if suggesting this must happen in a minor version!

The reason I think this is actually more problematic to change is:

  • Anyone using negative indices with array_insert will be broken by this change; you have to rely on the 'wrong' behavior to use it, so you are expecting it, unlike other correctness bugs
  • The change in behavior is potentially subtle; your query doesn't fail, it just silently starts returning different results, with numbers or strings in a different place than you intend

@srielau (Contributor) commented Aug 21, 2023:

The behavior is inconsistent with itself.
IFF array_insert() (and other functions) were 0-based, then yes, it would make sense that 1 is the second element and -1 the second-to-last element.
So to fix this function we either have to make it 0-based or change the behavior for negative indices.
Against going 0-based speaks:

  1. 1-based was and remains a conscious decision.
  2. We can agree (I think) that negative indices are less common than positive indices.
  3. Our own docs actually don't have negative-index examples, and the instructions in the docs are ambiguous.

Lastly, the function is relatively new, so the blast radius is limited.

Re the proposed (existing) semantics of "it goes 'before' the position": I don't think we have anything like that anywhere.

@srowen (Member) commented Aug 21, 2023:

Is this the argument: If I insert at position X, then afterwards, element_at X should be that new element?
Then 0-based vs 1-based indices are not really relevant; as it happens, everyone agrees that -1 is the last element, not -0. If that's the argument then I agree.

If the semantics are meant to be: insert before position X, then the current behavior is correct.

If the semantics are meant to be: insert before X, or after X if X is negative (i.e. 'before' but in reverse) then the change is also correct. But that doesn't sound right.

The docs say 'at position X' which leads me to believe the first interpretation is what was meant, but it doesn't really relate to 0- or 1-based indices.
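For illustration, the first interpretation can be stated as an invariant - after array_insert at index x, element_at(x) returns the inserted value - and it holds for both signs only under the new behavior (a hedged PySpark sketch, assuming a build with this change):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Insert then read back at the same index, positive and negative.
spark.sql(
    "select element_at(array_insert(array('a', 'b'),  1, 'c'),  1) as pos, "
    "       element_at(array_insert(array('a', 'b'), -1, 'c'), -1) as neg"
).show()
# New behavior: both columns are 'c'.
# Old behavior: the negative case returns 'b', because -1 inserted
# the new element *before* the last one.
```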

I agree the blast radius is small, but that cuts both ways.

Can we really say this is a correctness fix? You'd have to argue that everyone has expected the new 'correct' behavior but was getting the old behavior. But both are plausible, and user code must actually rely on the current behavior now.

Is there some other context that makes this urgent for 3.5 which is already at RC2?

Contributor:

Note: array_insert with position parameter 0 fails with INVALID_INDEX_OF_ZERO. It's nonsense that you can put an element at the beginning with position 1, but you can't put an element at the end with any negative position. This inconsistency is confusing and sometimes makes the function hard to use.
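For reference, a sketch of the zero-index case mentioned above (hedged: the concrete exception type and message text may vary by version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

try:
    spark.sql("select array_insert(array('a', 'b'), 0, 'c')").collect()
except Exception as e:
    # Fails with error class INVALID_INDEX_OF_ZERO, as described above.
    print(type(e).__name__, e)
```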

The reason to fix it in 3.5 is to reduce the impact. We don't want to have one more release (3.5.0) that has the wrong behavior.

Member:

That's a good observation. Personally I buy the argument that the insert position should be where it can be retrieved subsequently. OK I can see the argument to change it in 3.5.0 to limit affected versions.

Member (Author):

@srowen You are ok to merge it to branch-3.5, correct?

@MaxGekk (Member, Author) commented Aug 22, 2023

Merging to master. The PR causes some conflicts in branch-3.5. I will open a separate PR for 3.5.
Thank you @srielau @cloud-fan @srowen @peter-toth for the review.

@MaxGekk MaxGekk closed this in ce50a56 Aug 22, 2023
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Aug 22, 2023

Closes apache#42564 from MaxGekk/fix-array_insert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ce50a56)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
dongjoon-hyun pushed a commit that referenced this pull request Aug 23, 2023
This is a backport of #42564.

Closes #42616 from MaxGekk/fix-array_insert-3.5-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023

Closes apache#42564 from MaxGekk/fix-array_insert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Aug 24, 2023

Closes apache#42564 from MaxGekk/fix-array_insert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ce50a56)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
dongjoon-hyun pushed a commit that referenced this pull request Aug 24, 2023
This is a backport of #42564.

(cherry picked from commit ce50a56)

Closes #42655 from MaxGekk/fix-array_insert-3.4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
This is a backport of apache#42564.

(cherry picked from commit ce50a56)

Closes apache#42655 from MaxGekk/fix-array_insert-3.4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024

Closes apache#42564 from MaxGekk/fix-array_insert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>