
[SPARK-44840][SQL] Make array_insert() 1-based for negative indexes #42564

Closed · wants to merge 15 commits

Conversation

@MaxGekk (Member) commented Aug 18, 2023

What changes were proposed in this pull request?

In the PR, I propose to make the array_insert function 1-based for negative indexes. So, the maximum negative index, -1, should point to the last element, and the function should insert a new element at the end of the given array for index -1.

The old behaviour can be restored via the SQL config spark.sql.legacy.negativeIndexInArrayInsert.

Why are the changes needed?

  1. To match the behaviour of functions such as `substr()` and `element_at()`:
```sql
spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1);
b	b
```
  2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based (see the sketch after this list).
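A sketch of that mismatch, using the legacy mode (hedged: assumes a build that includes this change, so the `spark.sql.legacy.negativeIndexInArrayInsert` config exists and is settable at runtime):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Opt back into the pre-change behavior to show the old inconsistency.
spark.conf.set("spark.sql.legacy.negativeIndexInArrayInsert", "true")

# Positive indexes are 1-based: position 1 is the first slot.
spark.sql("select array_insert(array('a', 'b'), 1, 'c')").show(truncate=False)
# ["c", "a", "b"]

# Legacy negative indexes are effectively 0-based from the end:
# -1 lands *before* the last element instead of at the end.
spark.sql("select array_insert(array('a', 'b'), -1, 'c')").show(truncate=False)
# ["a", "c", "b"]
```

Unsetting the config (the new default) makes -1 append at the end, restoring symmetry with the 1-based positive side.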

Does this PR introduce any user-facing change?

Yes.

Before:
```sql
spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c');
["a","c","b"]
```

After:
```sql
spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c');
["a","b","c"]
```

How was this patch tested?

By running the modified test suite:
```
$ build/sbt "test:testOnly *CollectionExpressionsSuite"
$ build/sbt "test:testOnly *DataFrameFunctionsSuite"
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```

@github-actions github-actions bot added the SQL label Aug 18, 2023
@MaxGekk MaxGekk changed the title [WIP][SPARK-44840][SQL] Make array_insert 1-based for negative indexes [SPARK-44840][SQL] Make array_insert 1-based for negative indexes Aug 21, 2023
@MaxGekk MaxGekk marked this pull request as ready for review August 21, 2023 10:01
@MaxGekk (Member, Author) commented Aug 21, 2023

@cloud-fan @peter-toth @srielau @srowen Could you review this PR, please.

@MaxGekk MaxGekk changed the title [SPARK-44840][SQL] Make array_insert 1-based for negative indexes [SPARK-44840][SQL] Make array_insert() 1-based for negative indexes Aug 21, 2023
@github-actions github-actions bot added the DOCS label Aug 21, 2023
@@ -28,6 +28,7 @@ license: |
- Since Spark 3.5, Spark thrift server will interrupt task when canceling a running statement. To restore the previous behavior, set `spark.sql.thriftServer.interruptOnCancel` to `false`.
- Since Spark 3.5, Row's json and prettyJson methods are moved to `ToJsonUtil`.
- Since Spark 3.5, the `plan` field is moved from `AnalysisException` to `EnhancedAnalysisException`.
- Since Spark 3.5, the `array_insert` function is 1-based for negative indexes. It inserts new element at the end of input arrays for the index -1. To restore the previous behavior, set `spark.sql.legacy.negativeIndexInArrayInsert` to `true`.
Member:
I don't think we can change behavior in a minor release. The default has to be to keep current behavior.
The default could change in 4.0.

Contributor:
If we treat it as a correctness bug, then it's fine to fix the result by default, but providing a legacy config to fallback.

To me, it does seem like a correctness bug, as array_insert uses 1-based positive index, but 0-based negative index.

There are other examples as well:

Since Spark 3.3, the `histogram_numeric` function in Spark SQL returns an output type of an array of structs (x, y), where the type of the 'x' field in the return value is propagated from the input values consumed in the aggregate function. In Spark 3.2 or earlier, 'x' always had double type. Optionally, use the configuration `spark.sql.legacy.histogramNumericPropagateInputType` since Spark 3.3 to revert back to the previous behavior.

In Spark 3.1, statistical aggregation function includes `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, `corr` will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` applied on a single element set. In Spark version 3.0 and earlier, it will return `Double.NaN` in such case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.

We don't wait for the major version bump to fix the wrong behaviors.

Member:
As an aside - I'm still not convinced this is 'incorrect'. The current behavior matches Python's, for example. Is the only other example Snowflake, where inserting at -1 means 'end of list'? If insert's semantics are "insert before position x" then the current behavior seems correct: the last element is element -1, and it ends up in front of it.
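For concreteness, plain Python does match the old behavior here (a quick check, independent of Spark):

```python
# Python's list.insert(-1, x) puts x *before* the last element,
# which is exactly Spark's pre-change array_insert result for -1.
xs = ['a', 'b']
xs.insert(-1, 'c')
print(xs)  # ['a', 'c', 'b'] -- same shape as the old ["a","c","b"]
```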

What is the problem, what can't you do, what confusion does the current behavior cause? This has been the behavior for a while and doesn't seem to have been an issue.

We can debate this, but this is a behavior change, so the bar should be high - especially if suggesting this must happen in a minor version!

The reason I think this is actually more problematic to change is:

  • Anyone using negative indices with array_insert will be broken by this change; you have to rely on the 'wrong' behavior to use it, so you are expecting it, unlike other correctness bugs
  • The change in behavior is potentially subtle; your query doesn't fail, it just silently starts returning different results, with numbers or strings in a different place than you intend

@srielau (Contributor) commented Aug 21, 2023:

The behavior is inconsistent with itself.
IFF array_insert() (and other functions) were 0-based, then yes, it would make sense that 1 is the second element and -1 the second-to-last element.
So to fix this function we either have to make it 0-based or change the behavior for negative indices.
Against going 0-based speaks:

  1. 1-based was and remains a conscious decision.
  2. We can agree (I think) that negative indices are less common than positive indices.
  3. Our own docs actually don't have negative-index examples, and the instructions in the docs are ambiguous.

Lastly, the function is relatively new, so the blast radius is limited.

Re the proposed (existing) semantics of "it goes 'before' the position": I don't think we have anything like that anywhere.

@srowen (Member) commented Aug 21, 2023:

Is this the argument: If I insert at position X, then afterwards, element_at X should be that new element?
Then 0-based vs 1-based indices are not really relevant; as it happens, everyone agrees that -1 is the last element, not -0. If that's the argument then I agree.

If the semantics are meant to be: insert before position X, then the current behavior is correct.

If the semantics are meant to be: insert before X, or after X if X is negative (i.e. 'before' but in reverse) then the change is also correct. But that doesn't sound right.

The docs say 'at position X' which leads me to believe the first interpretation is what was meant, but it doesn't really relate to 0- or 1-based indices.
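For illustration, the first interpretation can be stated as an invariant - after array_insert at index x, element_at(x) returns the inserted value - and it holds for both signs only under the new behavior (a hedged PySpark sketch, assuming a build with this change):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Insert then read back at the same index, positive and negative.
spark.sql(
    "select element_at(array_insert(array('a', 'b'),  1, 'c'),  1) as pos, "
    "       element_at(array_insert(array('a', 'b'), -1, 'c'), -1) as neg"
).show()
# New behavior: both columns are 'c'.
# Old behavior: the negative case returns 'b', because -1 inserted
# the new element *before* the last one.
```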

I agree the blast radius is small, but that cuts both ways.

Can we really say this is a correctness fix? You'd have to argue that everyone has expected the new 'correct' behavior but was getting the old behavior. But both are plausible, and user code must actually rely on the current behavior now.

Is there some other context that makes this urgent for 3.5 which is already at RC2?

Contributor:

Note: array_insert with position parameter 0 fails with INVALID_INDEX_OF_ZERO. It's nonsense that you can put an element at the beginning with position 1, but you can't put an element at the end with any negative position. This inconsistency is confusing and sometimes makes the function hard to use.
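For reference, a sketch of the zero-index case mentioned above (hedged: the concrete exception type and message text may vary by version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

try:
    spark.sql("select array_insert(array('a', 'b'), 0, 'c')").collect()
except Exception as e:
    # Fails with error class INVALID_INDEX_OF_ZERO, as described above.
    print(type(e).__name__, e)
```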

The reason to fix it in 3.5 is to reduce the impact. We don't want to have one more release (3.5.0) that has the wrong behavior.

Member:

That's a good observation. Personally I buy the argument that the insert position should be where it can be retrieved subsequently. OK I can see the argument to change it in 3.5.0 to limit affected versions.

Member (Author):

@srowen You are ok to merge it to branch-3.5, correct?

@MaxGekk (Member, Author) commented Aug 22, 2023

Merging to master. The PR causes some conflicts in branch-3.5. I will open a separate PR for 3.5.
Thank you @srielau @cloud-fan @srowen @peter-toth for the review.

@MaxGekk MaxGekk closed this in ce50a56 Aug 22, 2023
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Aug 22, 2023

Closes apache#42564 from MaxGekk/fix-array_insert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ce50a56)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
dongjoon-hyun pushed a commit that referenced this pull request Aug 23, 2023
This is a backport of #42564.

Closes #42616 from MaxGekk/fix-array_insert-3.5-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023

Closes apache#42564 from MaxGekk/fix-array_insert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Aug 24, 2023

Closes apache#42564 from MaxGekk/fix-array_insert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ce50a56)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
dongjoon-hyun pushed a commit that referenced this pull request Aug 24, 2023
This is a backport of #42564.

(cherry picked from commit ce50a56)

Closes #42655 from MaxGekk/fix-array_insert-3.4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
This is a backport of apache#42564.

(cherry picked from commit ce50a56)

Closes apache#42655 from MaxGekk/fix-array_insert-3.4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024

Closes apache#42564 from MaxGekk/fix-array_insert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>