[SPARK-39107][SQL] Account for empty string input in regex replace #36457

@LorenzoMartini (Contributor) commented May 5, 2022

What changes were proposed in this pull request?

When trying to perform a regex replace, account for the possibility of having empty strings as input.

Why are the changes needed?

#29891 was merged to address https://issues.apache.org/jira/browse/SPARK-30796 and introduced a bug that prevents regex matching on empty strings: it accounts for the position within the substring but does not consider the case where the input string has length 0 (empty string).

As reported in https://issues.apache.org/jira/browse/SPARK-39107, the behavior differs between Spark versions.
3.0.2

scala> val df = spark.sql("SELECT '' AS col")
df: org.apache.spark.sql.DataFrame = [col: string]

scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show
+---+--------+
|col|replaced|
+---+--------+
|   | <empty>|
+---+--------+

3.1.2

scala> val df = spark.sql("SELECT '' AS col")
df: org.apache.spark.sql.DataFrame = [col: string]

scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show
+---+--------+
|col|replaced|
+---+--------+
|   |        |
+---+--------+

The 3.0.2 outcome is the expected and correct one

Does this PR introduce any user-facing change?

Yes, compared to Spark 3.2.1: it restores the correct behavior when regex-matching empty strings, as shown in the example above.

How was this patch tested?

Added a special-case test to RegexpExpressionsSuite.RegexReplace for empty string replacement.
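
For illustration only, here is a minimal standalone sketch of the guard described under "Why are the changes needed?". It is not Spark's actual code: the object name `RegexReplaceSketch`, the method `replaceFrom`, and the `fixed` flag are hypothetical, and it assumes the replacement is done with `java.util.regex` restricted to a region starting at the requested position. The `fixed = true` branch uses the condition the review further down settles on.

```scala
import java.util.regex.Pattern

// Hypothetical stand-in for the guard discussed in this PR; not Spark's actual code.
object RegexReplaceSketch {
  // `pos` is 1-based and assumed to be >= 1.
  // `fixed` toggles between the pre-fix and the post-fix guard.
  def replaceFrom(source: String, regex: String, rep: String, pos: Int, fixed: Boolean): String = {
    val position = pos - 1
    val shouldMatch =
      if (fixed) position == 0 || position < source.length // also true for "" at pos 1
      else position < source.length                        // false for "": 0 < 0 fails
    if (!shouldMatch) {
      source // returned unchanged, so "^$" never gets a chance to match
    } else {
      val m = Pattern.compile(regex).matcher(source)
      m.region(position, source.length) // only search from `position` onward
      val out = new java.lang.StringBuffer
      while (m.find()) m.appendReplacement(out, rep)
      m.appendTail(out)
      out.toString
    }
  }
}
```

Under these assumptions, `RegexReplaceSketch.replaceFrom("", "^$", "<empty>", 1, fixed = false)` returns the empty string, mirroring the 3.1.2 output above, while the same call with `fixed = true` returns `<empty>`, mirroring 3.0.2.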

@github-actions github-actions bot added the SQL label May 5, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon HyukjinKwon changed the title [SPARK-39107] Account for empty string input in regex replace [SPARK-39107][SQL] Account for empty string input in regex replace May 6, 2022
@HyukjinKwon
Member

cc @beliefer too

@@ -642,7 +642,7 @@ case class RegExpReplace(subject: Expression, regexp: Expression, rep: Expressio
 }
 val source = s.toString()
 val position = i.asInstanceOf[Int] - 1
-if (position < source.length) {
+if (position < source.length || (position == 0 && source.equals(""))) {
Member

This is minor, but could you check source.length == 0 instead? It's faster.
In fact, isn't this equivalent to position == 0 || position < source.length?

Contributor Author

Yes, that's a good point. The compiler actually suggests source.isEmpty(). The default position passed in is 1, so position will be 0 by default, so position == 0 || position < source.length doesn't catch the empty string.
I will replace it with isEmpty.

I would also argue that passing any non-default position together with an empty string for the regex replace is an error on the user's side, so I'm wondering whether we should special-case that.
That is, if the user gives us a regex replace with an empty-string match (^$) and explicitly sets a position, should we just throw? I don't have a strong opinion here and am happy to keep it as is. It just seems a bit odd to specify a position when matching an empty string, which always has length 0.

Contributor Author

Changed to isEmpty in 0e8d785. Thanks for the look @srowen , let me know what you think :)

Member

Hm, how doesn't it catch the empty string? In that case, adding position == 0 || makes the check true iff the string is empty (position < source.length is false).

Contributor Author (@LorenzoMartini, May 6, 2022)

IIUC, position is just i - 1, i.e. the position the user asks the regex replace to start from, no?

case class RegExpReplace(subject: Expression, regexp: Expression, rep: Expression, pos: Expression)

And i would be 1 by default if users don't specify a position:

def this(subject: Expression, regexp: Expression, rep: Expression) =
  this(subject, regexp, rep, Literal(1))

So if the user doesn't specify pos, it is always 1 and therefore position is always 0.

So ultimately the check position == 0 does not address the empty-string case.

Member

If position is 0, then the check is currently true in all cases except source.length == 0 (empty string, right?). Your change makes it true in this case too when position is 0. I think both versions do that?

Contributor Author

Ah I see what you mean now! Yes you are right here, sorry my bad! Will change to that.

Contributor Author

Addressed in 2280c24
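
As a sanity check on the equivalence discussed in this thread, the two conditions can be compared over a small grid of values. This is only a sketch: the object `GuardEquivalence` and its method names are hypothetical, and `length` stands in for `source.length`.

```scala
// Hypothetical helper comparing the two guard conditions from this review thread.
object GuardEquivalence {
  // Condition as first proposed in the PR, with the empty-string test written as length == 0.
  def original(position: Int, length: Int): Boolean =
    position < length || (position == 0 && length == 0)

  // Condition suggested in review.
  def simplified(position: Int, length: Int): Boolean =
    position == 0 || position < length

  // True if both conditions agree for every position in 0..maxPosition
  // and every length in 0..maxLength.
  def agreeOn(maxPosition: Int, maxLength: Int): Boolean =
    (0 to maxPosition).forall { p =>
      (0 to maxLength).forall { l => original(p, l) == simplified(p, l) }
    }
}
```

The reasoning behind the agreement: when position is 0, the original condition reduces to `0 < length || length == 0`, which holds for every non-negative length, and the simplified one is trivially true; when position is greater than 0, both reduce to `position < length`.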

@srowen (Member) left a comment

This change looks OK as a bug fix. I don't know about whether further restrictions are right or not -- maybe so. If in doubt let's just leave this as is to start.

@LorenzoMartini
Contributor Author

Thanks @srowen! Sounds good to me, the constraint would just be a nit and I'm happy to keep it as is to avoid additional complications / excessive fixing. Can we merge this :)?

@srowen
Member

srowen commented May 6, 2022

Seems OK; let me pause a bit to see if @beliefer wants to weigh in

@beliefer (Contributor) left a comment

@LorenzoMartini Thank you for the fix.

checkEvaluation(emptyStringWithPositionOne, "<empty string>", create_row(""))
val emptyStringWithPositionGreater =
  RegExpReplace(Literal(""), Literal("^$"), Literal("<empty string>"), 2)
checkEvaluation(emptyStringWithPositionGreater, "", create_row(""))
Contributor

These test cases should follow the pattern of the others.

val row7 = create_row("", "^$", "<empty string>")
...
checkEvaluation(expr, "<empty string>", row7)
...
checkEvaluation(exprWithExceedLength, "", row7)

Contributor Author

There is a special-case test for the nonNullExpr, so I followed that pattern instead of grouping with the other cases, since this is also a special case and I wanted to keep it separate.
I am happy to change it though. Should I just add row7 above with the other rows, and put the checkEvaluation bits together with the other checkEvaluation bits?

Contributor

OK, we can keep it separate.

Contributor (@beliefer, May 8, 2022)

> I am happy to change it though, should I just add row7 above with the other rows and the checkEvaluation bits together with the other checkEvaluation bits then?

Looks good. Could you change them and let me take a look later?

Contributor Author

@beliefer I changed the test to be grouped with the others in e2d9948. Let me know if that's good :)!

Contributor

@LorenzoMartini LGTM, although I am seeing this late.

@beliefer
Contributor

beliefer commented May 7, 2022

> Seems OK; let me pause a bit to see if @beliefer wants to weigh in

@srowen Thank you for your ping.

@LorenzoMartini
Contributor Author

The GitHub UI shows a failure, but nothing is actually failing here. All comments should be addressed; can we get this merged, please?

@srowen
Member

srowen commented May 9, 2022

All tests show that they pass. I'll wait a beat for last comments, but should be OK.

@srowen
Member

srowen commented May 10, 2022

Question for, maybe, @dongjoon-hyun -- this is basically a bug fix for a behavior change in 3.1.0. I'd merge this to master and 3.3, but what about 3.1.x and 3.2.x, given that fixing this changes behavior again? I'm personally a bit in favor of back-porting all the way, but I'm just seeing if anyone has more thoughts.

@dongjoon-hyun
Member

I'm also +1 for fixing this in all applicable branches, @srowen .

@dongjoon-hyun
Member

BTW, cc @MaxGekk for Apache Spark 3.3.0.

@srowen srowen closed this in 731aa2c May 10, 2022
srowen pushed a commit that referenced this pull request May 10, 2022
Closes #36457 from LorenzoMartini/lmartini/fix-empty-string-replace.

Authored-by: Lorenzo Martini <lmartini@palantir.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 731aa2c)
Signed-off-by: Sean Owen <srowen@gmail.com>
srowen pushed a commit that referenced this pull request May 10, 2022
srowen pushed a commit that referenced this pull request May 10, 2022
@srowen
Member

srowen commented May 10, 2022

Merged to master/3.3/3.2/3.1

kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022