-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-39107][SQL] Account for empty string input in regex replace #36457
[SPARK-39107][SQL] Account for empty string input in regex replace #36457
Conversation
Can one of the admins verify this patch? |
cc @beliefer too |
@@ -642,7 +642,7 @@ case class RegExpReplace(subject: Expression, regexp: Expression, rep: Expressio | |||
} | |||
val source = s.toString() | |||
val position = i.asInstanceOf[Int] - 1 | |||
if (position < source.length) { | |||
if (position < source.length || (position == 0 && source.equals(""))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is minor, but check source.length == 0
instead? faster.
In fact, isn't this equivalent to position == 0 || position < source.length
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's a good point. Compiler actually suggests source.isEmpty()
. The default position passed in is 1 so position will be 0 by default so position == 0 || position < source.length
doesn't catch the empty string.
I will replace with isEmpty
.
I would also argue that any non-default position passed in with an empty string for the regex replace is an error on the user side, so I'm wondering if we should special case that?
As in, if user gives us a regex replace with empty string match (^$
) and explicitly sets a position, we should probably just throw? I don't really have a strong opinion here, happy to keep as is. Seems just a bit odd to specify a position when you are matching on an empty string that should always have length 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, how doesn't it catch the empty string? in that case, adding position == 0 ||
makes the check true iff the string is empty (position < source.length
is false
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIC position
is just i-1
and that is just the position the user asks the regex replace to start from no?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
Line 598 in 08c07a1
case class RegExpReplace(subject: Expression, regexp: Expression, rep: Expression, pos: Expression) |
i
would be 1 by default if users don't specify a position (spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
Lines 601 to 602 in 08c07a1
def this(subject: Expression, regexp: Expression, rep: Expression) = | |
this(subject, regexp, rep, Literal(1)) |
pos
, then it would always be 1 and therefore position
will always be 0.
So ultimately the check position==0
is not addressing the empty string case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If position is 0, then the check is currently true in all cases except source.length == 0
(empty string, right?). Your change makes it true in this case too when position is 0. I think both versions do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see what you mean now! Yes you are right here, sorry my bad! Will change to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 2280c24
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks OK as a bug fix. I don't know about whether further restrictions are right or not -- maybe so. If in doubt let's just leave this as is to start.
Thanks @srowen! Sounds good to me, the constraint would just be a nit and I'm happy to keep it as is to avoid additional complications / excessive fixing. Can we merge this :)? |
Seems OK; let me pause a bit to see if @beliefer wants to weigh in |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LorenzoMartini Thank you for the fix.
checkEvaluation(emptyStringWithPositionOne, "<empty string>", create_row("")) | ||
val emptyStringWithPositionGreater = | ||
RegExpReplace(Literal(""), Literal("^$"), Literal("<empty string>"), 2) | ||
checkEvaluation(emptyStringWithPositionGreater, "", create_row("")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These test cases should follows the others.
val row7 = create_row("", "^$", "<empty string>")
...
checkEvaluation(expr, "<empty string>", row7)
...
checkEvaluation(exprWithExceedLength, "", row7)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a special case testing for the nonNullExpr
so I followed that pattern instead of grouping with the other cases, since this is also a special case and wanted to keep it separate.
I am happy to change it though, should I just add row7
above with the other rows and the checkEvaluation
bits together with the other checkEvauation
bits then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. we can keep it separate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to change it though, should I just add
row7
above with the other rows and thecheckEvaluation
bits together with the othercheckEvauation
bits then?
Looks good. Could you change them and let me see later ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LorenzoMartini LGTM although I see it later.
Github UI shows failing but nothing is actually failing here. All comments should be addressed, can we get a merge on this please? |
All tests show that they pass. I'll wait a beat for last comments, but should be OK. |
Question for, maybe, @dongjoon-hyun -- this is basically a bug fix for a behavior change in 3.1.0. I'd merge this to master and 3.3, but what about 3.1.x and 3.2.x? because fixing this does change behavior again. I'm personally a bit in favor of back-porting all the way, but just seeing if anyone has more thoughts. |
I'm also +1 for fixing this in all applicable branches, @srowen . |
BTW, cc @MaxGekk for Apache Spark 3.3.0. |
### What changes were proposed in this pull request? When trying to perform a regex replace, account for the possibility of having empty strings as input. ### Why are the changes needed? #29891 was merged to address https://issues.apache.org/jira/browse/SPARK-30796 and introduced a bug that would not allow regex matching on empty strings, as it would account for position within substring but not consider the case where input string has length 0 (empty string) From https://issues.apache.org/jira/browse/SPARK-39107 there is a change in behavior between spark versions. 3.0.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | <empty>| +---+--------+ ``` 3.1.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | | +---+--------+ ``` The 3.0.2 outcome is the expected and correct one ### Does this PR introduce _any_ user-facing change? Yes compared to spark 3.2.1, as it brings back the correct behavior when trying to regex match empty strings, as shown in the example above. ### How was this patch tested? Added special casing test in `RegexpExpressionsSuite.RegexReplace` with empty string replacement. Closes #36457 from LorenzoMartini/lmartini/fix-empty-string-replace. Authored-by: Lorenzo Martini <lmartini@palantir.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 731aa2c) Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request? When trying to perform a regex replace, account for the possibility of having empty strings as input. ### Why are the changes needed? #29891 was merged to address https://issues.apache.org/jira/browse/SPARK-30796 and introduced a bug that would not allow regex matching on empty strings, as it would account for position within substring but not consider the case where input string has length 0 (empty string) From https://issues.apache.org/jira/browse/SPARK-39107 there is a change in behavior between spark versions. 3.0.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | <empty>| +---+--------+ ``` 3.1.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | | +---+--------+ ``` The 3.0.2 outcome is the expected and correct one ### Does this PR introduce _any_ user-facing change? Yes compared to spark 3.2.1, as it brings back the correct behavior when trying to regex match empty strings, as shown in the example above. ### How was this patch tested? Added special casing test in `RegexpExpressionsSuite.RegexReplace` with empty string replacement. Closes #36457 from LorenzoMartini/lmartini/fix-empty-string-replace. Authored-by: Lorenzo Martini <lmartini@palantir.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 731aa2c) Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request? When trying to perform a regex replace, account for the possibility of having empty strings as input. ### Why are the changes needed? #29891 was merged to address https://issues.apache.org/jira/browse/SPARK-30796 and introduced a bug that would not allow regex matching on empty strings, as it would account for position within substring but not consider the case where input string has length 0 (empty string) From https://issues.apache.org/jira/browse/SPARK-39107 there is a change in behavior between spark versions. 3.0.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | <empty>| +---+--------+ ``` 3.1.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | | +---+--------+ ``` The 3.0.2 outcome is the expected and correct one ### Does this PR introduce _any_ user-facing change? Yes compared to spark 3.2.1, as it brings back the correct behavior when trying to regex match empty strings, as shown in the example above. ### How was this patch tested? Added special casing test in `RegexpExpressionsSuite.RegexReplace` with empty string replacement. Closes #36457 from LorenzoMartini/lmartini/fix-empty-string-replace. Authored-by: Lorenzo Martini <lmartini@palantir.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 731aa2c) Signed-off-by: Sean Owen <srowen@gmail.com>
Merged to master/3.3/3.2/3.1 |
### What changes were proposed in this pull request? When trying to perform a regex replace, account for the possibility of having empty strings as input. ### Why are the changes needed? apache#29891 was merged to address https://issues.apache.org/jira/browse/SPARK-30796 and introduced a bug that would not allow regex matching on empty strings, as it would account for position within substring but not consider the case where input string has length 0 (empty string) From https://issues.apache.org/jira/browse/SPARK-39107 there is a change in behavior between spark versions. 3.0.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | <empty>| +---+--------+ ``` 3.1.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | | +---+--------+ ``` The 3.0.2 outcome is the expected and correct one ### Does this PR introduce _any_ user-facing change? Yes compared to spark 3.2.1, as it brings back the correct behavior when trying to regex match empty strings, as shown in the example above. ### How was this patch tested? Added special casing test in `RegexpExpressionsSuite.RegexReplace` with empty string replacement. Closes apache#36457 from LorenzoMartini/lmartini/fix-empty-string-replace. Authored-by: Lorenzo Martini <lmartini@palantir.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 731aa2c) Signed-off-by: Sean Owen <srowen@gmail.com>
What changes were proposed in this pull request?
When trying to perform a regex replace, account for the possibility of having empty strings as input.
Why are the changes needed?
#29891 was merged to address https://issues.apache.org/jira/browse/SPARK-30796 and introduced a bug that would not allow regex matching on empty strings, as it would account for position within substring but not consider the case where input string has length 0 (empty string)
From https://issues.apache.org/jira/browse/SPARK-39107 there is a change in behavior between spark versions.
3.0.2
3.1.2
The 3.0.2 outcome is the expected and correct one
Does this PR introduce any user-facing change?
Yes compared to spark 3.2.1, as it brings back the correct behavior when trying to regex match empty strings, as shown in the example above.
How was this patch tested?
Added special casing test in
RegexpExpressionsSuite.RegexReplace
with empty string replacement.