[SPARK-49902][SQL] Catch underlying runtime errors in RegExpReplace #48379
base: master
@cloud-fan Can you look at this PR? Thanks!
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLRegexpSuite.scala
try {
  m.appendReplacement(result, lastReplacement)
} catch {
  case e: Exception =>
Can't you catch a more specific exception here?
I'm not 100% sure. Based on what I can see in the underlying Java file, I think the most restrictive choice we can make is RuntimeException, which is a direct subclass of Exception. But I still think Exception is the safest choice, since there might be errors that don't fall in the RuntimeException category.
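For reference, the failure modes discussed in this thread can be reproduced with plain java.util.regex, outside Spark. The sketch below is only an illustration: the object name AppendReplacementFailures and the helper attempt are hypothetical and not part of this PR. It shows the two replacement-string failures documented for Matcher.appendReplacement, both of which are RuntimeException subclasses; it does not rule out other error types.

```scala
import java.util.regex.Pattern
import scala.util.Try

// Hypothetical helper object, for illustration only (not part of the PR).
object AppendReplacementFailures {
  // Runs the find/appendReplacement loop and captures any non-fatal failure in a Try.
  def attempt(regex: String, input: String, replacement: String): Try[String] = Try {
    val m = Pattern.compile(regex).matcher(input)
    val sb = new java.lang.StringBuffer()
    while (m.find()) {
      m.appendReplacement(sb, replacement)
    }
    m.appendTail(sb)
    sb.toString
  }

  val pattern = "(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)"

  // The pattern has only two groups, so a reference to group 3 fails
  // with an IndexOutOfBoundsException.
  val missingGroup: Try[String] = attempt(pattern, "first last", "$3 $1")

  // A reference to a named group that does not exist fails
  // with an IllegalArgumentException.
  val missingName: Try[String] = attempt(pattern, "first last", "${middle}")

  // A valid replacement, for comparison: swaps the two words.
  val swapped: Try[String] = attempt(pattern, "first last", "$2 $1")
}
```

Both failures above would be caught by `case e: RuntimeException`, though that alone does not show that no other exception type can escape the surrounding code.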
sql(s"CREATE TABLE IF NOT EXISTS $tableName(s STRING)")
sql(s"INSERT INTO $tableName VALUES('first last')")
val query = s"SELECT regexp_replace(s, '(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)', " +
  s"'$$3 $$1') FROM $tableName"
I don't know. Do any other databases support regexp_replace(s, '(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)', '$$3 $$1')?
I guess the designer of RegExpReplace didn't realize this behavior was possible (in fact, it's a bug). We should forbid it or consider a better way.
What changes were proposed in this pull request?
Previously, runtime errors raised by underlying libraries were not caught in the RegExpReplace expression and were thrown directly to the user. For example, it wouldn't be uncommon to see issues like
java.lang.IndexOutOfBoundsException: No group 3
This PR catches these underlying errors and throws a SparkException instead, which details the input on which the operation failed. The new SparkException looks something like
org.apache.spark.SparkException: Could not perform regexp_replace for source = <source>, pattern = <pattern>, replacement = <replacement> and position = <position>
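The catch-and-rethrow idea can be sketched outside Spark as follows. This is a minimal illustration under stated assumptions, not the code in this PR: RegexpReplaceException is a hypothetical stand-in for org.apache.spark.SparkException so the example runs without Spark on the classpath, RegexpReplaceSketch and its regexpReplace method are illustrative names, and position is assumed to be 1-based.

```scala
import java.util.regex.Pattern

// Hypothetical stand-in for org.apache.spark.SparkException (assumption of this sketch).
final class RegexpReplaceException(message: String, cause: Throwable)
  extends RuntimeException(message, cause)

// Illustrative object and method names; not the actual RegExpReplace expression.
object RegexpReplaceSketch {
  def regexpReplace(
      source: String,
      pattern: String,
      replacement: String,
      position: Int = 1): String = {
    val m = Pattern.compile(pattern).matcher(source)
    val result = new java.lang.StringBuffer()
    try {
      // Assumption: position is 1-based, so matching starts at index position - 1.
      m.region(position - 1, source.length)
      while (m.find()) {
        m.appendReplacement(result, replacement)
      }
      m.appendTail(result)
      result.toString
    } catch {
      // Mirrors the broad catch in the reviewed diff; the original error is kept as the cause.
      case e: Exception =>
        throw new RegexpReplaceException(
          s"Could not perform regexp_replace for source = $source, pattern = $pattern, " +
            s"replacement = $replacement and position = $position",
          e)
    }
  }
}
```

With this sketch, a call such as RegexpReplaceSketch.regexpReplace("first last", "(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)", "$3 $1") raises the wrapper exception, and the original IndexOutOfBoundsException remains available through getCause.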
Why are the changes needed?
Two reasons. First, the new exception details the input row on which the error occurred, which makes it easier for users to debug the query and for Spark developers to identify bugs. Second, a SparkException is generally considered expected behavior, indicating that there were no unintended issues in the query's execution.
Does this PR introduce any user-facing change?
Yes, a better exception is thrown when RegExpReplace fails.
How was this patch tested?
Unit tests in both codegen and interpreted modes.
Was this patch authored or co-authored using generative AI tooling?
No.