[SPARK-49902][SQL] Catch underlying runtime errors in RegExpReplace #48379

Open: wants to merge 11 commits into base: master

harshmotw-db
Contributor

What changes were proposed in this pull request?

Previously, runtime errors raised by underlying libraries were not caught in the RegExpReplace expression, so they were thrown directly to the user. For example, it was not uncommon to see errors such as java.lang.IndexOutOfBoundsException: No group 3. This PR catches these underlying errors and throws a SparkException instead, which details the input on which the operation failed. The new SparkException looks something like org.apache.spark.SparkException: Could not perform regexp_replace for source = <source>, pattern = <pattern>, replacement = <replacement> and position = <position>.
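To make the described behavior concrete, here is a minimal sketch of the wrap-and-rethrow idea using only java.util.regex. It is not the code from this PR: `RegExpReplaceSketch`, `ReplaceFailure`, and `replace` are hypothetical names, `ReplaceFailure` is a stand-in for SparkException so the sketch needs no Spark dependency, and the sketch assumes `position` is 1-based and at least 1.

```scala
import java.util.regex.Pattern

object RegExpReplaceSketch {
  // Hypothetical stand-in for org.apache.spark.SparkException, so the sketch
  // compiles without a Spark dependency.
  final class ReplaceFailure(message: String, cause: Throwable)
    extends RuntimeException(message, cause)

  // Assumption: `position` is 1-based and >= 1.
  def replace(source: String, pattern: String, replacement: String, position: Int): String = {
    val start = position - 1
    if (start != 0 && start >= source.length) {
      // The start offset lies at or past the end of the input: nothing to replace.
      source
    } else {
      val m = Pattern.compile(pattern).matcher(source)
      m.region(start, source.length)
      val result = new java.lang.StringBuffer()
      try {
        while (m.find()) {
          // Throws IndexOutOfBoundsException when `replacement` references a
          // capturing group that the pattern does not define (for example $3).
          m.appendReplacement(result, replacement)
        }
      } catch {
        case e: Exception =>
          // Rethrow with the inputs attached, keeping the original error as the cause.
          throw new ReplaceFailure(
            s"Could not perform regexp_replace for source = $source, pattern = $pattern, " +
              s"replacement = $replacement and position = $position",
            e)
      }
      m.appendTail(result)
      result.toString
    }
  }
}
```

With the pattern `(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)`, the replacement `$2 $1` succeeds, while `$3 $1` fails inside `appendReplacement` and surfaces as the wrapped exception with the inputs in its message.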

Why are the changes needed?

Two reasons. First, the new exception identifies the input (source, pattern, replacement, and position) on which the error occurred, which makes it easier for users to debug the query and for Spark developers to identify bugs. Second, a SparkException is generally treated as expected behavior: it signals that the failure was anticipated by Spark rather than being an unintended issue in the query's execution.

Does this PR introduce any user-facing change?

Yes, a better exception is thrown when RegExpReplace fails.

How was this patch tested?

Unit tests in both codegen and interpreted modes.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot removed the CONNECT label Oct 8, 2024
@harshmotw-db harshmotw-db marked this pull request as ready for review October 8, 2024 02:07
@harshmotw-db
Contributor Author

@cloud-fan Can you look at this PR? Thanks!

try {
  m.appendReplacement(result, lastReplacement)
} catch {
  case e: Exception =>
Member

Can't you catch a more specific exception here?

Contributor Author

I'm not 100% sure. Based on what I can see in the underlying Java file, I think the most restrictive choice we can make is RuntimeException, which is a direct subclass of Exception. But I still think Exception is the safest choice, since there might be potential errors that don't fall into the RuntimeException category.
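For context on this exchange, the Javadoc for `Matcher.appendReplacement` lists three exceptions: IllegalStateException, IllegalArgumentException, and IndexOutOfBoundsException, all of which are subclasses of RuntimeException. The sketch below probes two of them; `ReplacementErrorSketch` and `failureFor` are hypothetical names used only for illustration, and it does not settle whether other code paths in Spark could raise something else.

```scala
import java.util.regex.Pattern

object ReplacementErrorSketch {
  private val pattern = Pattern.compile("(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)")

  // Returns the unchecked exception raised while applying `replacement`, if any.
  def failureFor(replacement: String): Option[RuntimeException] = {
    val m = pattern.matcher("first last")
    val sink = new java.lang.StringBuffer()
    try {
      while (m.find()) {
        m.appendReplacement(sink, replacement)
      }
      None
    } catch {
      // Both failure modes probed here are RuntimeException subclasses.
      case e: RuntimeException => Some(e)
    }
  }
}
```

Here `$3 $1` yields an IndexOutOfBoundsException (no such numbered group) and `${middle}` yields an IllegalArgumentException (no such named group), so a `RuntimeException` handler covers both; catching `Exception` is the broader, more defensive choice argued for above.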

sql(s"CREATE TABLE IF NOT EXISTS $tableName(s STRING)")
sql(s"INSERT INTO $tableName VALUES('first last')")
val query = s"SELECT regexp_replace(s, '(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)', " +
  s"'$$3 $$1') FROM $tableName"
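A note on the `$$` in this test: inside a Scala `s"..."` interpolated string, `$$` is the escape for a literal dollar sign, so the SQL that actually reaches Spark contains `'$3 $1'`. The sketch below rebuilds the query string outside of a test harness; `InterpolationEscapeSketch` and `buildQuery` are hypothetical names.

```scala
object InterpolationEscapeSketch {
  // Builds the same query text as the test above for a given table name.
  def buildQuery(tableName: String): String =
    s"SELECT regexp_replace(s, '(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)', " +
      s"'$$3 $$1') FROM $tableName" // $$ becomes a single literal $
}
```

The replacement therefore references group 3, which the two-group pattern does not define, and that is what triggers the failure the test expects.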
Contributor

I don't know. Do any other databases support regexp_replace(s, '(?<first>[a-zA-Z]+) (?<last>[a-zA-Z]+)', '$$3 $$1')?

Contributor Author

@harshmotw-db harshmotw-db Oct 16, 2024

It appears that MySQL behaves similarly to Spark SQL.
Valid Query (notice $2 instead of $3):
Screenshot 2024-10-16 at 3 18 57 PM

Invalid Query (notice $3):
Screenshot 2024-10-16 at 3 19 25 PM

Contributor

I guess the designer of RegExpReplace didn't realize this feature exists (in fact, it is a bug here). We should forbid it or consider a better approach.

3 participants