[SPARK-50858][PYTHON] Add configuration to hide Python UDF stack trace #49535
base: master
Changes from 6 commits
1e2a62c
e038bb8
c5999fb
31261b1
3f509be
c4f40b0
8901680
@@ -462,22 +462,40 @@ def wrapped(*args: Any, **kwargs: Any) -> Any:
     return f  # type: ignore[return-value]
 
 
-def handle_worker_exception(e: BaseException, outfile: IO) -> None:
+def handle_worker_exception(
+    e: BaseException, outfile: IO, hide_traceback: Optional[bool] = None
+) -> None:

Review thread on the new hide_traceback parameter:
- Do we need to pass hide_traceback?
- It's optional, and uses the value of SPARK_HIDE_TRACEBACK by default (see docstring).
- I don't see any place that passes this parameter except for the tests.
- We can add more tests for the override behaviour.
""" | ||
Handles exception for Python worker which writes SpecialLengths.PYTHON_EXCEPTION_THROWN (-2) | ||
and exception traceback info to outfile. JVM could then read from the outfile and perform | ||
exception handling there. | ||
|
||
Parameters | ||
---------- | ||
e : BaseException | ||
Exception handled | ||
outfile : IO | ||
IO object to write the exception info | ||
hide_traceback : bool, optional | ||
Whether to hide the traceback in the output. | ||
By default, hides the traceback if environment variable SPARK_HIDE_TRACEBACK is set. | ||
Comment on lines
+473
to
+481
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for adding the parameters here! |
||
""" | ||
-    try:
-        exc_info = None
+    if hide_traceback is None:
+        hide_traceback = bool(os.environ.get("SPARK_HIDE_TRACEBACK", False))
+
+    def format_exception() -> str:
+        if hide_traceback:
+            return "".join(traceback.format_exception_only(type(e), e))
         if os.environ.get("SPARK_SIMPLIFIED_TRACEBACK", False):
             tb = try_simplify_traceback(sys.exc_info()[-1])  # type: ignore[arg-type]
             if tb is not None:
                 e.__cause__ = None
-                exc_info = "".join(traceback.format_exception(type(e), e, tb))
-        if exc_info is None:
-            exc_info = traceback.format_exc()
+                return "".join(traceback.format_exception(type(e), e, tb))
+        return traceback.format_exc()
 
+    try:
+        exc_info = format_exception()
         write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
         write_with_length(exc_info.encode("utf-8"), outfile)
     except IOError:
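For context on what "hiding" means in practice, here is a minimal standalone sketch (plain Python, independent of Spark; the exception class and message are made up) contrasting the two formatting calls used in format_exception above:

import traceback


def make_error() -> BaseException:
    try:
        raise ValueError("unsupported value for column 'x'")
    except ValueError as exc:
        return exc


e = make_error()

# Default path: the full traceback with stack frames (what the worker emits today).
full = "".join(traceback.format_exception(type(e), e, e.__traceback__))

# hide_traceback path: exception type and message only, no frames.
short = "".join(traceback.format_exception_only(type(e), e))

print(full)   # Traceback (most recent call last): ... ValueError: unsupported value for column 'x'
print(short)  # ValueError: unsupported value for column 'x'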
@@ -3475,6 +3475,15 @@ object SQLConf {
       .checkValues(Set("legacy", "row", "dict"))
       .createWithDefaultString("legacy")
 
+  val PYSPARK_HIDE_TRACEBACK =
+    buildConf("spark.sql.execution.pyspark.udf.hideTraceback.enabled")
+      .doc(
+        "When true, only show the message of the exception from Python UDFs, " +
+          "hiding the stack trace.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(false)
+
   val PYSPARK_SIMPLIFIED_TRACEBACK =
     buildConf("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled")
       .doc(

Review thread on the .doc(...) description:
- We may want to describe a bit more about the relationship between this and simplifiedTraceback.
- Yeah, simplifiedTraceback is not applicable if hideTraceback is set, unless the caller sets the parameter hide_traceback=False to override the config. I'll update the description to reflect this.

Review thread on the conf's type and default:
- Another way is to create this conf as an int and show the max depth of the stack trace, but I don't feel strongly.
- Is there a use case where we only want to show the last k frames of the stack? I'm under the impression that we want to show the full stack trace for most exceptions, and completely hide the stack trace for specific library exceptions when the message is sufficient to identify the reason.
@@ -6286,6 +6295,8 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
 
   def pandasStructHandlingMode: String = getConf(PANDAS_STRUCT_HANDLING_MODE)
 
+  def pysparkHideTraceback: Boolean = getConf(PYSPARK_HIDE_TRACEBACK)
+
   def pysparkSimplifiedTraceback: Boolean = getConf(PYSPARK_SIMPLIFIED_TRACEBACK)
 
   def pandasGroupedMapAssignColumnsByName: Boolean =
Review thread on how the conf is read:
- Can we use conf.get(PYSPARK_HIDE_TRACEBACK) here so that we don't need to override every subclass?
- The config is defined in org.apache.spark.sql.internal.SQLConf, which seems to be inaccessible from here. For reference, PYSPARK_SIMPLIFIED_TRACEBACK is also defined in SQLConf, so BasePythonRunner subclasses have to override it. Is there an advantage to putting it in SQLConf rather than e.g. org.apache.spark.internal.config.Python?
- The conf in SQLConf is a session-based conf that can also be set at runtime, whereas any conf in the core module or StaticSQLConf is a cluster-wide conf and can't be changed while the cluster is running.
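Because the conf is defined in SQLConf it is session-scoped and can be changed at runtime, as the last comment notes. A hypothetical end-to-end usage sketch from PySpark (assuming a build that includes this patch; the UDF and its error message are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# Session-scoped SQL conf: can be toggled at runtime, unlike core/StaticSQLConf settings.
spark.conf.set("spark.sql.execution.pyspark.udf.hideTraceback.enabled", "true")


@udf("int")
def failing(x):
    raise ValueError("bad input")


try:
    spark.range(1).select(failing("id")).collect()
except Exception as e:
    # With hideTraceback enabled, the Python worker should report only
    # "ValueError: bad input" instead of the full worker stack trace.
    print(e)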