CPython profiler broken with TensorFlow 2.17.0 code in Python 3.12.1+ #122029
Comments
Spent a few hours in this issue and I think I know at least part of the issue. The fundamental cause of the assertion failure was that a function call was not recorded, but the return was.
import sys
from tensorflow import DType
from tensorflow.core.framework import types_pb2
def trace(frame, event, arg):
    if event == "call" or event == "c_call":
        print(event, frame, frame.f_code)
    elif event == "return" or event == "c_return" or event == "c_exception":
        print(event, frame, frame.f_code)
sys.setprofile(trace)
DType(types_pb2.DT_RESOURCE)
sys.setprofile(None)
Will generate (subtract warnings)
call <frame at 0x7fed2f4039f0, file '/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py', line 73, code __init__> <code object __init__ at 0x7fecdeaec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73>
c_return <frame at 0x7fed2f4039f0, file '/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py', line 74, code __init__> <code object __init__ at 0x7fecdeaec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73>
return <frame at 0x7fed2f4039f0, file '/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py', line 81, code __init__> <code object __init__ at 0x7fecdeaec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73>
c_call <frame at 0x7fecd70d9900, file '/home/gaogaotiantian/programs/bilibili_video/scrabble.py', line 33, code <module>> <code object <module> at 0x211a330, file "/home/gaogaotiantian/programs/bilibili_video/scrabble.py", line 1>
A The reason is that, it's a You can catch the relevant events with
import sys
from tensorflow import DType
from tensorflow.core.framework import types_pb2
E = sys.monitoring.events
sys.monitoring.use_tool_id(0, "test")
def call_callback(code, instruction_offset, callable, arg0):
    print("call", code, instruction_offset, callable, arg0)
def return_callback(code, instruction_offset, callable, arg0):
    print("return", code, instruction_offset, callable, arg0)
sys.monitoring.register_callback(0, E.CALL, call_callback)
sys.monitoring.register_callback(0, E.C_RETURN, return_callback)
sys.monitoring.set_events(0, E.CALL)
DType(types_pb2.DT_RESOURCE)
sys.monitoring.set_events(0, E.NO_EVENTS)
Notice the
and its corresponding
They have different callables - because
Even though we kind of knew the cause of the issue, fixing it is not trivial. The most important thing - we can't easily repro this without 3rd party libraries, because CPython does not do this anymore. I don't know the exact next step, so maybe @markshannon can give some suggestions?
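The "return recorded without a call" symptom can be checked mechanically. Below is a minimal sketch, assuming only the standard library: it pairs every c_call seen by sys.setprofile with the following c_return or c_exception and reports anything left over. The helper name find_unbalanced_c_events and the len() workload are made up for illustration, they are not from this thread.

```python
import sys

def find_unbalanced_c_events(workload):
    """Run workload() under sys.setprofile and report unpaired C events."""
    pending = []    # callables that produced c_call but no c_return yet
    unmatched = []  # (event, callable) returns that had no pending c_call

    def tracer(frame, event, arg):
        if arg is sys.setprofile:
            # Skip the calls that install and remove the tracer itself.
            return
        if event == "c_call":
            pending.append(arg)
        elif event in ("c_return", "c_exception"):
            # Use == rather than `is`, in case the two events carry
            # distinct but equal callable objects.
            if pending and pending[-1] == arg:
                pending.pop()
            else:
                unmatched.append((event, arg))

    sys.setprofile(tracer)
    try:
        workload()
    finally:
        sys.setprofile(None)
    return unmatched, pending


# Hypothetical workload: a plain builtin call, which should pair up cleanly.
unmatched, pending = find_unbalanced_c_events(lambda: len("abc"))
print("returns without a call:", unmatched)
print("calls without a return:", pending)
```

Passing a callable that runs the failing code instead of the lambda would list the events that do not pair up.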
I found a pure CPython repro and I'll try to fix this.
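For readers who want to experiment without TensorFlow: the commit messages further down describe the case as "a method with a C function". One way to build such an object with only the standard library is classmethod() around a builtin. This is a sketch under that assumption and not necessarily the repro mentioned above. The names A and record_profile_events are hypothetical.

```python
import sys
import types

class A:
    # Assumption: wrapping a builtin in classmethod() gives a bound method
    # object whose __func__ is a C function.
    f = classmethod(repr)

def record_profile_events(func):
    """Return the sys.setprofile event names seen while calling func()."""
    events = []
    sys.setprofile(lambda frame, event, arg: events.append(event))
    try:
        func()
    finally:
        sys.setprofile(None)
    return events

bound = A.f
# Show that this really is a Python-level method around a builtin.
print(type(bound).__name__, "wrapping", type(bound.__func__).__name__)
# The exact event list depends on the interpreter version, so no expected
# output is shown here.
print(record_profile_events(bound))
```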
@gaogaotiantian Thanks for doing the triaging and analysis. We definitely shouldn't be claiming we are calling one object and then calling another. Maybe we should apply your fix to 3.12 and try the more intrusive fix for 3.13 onward?
Right, so there are multiple conflicts in this issue.
Overall, I believe we need clear rules for both
I don't think it matters that much. These two are basically the same thing:
What is important is that the events match what actually happens. If we make sure that Generally, we want to unpack bound methods first, because it keeps the C stack use consistent regardless of how functions are called. That way a slight change in a call doesn't blow the stack.
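At the Python level, "unpacking a bound method" can be pictured roughly like this. It is only an illustration of the idea, since the interpreter does the equivalent in C. Greeter and call_unpacked are hypothetical names.

```python
class Greeter:
    def greet(self, name):
        return f"hello {name}"

def call_unpacked(method, *args):
    """Call a bound method by splitting it into function + self first."""
    func = method.__func__      # the underlying function
    self_obj = method.__self__  # the object the method is bound to
    return func(self_obj, *args)

bound = Greeter().greet
# Both spellings end up making the same underlying call.
print(bound("world") == call_unpacked(bound, "world"))
```

Whichever of the two objects gets reported in the call event, the matching return event has to name the same one.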
#122177 handles the @gaogaotiantian do you want to backport #122072 to 3.13 and 3.12 and we'll do the more intrusive fix for 3.14 onwards?
Okay I can backport #122072 to 3.13 and 3.12.
…c function (GH-122072) Log call events in sys.setprofile when it is a method with a C function.
… with c function (pythonGH-122072) Log call events in sys.setprofile when it is a method with a C function. (cherry picked from commit e91ef13) Co-authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Oops, Greg merged the PR. I thought Mark meant to rebase the change to 3.13 and leave 3.14 alone so I left it there to work on it later. I accidentally requested a lot of reviews because I naively tried to re-target the PR to 3.13. However, I don't think it's that bad. We can still have the more reasonable 3.14 fix in the interpreter instead of in
… with c function (pythonGH-122072) Log call events in sys.setprofile when it is a method with a C function.
…or is consistent with CALL (GH-122177)
Bug report
Bug description:
Lately, I've been testing AI code on various Python interpreters and their corresponding profilers across multiple platforms. After multiple attempts, I've noticed that CPython profilers consistently fail to analyze the following code.
I've been testing AI code on different Python interpreters and their respective profilers across multiple platforms. Specifically, I've been working with a simple TensorFlow+Keras program that classifies images of digits. Interestingly, I found that the code works well with IntelPython, whose latest version is based on Python 3.9.19. When I tested the code on multiple versions of CPython, I noticed that the profiler works well and returns information for CPython versions earlier than 3.12.1. However, from CPython 3.12.1 onward, the code crashes with an error.
I ran the code on my laptop, which has a Tiger Lake architecture and no NVIDIA GPU or Tensor Cores. As I didn't recompile the TensorFlow library, warning messages related to the lack of AVX512 and GPU acceleration are expected.
Test Environment:
WSL2 Ubuntu 20.04
- Python 3.12.4
Manjaro
- Python 3.9.19
- Python 3.10.14
- Python 3.11.9
- Python 3.12.0
- Python 3.12.1
- Python 3.12.2
- Python 3.12.3
- Python 3.12.4
Fedora
- Python 3.12.4
For Python versions prior to 3.12.1, I only received warning messages, and the profiler worked as expected. However, since upgrading to 3.12.1, I've started encountering AssertionError failures. Interestingly, I've compared the profile.py file between versions 3.12.0 and 3.12.1, and they appear to be identical. It's possible that the introduction of PEP 695 in Python 3.12 is causing this occasional error.
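The TensorFlow script itself is not included in this text. As a rough sketch of how a function is run under the pure-Python profile module (the one whose profile.py is compared above), here is a stdlib-only stand-in. The workload function is hypothetical and much simpler than the Keras code described, so it is not expected to trigger the failure by itself.

```python
import io
import profile
import pstats

def workload():
    # Hypothetical stand-in for the TensorFlow script, stdlib only.
    data = [str(i) for i in range(1000)]
    total = 0
    for s in sorted(data):
        total += len(s)
    return total

profiler = profile.Profile()
result = profiler.runcall(workload)  # run workload() under the profiler

# Collect a short report, sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(result)
print(out.getvalue())
```

Replacing workload with a function that runs the model is how the report above would be reproduced on a given interpreter version.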
While waiting for your response, I wish you a good day.
Aaron SU
CPython versions tested on:
3.9, 3.10, 3.11, 3.12
Operating systems tested on:
Linux, Windows
Linked PRs
CALL
#122177