DynamicShapeDetector with trie implementation. #7918

Merged: 10 commits into master on Sep 4, 2024

Conversation

ysiraichi
Collaborator

This PR finishes the implementation started in #7817: it implements the DynamicShapeDetector using a trie. The detector is used for detecting different traces (i.e. different sequences of called PyTorch operations), which we assume are caused by dynamic shapes.

How does it work?

  • Wrap a function with a _XLAC._dynamic_shape_detector_start_session call at the beginning and a _XLAC._dynamic_shape_detector_end_session call at the end
  • When the function is run, the detector starts keeping track of the created IR nodes
  • If, at the N-th call, we trace a sequence of operations different from all previous calls, we increment the number of traces
  • If the number of traces becomes greater than max_allowed_traces_per_function, we raise an error (see the sketch below)
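
As a concrete illustration, here is a minimal Python sketch of that wrapping. The binding names are the ones this PR adds, but their exact signatures (e.g. whether _dynamic_shape_detector_end_session takes the session id) are assumptions:

import torch_xla

def run_with_detection(f, current_id, *args):
    # Start a detection session before tracing the function...
    torch_xla._XLAC._dynamic_shape_detector_start_session(current_id)
    try:
        return f(*args)
    finally:
        # ...and end it afterwards, even if tracing raised an error.
        torch_xla._XLAC._dynamic_shape_detector_end_session()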

Implementation Details:

  • Build the trie incrementally, with the help of TrieBuilder (which holds the current tracing state)
  • At every traced operation, we update the TrieBuilder (similar to transitioning between the states of a DFA)
  • If, at any point, we have to modify the actual TrieNode, it means we are recording a new trace, i.e. a new sequence of operations (see the sketch below)
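
Below is a minimal Python sketch of the data structure. It is illustrative only: the real implementation is in C++, and each node also stores a common sequence of operations, as the error output further down shows:

class TrieNode:
    def __init__(self):
        # Maps the next traced operation to the corresponding child node.
        self.children = {}

class TrieBuilder:
    """Tracks the current position in the trie, like the state of a DFA."""

    def __init__(self, root, max_allowed_traces):
        self.root = root
        self.node = root
        self.max_allowed_traces = max_allowed_traces
        self.num_traces = 0
        self.diverged = False

    def start_session(self):
        # Every session walks the trie again from the root.
        self.node = self.root
        self.diverged = False

    def step(self, op):
        child = self.node.children.get(op)
        if child is None:
            if not self.diverged:
                # First time this session modifies the trie: a new trace.
                self.diverged = True
                self.num_traces += 1
                if self.num_traces > self.max_allowed_traces:
                    raise RuntimeError(
                        'Maximum number of different traces allowed '
                        f'per function exceeded: {self.max_allowed_traces}')
            child = TrieNode()
            self.node.children[op] = child
        self.node = child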

cc @miladm @JackCaoG

@ysiraichi ysiraichi requested a review from JackCaoG August 27, 2024 23:05
@ysiraichi
Collaborator Author

@JackCaoG When I was writing the tests for this PR, I thought that torch_xla.compile could return a class instance (in a future PR), so that we can track the number of recorded traces for each function. What do you think?
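
A sketch of that idea (everything here is hypothetical, including the _dynamic_shape_detector_get_num_traces binding):

import torch_xla

class CompiledFunction:
    """Hypothetical return value of torch_xla.compile."""

    def __init__(self, f, current_id):
        self._f = f
        self._current_id = current_id

    def __call__(self, *args, **kwargs):
        return self._f(*args, **kwargs)

    @property
    def num_traces(self):
        # Hypothetical binding: query the C++ detector for the number of
        # traces recorded in this function's session.
        return torch_xla._XLAC._dynamic_shape_detector_get_num_traces(
            self._current_id)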

// here, so that we can correctly return the builder to the root of the
// trie.
//
// TODO(ysiraichi): we should actually rollback this trace.
Collaborator Author

As is, it should work as we expect. However, a better approach (for a future PR) would be to roll back the changes introduced since the last session start.

Collaborator

If we hit an error, the program will likely just end, so maybe it is fine.

@JackCaoG
Collaborator

Thanks, I will try to take a look today.

f: Optional[Callable] = None,
full_graph: Optional[bool] = False,
name: Optional[str] = None,
detect_dynamic_shape=False,
Collaborator

I think instead of a boolean value detect_dynamic_shape, we can have it be something like max_dynamic_shape_graph_allowed; then you can map it to _dynamic_shape_detector_set_max_allowed_traces directly.
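
Roughly (a sketch of the suggestion, not the final API):

import torch_xla

def compile(f=None, full_graph=False, name=None,
            max_dynamic_shape_graph_allowed=None):
    if max_dynamic_shape_graph_allowed is not None:
        # Map the user-facing limit straight to the detector binding.
        torch_xla._XLAC._dynamic_shape_detector_set_max_allowed_traces(
            max_dynamic_shape_graph_allowed)
    ...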

Collaborator Author

What do you think about allowed_traces (because we can have multiple graphs when tracing a function once)?

@@ -125,6 +129,8 @@ def foo2(x):
elif hasattr(f, '__str__'):
name = f.__str__()

current_id = uuid.uuid4().__str__()
Collaborator

I thought about it a bit and found this is not ideal. This way, if we do

def f():
  xxx   

torch_xla.compile(f)
torch_xla.compile(f)

we will get two uuids. I think we should try to hash the passed-in function pointer so we can dedup.

Collaborator

I think we should use the function pointer if it is not None (currently, when it is used as a decorator with @, the fn will be None, I think).
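
A sketch of the dedup (the make_current_id helper is illustrative; per the summary further down, the PR ends up deriving current_id from the function's name and id):

import uuid

def make_current_id(f, name):
    if f is not None:
        # Same function object => same session id, so compiling the same
        # function twice reuses one session instead of two uuids.
        return f'{name}_{id(f)}'
    # No function available (e.g. context-manager usage): fresh uuid.
    return str(uuid.uuid4())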

self._run_and_compare(foo, optfoo, args=(inp1,))

msg = """\
torch_xla/csrc/dynamic_shape_detector.cpp:47 : Maximum number of different traces allowed per function exceeded: 1
Collaborator

cpp:47 is too specific; we can remove the line number and file name from the check to make it more general.
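
For example (a sketch; the test's actual assertion may differ):

# Match only the message body, not the file name and line number.
msg = 'Maximum number of different traces allowed per function exceeded: 1'
self.assertIn(msg, str(e))  # e: the RuntimeError raised by the detector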

ostr << " - " << pair.second->common_sequence_.front().str << std::endl;
}
}

Collaborator

I think we should also dump the current Python trace, which would help people figure out what to fix. See https://github.com/pytorch/xla/blob/master/torch_xla/csrc/debug_util.cpp#L125-L131 for an example.

Collaborator Author

Here's an example running test_trace_limit_exceeded_common_sequence_mismatch:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/local/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/local/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "xla/test/test_dynamic_shapes_detector.py", line 104, in test_trace_limit_exceeded_common_sequence_mismatch
    self._run_and_compare(foo, args=(inp, 2), allowed_traces=allowed_traces)
  File "xla/test/test_dynamic_shapes_detector.py", line 20, in _run_and_compare
    optout = optf(*args)
  File "/usr/local/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "xla/test/test_dynamic_shapes_detector.py", line 93, in foo
    return x * 5
RuntimeError: torch_xla/csrc/dynamic_shape_detector.cpp:41 : Maximum number of different traces allowed per function exceeded: 1
Got: [] aten::mul, xla_shape=f32[10]{0}, dynamic_dims: ()
Expected: [] aten::add, xla_shape=f32[10]{0}, dynamic_dims: ()

We already have the python trace, since this check is being done incrementally, every time a new IR node is created.

Collaborator

Ok, this is nice. I guess we can go a step further: for every trie node, we also store the Python stack trace for the current node. This way, when we raise the runtime error, we can also show the Python stack we expected versus where we actually ended up. It would be easier to debug this way.

We can implement this as a follow-up, though.
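
Sketching that follow-up on top of the earlier TrieNode sketch (hypothetical):

import traceback

class TrieNode:
    def __init__(self):
        self.children = {}
        # Python stack captured when this node was first created, so the
        # error can show where the expected trace came from as well as
        # where the mismatch happened.
        self.stack_trace = ''.join(traceback.format_stack())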

@JackCaoG left a comment (Collaborator)

Implementation LGTM; minor comments on the UX.

@ysiraichi
Collaborator Author

Summary of the changes:

  • If a function is given, make its current_id dependent on its name and id
  • Add API for removing C++ session entries: DynamicShapeDetector::RemoveSessionIfExists
  • Keep track of the compiled functions that are still alive (see the sketch after this list)
    • Different local-scoped functions with the same name and id may exist
  • Add allowed_traces optional parameter + documentation
  • Remove file and line information from the expected error messages
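
A sketch of the liveness bookkeeping (illustrative only; the Python binding for DynamicShapeDetector::RemoveSessionIfExists is an assumption):

import weakref

import torch_xla

_alive = {}

def _track(f, current_id):
    # Several local-scoped functions may share the same name and id, so
    # count them and only drop the session when the last one dies.
    _alive[current_id] = _alive.get(current_id, 0) + 1
    weakref.finalize(f, _untrack, current_id)

def _untrack(current_id):
    _alive[current_id] -= 1
    if _alive[current_id] == 0:
        # Assumed binding for DynamicShapeDetector::RemoveSessionIfExists.
        torch_xla._XLAC._dynamic_shape_detector_remove_session_if_exists(
            current_id)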

@miladm miladm added the dynamism Dynamic Shape Features label Sep 3, 2024
@ysiraichi
Collaborator Author

@JackCaoG This PR is ready for another round of reviews. Could you take a look at it?

f: Optional[Callable] = None,
full_graph: Optional[bool] = False,
name: Optional[str] = None,
allowed_traces: Optional[int] = None,
Collaborator

I think traces is too much of an implementation detail. Since we already have the full_graph above, how about num_different_graph_allowed?

Collaborator Author

Wouldn't that be kind of confusing, leading the user to think of the number of HLO graphs?

Collaborator

It is kind of true, right? Here we are detecting how many different IR graphs are being traced, and that almost always translates to the number of different HLO graphs.

Collaborator Author

Ok. I couldn't think of anything better, so I went with your suggestion.

Collaborator Author

Hmm. That's not the case if we have a fallback operation in the middle, is it?

Collaborator

Oh, I see. I generally expect the user to also set full_graph to True. I guess trace is more correct, but I just find it exposes too many underlying implementation details.

Comment on lines +181 to +183
torch_xla._XLAC._dynamic_shape_detector_set_max_num_different_graphs_allowed(
num_different_graphs_allowed)
torch_xla._XLAC._dynamic_shape_detector_start_session(current_id)
Collaborator

I feel like it is better to register the session with num_different_graphs_allowed outside of _compile, and in here we just need to start the session. We can do that in a follow-up.
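
Something like the following (the registration binding here is hypothetical):

# At wrapper-creation time, register the session with its limit once:
torch_xla._XLAC._dynamic_shape_detector_register_session(
    current_id, num_different_graphs_allowed)

# Then, per call inside _compile, only:
torch_xla._XLAC._dynamic_shape_detector_start_session(current_id)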

@ysiraichi ysiraichi merged commit 400bd91 into master Sep 4, 2024
27 checks passed