
[Frontend] Add Span filling for frontends to Relay #9723

Merged
merged 5 commits into apache:main on Dec 28, 2021

Conversation

@chunit-quic (Contributor) commented Dec 13, 2021

  • Add a common span filling feature for TF1/2, TFLite, and PyTorch.
  • Add test cases for span filling in each frontend.
  • Expose Tuple and TupleGetItem spans to the Python end.

Hi community,

Here is a pull request that adds span filling for the frontend -> Relay conversion
(frontends: TF 1 and 2, TFLite, and PyTorch).
This feature helps users track the conversion more precisely.
I would like to describe how it works and its current status below. :D

  1. One-to-many conversion
    First, although there is a set_span function for TensorFlow and TensorFlow 2, some spans are still missing from time to time.
    One of the reasons is that an op conversion might be a one-to-many conversion.
    In this situation the intermediate ops end up with empty span strings.
    Take the pack conversion for example: several expand_dims ops may be added before the concatenate.
    By running an ExprMutator over the expression each time after an op is converted, we obtain fully span-tagged Relay IR (a simplified sketch of such a mutator follows the example below).

Here is a simple example. Before this change, the test case in this patch (tensorflow/test_forward.py:320) is converted to the following Relay expressions:

def @main(%input: Tensor[(?, ?, 3, 1), float32]) {
  %113 = shape_of(%input, dtype="int32") /* Shape */;
  %114 = strided_slice(%113, begin=[0], end=[1], strides=[1], axes=None);
  %115 = squeeze(%114) /* strided_slice */;
  %116 = expand_dims(%115, axis=0);
  %117 = expand_dims(3, axis=0);
  %118 = expand_dims(3, axis=0);
  %119 = (%116, %117, %118);
  %120 = concatenate(%119) /* stack */;
  dyn.reshape(%input, %120, newshape=[]) /* output */
}

With this patch we can obtain the following format.

def @main(%input: Tensor[(?, ?, 3, 1), float32]) {
  %0 = shape_of(%input, dtype="int32") /* Shape */;
  %1 = strided_slice(%0, begin=[0], end=[1], strides=[1], axes=None) /* strided_slice_PART_0 */;
  %2 = squeeze(%1) /* strided_slice */;
  %3 = expand_dims(%2, axis=0) /* stack_PART_0 */;
  %4 = expand_dims(3, axis=0) /* stack_PART_1 */;
  %5 = expand_dims(3, axis=0) /* stack_PART_2 */;
  %6 = (%3, %4, %5) /* stack_PART_3 */;
  %7 = concatenate(%6) /* stack */;
  dyn.reshape(%input, %7, newshape=[]) /* output */
}

(Thanks to @lixiaoquan's advice; keeping the suffix-free span attached to the last expression of the group seems better. :D)
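For reference, here is a minimal sketch of how such a span-filling mutator can be written. The class name SpanFiller and the "_PART_" suffix follow this patch, but the body below is illustrative only, not the exact implementation (for brevity it only handles Call nodes, while the patch also covers Tuple and others):

import tvm
from tvm import relay
from tvm.relay.expr_functor import ExprMutator

class SpanFiller(ExprMutator):
    """Tag span-less sub-expressions produced by a one-to-many
    conversion with "<node_name>_PART_<n>" (illustrative sketch)."""

    def __init__(self, node_name, suffix_str="_PART_"):
        super().__init__()
        self._node_name = node_name
        self._suffix_str = suffix_str
        self._counter = 0

    def _next_span(self):
        name = "{}{}{}".format(self._node_name, self._suffix_str, self._counter)
        self._counter += 1
        return tvm.ir.Span(tvm.ir.SourceName(name), 0, 0, 0, 0)

    def visit_call(self, call):
        new_call = super().visit_call(call)
        if new_call.span is None:
            # Rebuild the call with a derived span attached.
            new_call = relay.Call(
                new_call.op, new_call.args, new_call.attrs,
                new_call.type_args, self._next_span()
            )
        return new_call

A frontend would then run something like SpanFiller("stack").visit(converted_expr) right after converting each op.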

  2. Span naming for each frontend
    2.1. TensorFlow (1 and 2) naming: kept the same as before.
    2.2. TFLite naming: a combination of the op's position index and its output tensor name(s).
    The op position index alone is enough to map back to the TFLite model,
    and the output tensor name should be helpful when the user searches for the op in Netron.
    2.3. PyTorch naming: because PyTorch provides two kinds of graph, jit._trace.TopLevelTracedModule and _C.Graph, two key attributes, kind() and scopeName(), are recorded in a span (a small naming sketch follows this item).
    scopeName() is used to map a Relay expression back to its original PyTorch module part in jit._trace.TopLevelTracedModule and _C.Graph.
    Combined with kind(), the position of a node can be precisely located in _C.Graph.
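As an illustration of the PyTorch naming scheme (the helper name and the exact string format here are assumptions for exposition, not the code in this patch):

def make_torch_span_name(torch_node):
    """Hypothetical helper: build a span name from a torch._C.Node."""
    kind = torch_node.kind()        # e.g. "aten::relu"
    scope = torch_node.scopeName()  # e.g. "__module.lstm"; may be empty
    return "{}:{}".format(scope, kind) if scope else kind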

  3. Limitations
    3.1. A few models in test_functional_models.py are still under investigation.
    3.2. At the end of the TFLite conversion, a Tuple expression is added if there is more than one output. This Tuple will not have any span.
    3.3. Note that some conversions, like aten::to in PyTorch, might produce a Python built-in float instance; its node information is simply dropped.

  4. Trivia
    Several test cases are attached; they should serve as a quick verifier for review.

Thank you for reading. Any comment is appreciated. :)


* Add a common span filling feature for tf1/2, tflite and pytorch.
* Add test case for Span filling in each frontend.
* Expose Tuple and TupleGetItem to python end
@lixiaoquan (Contributor)

I just feel that in the one-to-many case, the original tensor/layer name should be attached to the last node in the group, because that is where the computational result (and tensor type) matches between the original graph and the Relay IR. We may need to find the last node of a group frequently, and keeping the original layer's name there makes that easier.

For example, an LSTM can become thousands of nodes in Relay IR. If its name is not attached to the last one, we would have to search layer_DERIVED_xxxx many times to find the end.

def @main(%input: Tensor[(?, ?, 3, 1), float32]) {
  %10 = shape_of(%input, dtype="int32") /* Shape */;
  %11 = strided_slice(%10, begin=[0], end=[1], strides=[1], axes=None) /* strided_slice_PART_0 */;
  %12 = squeeze(%11) /* strided_slice */;
  %13 = expand_dims(%12, axis=0) /* stack_PART_0 */;
  %14 = expand_dims(3, axis=0) /* stack_PART_1 */;
  %15 = expand_dims(3, axis=0) /* stack_PART_2 */;
  %16 = (%13, %14, %15) /* stack_PART_3 */;
  %17 = concatenate(%16) /* stack */;
  dyn.reshape(%input, %17, newshape=[]) /* output */
}

@chunit-quic (Contributor, Author) commented Dec 13, 2021

Hi @lixiaoquan,

That is good advice and easy to implement. :D
If tagging the original name onto the final expression is the better convention for users, I can change to this approach after collecting more comments from reviewers.

* Fix lint errors
* Change default string of scope_part in Pytorch
* Reorder the span position for one to many conversion
@mbs-octoml (Contributor) left a comment

Nice one, thanks for this! We are gradually improving how we flow spans through passes so that all this hard-won debug info is not immediately lost, so your work will pay even more dividends in the future.

Just some nits.

class SpanFiller(ExprMutator):
    """SpanFiller"""

    def __init__(self, node_name, surfix_str="_PART_"):
Contributor

nit: suffix_str

Contributor Author

Fixed

@@ -389,12 +389,21 @@ Doc RelayTextPrinter::VisitExpr_(const TupleNode* op) {
   if (op->fields.size() == 1) {
     doc << ",";
   }
-  return doc << ")";
+  doc << ")";
+  if (op->span.defined()) {
Contributor

nit: can you leave a warning comment that we'll probably need to protect this with some kind of 'include_spans' or 'verbose' printer flag? But at this stage I'm happy to have them all!

Contributor

Would you be up for doing the span suffix printing in the VisitExpr override? I think we might as well do it for all the node types uniformly.

Member

Ideally this should be replaced by the different printer that I described to you the other day, IIRC. I think we should bring back the old printing mode via an implementation of a Renderer.

Contributor Author

Hi @mbs-octoml,
Thank you for reviewing and giving this PR positive feedback! :D

About the comments in this part, please kindly correct me if I am wrong.

nit: can you leave a warning comment that we'll probably need to protect this with some kind of 'include_spans' or 'verbose' printer flag

If I did not misunderstand, it would be nice to have a flag to control span printing (after all, sometimes it will be super long).
In the latest commit I added a bool flag, true by default, to control it (src/printer/text_printer.h:116).
Although an empty "/* */" is still printed with this implementation... is it basically what we want?

would you be up for doing the span suffix printing in the VisitExpr override?

About this part, do you mean adding span printing to the printers that currently lack it, like ScalarLiteral, PrintExpr, and VisitExpr_(const IfNode* op)?
If so, I did try to browse them at first, yet it seems hard to track and verify them comprehensively at a glance, since sometimes we even need to check their C++ global registrations and the Python end.
If it is fine with you, perhaps one more PR for this enhancement would be better? I could also think about which test cases would help me verify those printers. :D

Contributor Author

Hi @jroesch

It sounds like there is a more suitable printer for this part.
Would you mind sharing that feature with me? Sorry that I just followed the existing format without checking more carefully.
Once we have a conclusion about which one should be used this time, I will modify my current code. :D

 * nit fixed
 * Add a bool flag to control print span
 * refactor pytorch get span to a briefer way
* Add one more condition for SpanFiller
* Refine the format for those pytorch nodes without scopeName
@chunit-quic (Contributor, Author) commented Dec 19, 2021

Hi @mbs-octoml, @jroesch,

Just a gentle ping. Should I modify anything more, or is it fine to be merged? Thanks :)

@chunit-quic (Contributor, Author)

Hi @mbs-octoml, @jroesch,
Hope you guys had a great vacation! Just a gentle ping again. :D

Hi @masahi,
I noticed you have merged a lot of PRs recently in the closed PR section. Would you mind taking a look at this one as well, please? :)

@FrozenGene (Member)

I like this PR; it could let us support heterogeneous execution in a more fine-grained way. Thanks @chunit-quic

@FrozenGene (Member)

As @mbs-octoml approved, ideally we could merge it now. However, we have one unresolved comment. @chunit-quic, do you want to file a new PR, or do you want to resolve it in this one?

@chunit-quic (Contributor, Author) commented Dec 28, 2021

Hi @FrozenGene,
Thanks for your positive feedback! :D

However, we have one unresolved comment,

Currently I would prefer to merge this one first if possible.
The unresolved part you mean should be the discussion with @jroesch and @mbs-octoml in src/printer/relay_text_printer.cc.
Since I have not received further replies from them, to the best of my knowledge it might be better to reach (or modify) what we want in one more PR, such as the formal printer, or collecting and exposing the spans that are hidden now.

@FrozenGene FrozenGene merged commit ce108c1 into apache:main Dec 28, 2021
@FrozenGene (Member)

Thanks @chunit-quic @mbs-octoml @jroesch, it is merged now. @chunit-quic, let us make new PRs to resolve the remaining discussion.

@chunit-quic (Contributor, Author)

Thanks @FrozenGene. Sure thing! :D
Once we get more precise information from @mbs-octoml and @jroesch, we can start on that work.

@mbs-octoml (Contributor)

Hi @chunit-quic, sorry the line went dead over the break, glad to see this merged. I don't have any outstanding requests for you.

ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
* [Frontend] Add Span filling for frontends to Relay

* Add a common span filling feature for tf1/2, tflite and pytorch.
* Add test case for Span filling in each frontend.
* Expose Tuple and TupleGetItem to python end

* [Frontend] Add Span filling for frontends to Relay

* Fix lint errors
* Change default string of scope_part in Pytorch
* Reorder the span position for one to many conversion

* [Frontend] Add Span filling for frontends to Relay

 * nit fixed
 * Add a bool flag to control print span
 * refactor pytorch get span to a briefer way

* [Frontend] Add Span filling for frontends to Relay

* Add one more condition for SpanFiller
* Refine the format for those pytorch nodes without scopeName

* [Frontend] Add Span filling for frontends to Relay

* Fix lint
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
@rebel-shshin (Contributor) commented Jan 25, 2022

@chunit-quic @FrozenGene @mbs-octoml
Hi guys, I think I found a bug when converting a PyTorch LSTM layer to a Relay graph.
The LSTM layer appears twice in the converted Relay graph even though I have only one LSTM layer.
This weird behavior goes away if I comment out the following two lines in python/tvm/relay/frontend/pytorch.py:

span_str, empty_counter = self._get_torch_span(op_node, empty_counter)
relay_out = set_span(relay_out, span_str)

Do you have any idea?

@chunit-quic (Contributor, Author)

Hi @rebel-shshin,
I'm surprised by the LSTM case.
After checking test_lstm.py in the pytorch folder, the output IR graph does change with the set_span mutator.

Hi @FrozenGene, @mbs-octoml,
Since it might take me a while to investigate, perhaps it would be better to revert this change? I am preparing a PR that reverts this whole PR, and I will submit it if the reversion is OK with you.
Thank you!

@FrozenGene (Member)

Hi @rebel-shshin, I'm surprised by the LSTM case. After checking test_lstm.py in the pytorch folder, the output IR graph does change with the set_span mutator.

Hi @FrozenGene, @mbs-octoml, since it might take me a while to investigate, perhaps it would be better to revert this change? I am preparing a PR that reverts this whole PR, and I will submit it if the reversion is OK with you. Thank you!

OK. Please submit the revert PR.

chunit-quic pushed a commit to chunit-quic/tvm that referenced this pull request Jan 26, 2022
…)"

Because of the failure of LSTM conversion from Pytorch
This reverts commit ce108c1.
@chunit-quic (Contributor, Author) commented Jan 26, 2022

Thank you @FrozenGene. For your reference, here is the reversion PR: :)
#10072

@chunit-quic (Contributor, Author) commented Jan 26, 2022

Hi @rebel-shshin,
Pardon me for forgetting to confirm the details with you.
The following snapshot is what I get from the single-LSTM-layer result of test_lstm.py.

[screenshot: Relay IR with span filling (left) vs. without (right)]

On the left-hand side, with span filling, four more expressions pop out:
two more tuples (%36, %37) appear in the while loop, and a Nil (%44) followed by %45 = %39(0, %44, %states, %input), which is the LSTM body.

Is it the same as what you get? If not, would you mind sharing your model and conversion file with me? Thank you. :)

Mousius pushed a commit that referenced this pull request Jan 26, 2022
…10072)

Because of the failure of LSTM conversion from Pytorch
@rebel-shshin (Contributor) commented Jan 27, 2022

Hi @chunit-quic, thanks for the fast reaction.

My case is a bit different from yours. My network has a single LSTM layer with some dense layers before and after it. The model returns three outputs: the output, cell state, and hidden state of the LSTM. However, the converted Relay graph has two LSTM layers: the first produces the output, and the second produces the cell and hidden states. This is really weird because all of them should be generated from the same LSTM layer.

@chunit-quic (Contributor, Author)

Hi @rebel-shshin,

Thank you for the detailed information. I have found some clues but still need some time to pinpoint the problem precisely. I will try to make a test case similar to yours and give it a try; a sketch of such a case follows. :D
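As a starting point, a minimal repro sketch of the setup @rebel-shshin described might look like this. It is an assumption based on the description above, not the actual model, and the surrounding dense layers are omitted for brevity:

import torch
from tvm import relay

class LSTMModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size=8, hidden_size=16)

    def forward(self, x):
        # Return all three results, as in the reported model.
        output, (hidden, cell) = self.lstm(x)
        return output, cell, hidden

model = LSTMModel().eval()
inp = torch.randn(4, 1, 8)  # (seq_len, batch, input_size)
traced = torch.jit.trace(model, inp)
mod, params = relay.frontend.from_pytorch(traced, [("input", list(inp.shape))])
print(mod["main"])  # inspect whether the LSTM body is duplicated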

sunggg pushed a commit to sunggg/tvm that referenced this pull request Jan 29, 2022
…)" (apache#10072)

Because of the failure of LSTM conversion from Pytorch
ylc pushed a commit to ylc/tvm that referenced this pull request Feb 16, 2022
…)" (apache#10072)

Because of the failure of LSTM conversion from Pytorch
ghost pushed a commit to neo-ai/tvm that referenced this pull request Feb 21, 2022
…)" (apache#10072) (#246)

Because of the failure of LSTM conversion from Pytorch

Co-authored-by: Chun-I Tsai <quic_chunit@quicinc.com>