Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tuning] Allow multiprocessing spawn to work (on macOS llvm at least) #8363

Merged
merged 9 commits into from
Jul 1, 2021

Conversation

AndrewZhaoLuo
Copy link
Contributor

@AndrewZhaoLuo AndrewZhaoLuo commented Jun 28, 2021

With this we can use the default (and safer) multiprocessing method "spawn" for macOS and Windows rather than "fork" for tuning. More details here: https://discuss.tvm.apache.org/t/why-is-auto-tuning-with-resnet-failing-at-task-1/10316/12

cc @hogepodge @tkonolige @leandron

@AndrewZhaoLuo
Copy link
Contributor Author

AndrewZhaoLuo commented Jun 28, 2021

Still a draft, there still is a problem with src/target/target.cc construction.

Target::Target(Target target, Target host) {
  ObjectPtr<TargetNode> n = make_object<TargetNode>(*target.get());
  CHECK(!n->host.defined() || n->host.same_as(host))
      << "ValueError: Adding a host to a target whose host field has been defined target: "
      << n->host << " host: " << host << " ptr target: " << n.get()
      << " ptr host: " << make_object<TargetNode>(*host.get()).get();
  ;
  // add target host into host field
  n->host = std::move(host);
  data_ = std::move(n);
}

Check failed
Spawn breaks pointer equality which was the assumption. We now need a deep equality thing for "tvm::Target" I think.

@junrushao
Copy link
Member

Does it work if we use Target.export for serialization? This method converts a Target to a JSON-like dict and should preserve all the information.

Equality check is another big issue. I discussed with @comaniac a while ago, but haven’t got a conclusion yet when should two targets are considered “equal”: If one target has -libs=cudnn and the other doesn’t, are they equal to each other?

@comaniac
Copy link
Contributor

Equality check is another big issue. I discussed with @comaniac a while ago, but haven’t got a conclusion yet when should two targets are considered “equal”: If one target has -libs=cudnn and the other doesn’t, are they equal to each other?

Exactly. It seems to me that we ultimately need two APIs for target: one (i.e., ==) checks if two targets are exactly the same, and the other (.compatible(self, other)) checks if target A is compatible to target B. The problem is the definition of "compatible" targets. In my own experience, compatible is much more useful than ==, as in many cases, people care more about whether a model/schedule built with target A can be used in target B. Meanwhile, we can still have the equality check first for internal use cases like this one I guess.

@AndrewZhaoLuo
Copy link
Contributor Author

Hmm I don't understand a lot of the host target stuff. In this case I believe we need equality to be exact equality.

The problem you are describing could be another form of equality?

In any case me and @tkonolige might just turn off the check for now.

@AndrewZhaoLuo
Copy link
Contributor Author

@zxybazh thoughts with turning off this check?

@zxybazh
Copy link
Member

zxybazh commented Jun 29, 2021

Hi Andrew, just curious about the context, why in this case would we add a target host to a target object that already has a different host?

@AndrewZhaoLuo
Copy link
Contributor Author

AndrewZhaoLuo commented Jun 29, 2021

Hmm, I'll be honest, I don't quite understand the target/host part of tvm very well. I was hoping you could give context on this since you were the last person on git to touch the line. Specifically the proper usage of the commented out check.

This method appears in a lot of places https://github.com/apache/tvm/blob/main/python/tvm/target/target.py#L171. And the problematic line specifically is this one: https://github.com/apache/tvm/blob/main/python/tvm/target/target.py#L200

Before it worked since a majority of tested systems used fork() as the method for multiprocessing. macOS and windows by default use a different process which is causing this error. If I had to guess why this is the case, it is because macOS and windows serialize and deserialize arguments to a process which breaks the pointer equality assumption in the 2 arg constructor.

@tkonolige
Copy link
Contributor

@zxybazh The target host and the new host are functionally the same, but not the same object.

@zxybazh
Copy link
Member

zxybazh commented Jun 29, 2021

OK, in that case, is it possible that you explicitly clear the host field of the given target object and then construct it this way? Because fork seem to be a special case, and generally the host should always be consistent.

@AndrewZhaoLuo
Copy link
Contributor Author

OK, in that case, is it possible that you explicitly clear the host field of the given target object and then construct it this way? Because fork seem to be a special case, and generally the host should always be consistent.

Not from the python end it looks like.

@zxybazh
Copy link
Member

zxybazh commented Jun 29, 2021

OK, in that case, is it possible that you explicitly clear the host field of the given target object and then construct it this way? Because fork seem to be a special case, and generally the host should always be consistent.

Not from the python end it looks like.

Would target.host = None work?

@AndrewZhaoLuo AndrewZhaoLuo marked this pull request as ready for review June 29, 2021 21:01
@AndrewZhaoLuo
Copy link
Contributor Author

Ok folks ready for review. This works with llvm and metal on m1 mac with "spawn" multiprocessing enabled. I also added a test to make sure other multiprocessing tests work.

The key assumption is when invoking the Target::Target(Target target, Target host) constructor it's ok if target->host and host refer to the same underlying target. This appears to be the intention with the line n->host.same_as(host).

However spawn() breaks any pointer equality so we need to have some sort of deep equals. This does not exist so for now I'm just comparing the human readable strings instead.

@AndrewZhaoLuo
Copy link
Contributor Author

OK, in that case, is it possible that you explicitly clear the host field of the given target object and then construct it this way? Because fork seem to be a special case, and generally the host should always be consistent.

Not from the python end it looks like.

Would target.host = None work?

It does not actually modify the underlying object.

@AndrewZhaoLuo
Copy link
Contributor Author

I chatted offline with @zxybazh about this and he says the check is not needed.

Furthermore, the entire target system is going to revamped relatively soon.

Therefore I have elected to just remove the check making this PR hopefully less controversial.

Copy link
Member

@junrushao junrushao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks good to me in terms of Target-related updates

@junrushao
Copy link
Member

Hey @AndrewZhaoLuo, would you like to briefly summarize the lessons here on properly supporting macOS?

@AndrewZhaoLuo
Copy link
Contributor Author

AndrewZhaoLuo commented Jun 30, 2021

The main lesson here is if we use python we must use multiprocessing to get parallelism. If we use multiprocessing we have to assume sometimes we cannot use a direct fork() to get a subprocess.

There might be other bugs as if do not have a direct fork() as arguments to the children process need to be serialized and reserialized. This has two implications:

  • Pointer equality assumptions are broken -- even if two objects shared a pointer to a resource before they might not after serialization and reserialization. This one is annoying.
  • Things passed to subprocesses must be pickleable which they might not be. The solution is to make things pickleable like I did here by removing the anonymous function closure.

This also has implications for Windows which does not use the fork() - exec() model.

@masahi masahi merged commit 578f617 into apache:main Jul 1, 2021
@junrushao
Copy link
Member

Thank you @AndrewZhaoLuo for the awesome work!

lygztq pushed a commit to lygztq/tvm that referenced this pull request Jul 1, 2021
…apache#8363)

* go to callable class

* add some documentation and naming

* extend comment

* manually do logic to avoid bug with pointer comparison

* revert changes to light change, correct comment'

* more principled change, but also kind of hacky

* test other tuning methods

* remove check;

* jostle CI
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
…apache#8363)

* go to callable class

* add some documentation and naming

* extend comment

* manually do logic to avoid bug with pointer comparison

* revert changes to light change, correct comment'

* more principled change, but also kind of hacky

* test other tuning methods

* remove check;

* jostle CI
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Mar 4, 2022
…apache#8363)

* go to callable class

* add some documentation and naming

* extend comment

* manually do logic to avoid bug with pointer comparison

* revert changes to light change, correct comment'

* more principled change, but also kind of hacky

* test other tuning methods

* remove check;

* jostle CI
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants