This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Graph Partition API #15886

Merged
merged 24 commits into from
Sep 3, 2019


Contributor

@mseth10 mseth10 commented Aug 14, 2019

Description

This PR introduces enhanced Python and C APIs to trigger graph partitioning for a given backend. These APIs let the user optionally infer shapes, dtypes, and stypes before partitioning, and they allow optional arguments to be passed to the subgraph property.

This PR makes the following modifications to subgraph property:

  • Added the ability for the subgraph property to preprocess the graph and initialize any custom state that is required during partitioning.
  • Added the ability for the subgraph property to post-process the graph, for example to collect statistics on the quality of partitioning.
  • Enabled the subgraph property to reject creating a sub-optimal subgraph. For example, the subgraph property could reject subgraphs that contain only one operator.
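
To make the pre- and post-processing hooks above concrete, here is a minimal pure-Python sketch. It is a toy model only: the class ToySubgraphProperty, its pre_partition/select/post_partition methods, and the partition helper are hypothetical names chosen for illustration and are not the C++ subgraph property interface touched by this PR. The graph is simplified to a linear chain of operator names.

```python
# Toy model only: hypothetical names, not MXNet's C++ SubgraphProperty API.
class ToySubgraphProperty:
    def __init__(self, supported_ops, **options):
        self.supported_ops = set(supported_ops)
        self.options = options      # optional arguments passed by the caller
        self.excluded = set()       # custom state, filled in by pre_partition
        self.stats = {}             # filled in by post_partition

    def pre_partition(self, ops):
        # Initialize custom state once, before any node is selected.
        self.excluded = set(self.options.get("excluded_ops", []))

    def select(self, op):
        # Decide whether a single operator may join a subgraph.
        return op in self.supported_ops and op not in self.excluded

    def post_partition(self, ops, subgraphs):
        # Collect simple statistics on the quality of the partitioning.
        fused = sum(len(sg) for sg in subgraphs)
        self.stats = {
            "subgraphs": len(subgraphs),
            "fused_ops": fused,
            "total_ops": len(ops),
        }


def partition(ops, prop):
    """Group consecutive selected ops of a linear chain into subgraphs."""
    prop.pre_partition(ops)
    subgraphs, current = [], []
    for op in ops:
        if prop.select(op):
            current.append(op)
        elif current:
            subgraphs.append(current)
            current = []
    if current:
        subgraphs.append(current)
    prop.post_partition(ops, subgraphs)
    return subgraphs


ops = ["Convolution", "BatchNorm", "Activation", "Pooling", "FullyConnected"]
prop = ToySubgraphProperty(
    ["Convolution", "BatchNorm", "Activation", "FullyConnected"],
    excluded_ops=["BatchNorm"],
)
subgraphs = partition(ops, prop)
```

In this toy run, excluding BatchNorm splits the chain into single-operator groups, which is the kind of result a post-partition statistics hook would surface.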

We reuse the existing graph partitioning algorithm and control it using subgraph property classes. Since we’re modifying a symbol object, we add this API to the Symbol class. Users call optimize_for on an existing symbol object to create a new partitioned symbol.

Here is an example showing partitioning without args, which means infer shape/type will NOT be called before partitioning. It is similar to GetBackendSymbol.

import mxnet as mx
sym, args, aux = mx.model.load_checkpoint('resnet-50', 0)
sym = sym.optimize_for('default')
exe = sym.bind(ctx=mx.cpu(), args=args, aux_states=aux, grad_req='null')

Here is an example showing partitioning with args, which means infer shape/type is called before partitioning.

import mxnet as mx
sym, args, aux = mx.model.load_checkpoint('resnet-50', 0)
sym = sym.optimize_for('default', args=args, ctx=mx.cpu())
exe = sym.bind(ctx=mx.cpu(), args=args, aux_states=aux, grad_req='null')

Here is an example showing partitioning with kwargs to set a list of ops to be excluded from subgraphs.

import mxnet as mx
sym, args, aux = mx.model.load_checkpoint('resnet-50', 0)
sym = sym.optimize_for('default', args=args, ctx=mx.cpu(), excluded_ops=['BatchNorm'])
exe = sym.bind(ctx=mx.cpu(), args=args, aux_states=aux, grad_req='null')

Next steps:

  • Add example(s) and test(s) for excluded_ops passed as arguments to optimize_for as part of the PR on support for custom subgraph properties
  • Add functionality for users to pass shape/type/stype dicts as kwargs to optimize_for just like we have it in simple_bind

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@mseth10 mseth10 requested a review from szha as a code owner August 14, 2019 01:55
@pengzhao-intel
Contributor

@ZhennanQin, could you review this for compatibility? :)

@ZhennanQin
Contributor

What's the difference between sym.partition('default') and sym.get_backend_symbol('default')?

@mseth10 mseth10 force-pushed the partition_api branch 2 times, most recently from b6b5147 to 478c60a Compare August 14, 2019 23:47
@mseth10 mseth10 force-pushed the partition_api branch 3 times, most recently from 11944af to 3860dee Compare August 15, 2019 01:43
@samskalicky
Contributor

samskalicky commented Aug 15, 2019

@ZhennanQin Here in this PR we're trying to do the following things:

  1. Clarify the API/function names. Initially we wanted the API to be called "sym.partition", since we were going to partition the symbol. But given your latest subgraph API enhancements, which enable multiple subgraph properties to be specified (as a subgraph backend), we dropped this idea. From the user's point of view, the outcome is that they are "optimizing the symbol for a specific backend". So we now propose calling the API "optimizeFor", and it will take the name of the backend. For example: sym.optimizeFor("TensorRT") or sym.optimizeFor("MKLDNN") or sym.optimizeFor("EIA"). We expect that users could call this more than once for the same model. For example, users could first optimize for EIA (which would group supported nodes into subgraphs) and then optimize for MKLDNN for the remaining nodes that were not partitioned into subgraphs.

  2. We also want to enable users to pass more configuration options directly to the subgraph property to tailor the partitioning. One example is to provide a better experience for users to provide a blacklist (or list of operators to exclude from subgraphs). Another could be setting the minimum number of nodes allowed in a subgraph (e.g., 3).

  3. Enable the API to optionally perform shape/type propagation before partitioning. For example, some backends may only support operators if their inputs are a particular type (e.g., FP16 or Int8) or shape (2D, 3D, or 4D).

  4. Currently the subgraph properties are stateless. We want to enable the subgraph properties to have prepartition and postpartition APIs to do setup or cleanup around the actual partitioning call. This could be used to set up a whitelist of supported operators, or to pre-analyze the graph before partitioning begins and Select is called on the Selector.
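
As a rough illustration of items 1 and 3, the following pure-Python sketch chains two partitioning passes, with the first pass selecting nodes by dtype. Everything here is an assumption made for illustration: the node dictionaries, the backend names "backend_a" and "backend_b", and this toy optimize_for function are hypothetical and do not reflect how MXNet represents graphs or backends internally.

```python
# Toy model: node layout, backend names and predicates are hypothetical.
def optimize_for(nodes, backend, supports):
    """Return a new node list where runs of unclaimed, supported nodes are fused."""
    result, run = [], []

    def flush():
        # Close the current run of selected nodes into one subgraph node.
        if run:
            result.append({"op": "_subgraph", "backend": backend, "nodes": list(run)})
            run.clear()

    for node in nodes:
        # Nodes already fused by an earlier pass are never claimed again.
        if node["op"] != "_subgraph" and supports(node):
            run.append(node)
        else:
            flush()
            result.append(node)
    flush()
    return result


graph = [
    {"op": "Convolution", "dtype": "float16"},
    {"op": "Activation", "dtype": "float16"},
    {"op": "Cast", "dtype": "float32"},
    {"op": "FullyConnected", "dtype": "float32"},
]

# First pass: a backend that only accepts float16 inputs, which is why
# dtypes must be known (inferred) before partitioning.
step1 = optimize_for(graph, "backend_a", lambda n: n["dtype"] == "float16")
# Second pass: another backend claims nodes that are still unpartitioned.
step2 = optimize_for(step1, "backend_b", lambda n: n["op"] in {"FullyConnected", "Cast"})
```

The second call leaves the subgraph created by the first call untouched, which mirrors the "optimize for one backend, then another for the remaining nodes" usage described in item 1.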

In this PR we want to enable this functionality for users to more easily use partitioning. In a later PR we want to enable subgraph properties to be dynamically loaded from libraries.

@mseth10 mseth10 force-pushed the partition_api branch 2 times, most recently from 6fb556b to e1067da Compare August 16, 2019 02:15
@ZhennanQin
Contributor

@samskalicky Thanks for the explanation, it helps a lot in understanding this PR. Basically, I think sym.optimizeFor('MKLDNN') is the same as the current sym.get_backend_symbol('MKLDNN'), so I don't see the reason to duplicate it. Maybe we can directly rename get_backend_symbol to optimizeFor?

@samskalicky
Contributor

samskalicky commented Aug 16, 2019

@ZhennanQin Thanks for your comments, we'll try to clarify our thought process. Apologies for the lack of clarity in this PR. You are correct about the current state of the code that @mseth10 has committed so far. The PR is still WIP and more changes are coming, so let's not make decisions based on what is there today.

Instead let's focus on the items I mentioned before (they will get into the PR description before we attempt to merge, I promise! :-D). We will add more arguments to the optimizeFor API in the coming commits. We need to add arguments that enable us to do shape/type propagation prior to partitioning (so we can use shape/type info to select ops in subgraph properties), and we want to accept arbitrary options that we pass to the subgraph property for further configuration (e.g., blacklisted ops).

These new API changes will not be compatible with the current get_backend_symbol API. Since MXNet maintains semantic versioning for minor releases, we cannot change the API yet. So instead we'll create another API (optimizeFor) alongside get_backend_symbol for now.

@samskalicky
Contributor

samskalicky commented Aug 16, 2019

@mseth10 we also need to add the ability to reject creating a subgraph in the build_subgraph.cc file:

https://github.com/apache/incubator-mxnet/blob/e98fea3165670157090f2a2f644890452443803c/src/operator/subgraph/build_subgraph.cc#L574

We want to enable the subgraph property to return a null node to reject creating a subgraph. The partitioning pass has a decycle feature that may remove nodes that were selected when calling the select function on the subgraph property. So the subgraph may be smaller than anticipated, and we want the ability to not create subgraphs based on some criteria in the subgraph property (e.g., subgraph size too small).
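
The rejection idea can be sketched in plain Python as follows. This is a simplified model under stated assumptions: create_subgraph_node, apply_candidate, the removed_by_decycle argument, and the min_size option are all hypothetical names, and returning None stands in for the "null node" mentioned above. The real logic lives in the C++ partitioning pass and is not reproduced here.

```python
# Toy model: function names and the min_size option are hypothetical.
def create_subgraph_node(selected, removed_by_decycle, min_size=2):
    """Return a fused node, or None to reject the candidate subgraph."""
    # Decycling may have dropped some of the nodes that were selected.
    remaining = [n for n in selected if n not in removed_by_decycle]
    if len(remaining) < min_size:
        return None  # reject: the nodes stay in the graph unfused
    return {"op": "_subgraph", "nodes": remaining}


def apply_candidate(graph, selected, removed_by_decycle, min_size=2):
    """Replace the surviving selected nodes with one fused node, if accepted."""
    node = create_subgraph_node(selected, removed_by_decycle, min_size)
    if node is None:
        return list(graph)  # graph left unchanged
    fused = set(node["nodes"])
    out, inserted = [], False
    for n in graph:
        if n in fused:
            if not inserted:
                out.append(node)  # fused node takes the place of the first member
                inserted = True
        else:
            out.append(n)
    return out


graph = ["conv0", "bn0", "relu0", "fc0"]
# Three nodes selected, one removed by decycling: two remain, so the subgraph is kept.
kept = apply_candidate(graph, ["conv0", "bn0", "relu0"], removed_by_decycle={"relu0"})
# Two nodes selected, one removed: only one remains, so the subgraph is rejected.
rejected = apply_candidate(graph, ["conv0", "bn0"], removed_by_decycle={"bn0"})
```

The key design point is that the size check runs after decycling, since only then is the final membership of the candidate subgraph known.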

@mseth10 mseth10 force-pushed the partition_api branch 4 times, most recently from dfeeacf to 6082f41 Compare August 17, 2019 00:31
@ZhennanQin
Contributor

@samskalicky @mseth10 As this PR is still WIP, please ping me when it is ready for review. I agree with you that we need a more powerful API that takes dtype, shape, and ctx into consideration to make partitioning more accurate.

@samskalicky
Contributor

@ZhennanQin can you please approve? @mseth10 has added the "Next steps" in the PR description to save the outcome of our discussions from the PR comments.

Contributor

@ZhennanQin ZhennanQin left a comment

LGTM. Thanks for contributing!

@samskalicky
Contributor

@mxnet-label-bot remove [pr-awaiting-review]

@marcoabreu marcoabreu removed the pr-awaiting-review PR is waiting for code review label Sep 3, 2019
@samskalicky
Contributor

@mxnet-label-bot add [pr-awaiting-merge]

@marcoabreu marcoabreu added the pr-awaiting-merge Review and CI is complete. Ready to Merge label Sep 3, 2019
Contributor

@pengzhao-intel pengzhao-intel left a comment

Thanks for the great work on this :)

Merging now.

Labels
pr-awaiting-merge Review and CI is complete. Ready to Merge

8 participants