Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[ENV] update runtime setting default values #18987

Merged
merged 1 commit into from
Sep 7, 2020
Merged

Conversation

szha
Copy link
Member

@szha szha commented Aug 23, 2020

Description

update runtime setting default values for resource copies, mem pool type

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@mxnet-bot
Copy link

Hey @szha , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-cpu, windows-gpu, centos-gpu, unix-cpu, sanity, windows-cpu, miscellaneous, clang, website, edge, unix-gpu]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@szha szha marked this pull request as ready for review August 24, 2020 01:30
@szha
Copy link
Member Author

szha commented Aug 24, 2020

@sxjscience @zhreshold it would be great to have some data on how such defaults work for the cv and nlp models.

src/resource.cc Show resolved Hide resolved
if (type == nullptr)
type = "Naive"; // default pool
if (type == nullptr) {
type = "Round"; // default pool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's obvious that round can help speed up certain dynamic input workloads, but tends to oom more frequently. I suggest we much very cautious about changing the default to round, unless there's a good fallback for oom handling.

Copy link
Member Author

@szha szha Aug 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share the concern on this. This change will impact those static-size static-graph models that were at the boundary of GPU memory limit. Which of the current GluonCV model training scripts fall in this category?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My hope is of course to provide a good out-of-the-box usage experience to mxnet users. From what I observed, there seems to be more models with dynamic shape inputs than static ones, and many of the static-shape models can still run in this setting, hence the proposal.

@sxjscience
Copy link
Member

sxjscience commented Aug 25, 2020 via email

@szha
Copy link
Member Author

szha commented Aug 25, 2020

I think CNNs are generally static shape while models in NLP are generally dynamic shape.

I don't think we can generalize like this. For example, object detection and segmentation are based on CNN and are usually not static-shaped.

Do we have any plan for improving the memory usage?

Of course we do. I think @ArmageddonKnight is currently fixing some missed allocation entries in memory profiler, and plans on developing a memory usage visualization tool later this week to help narrow down the focus for memory optimization. We also intend to add mirror option to cached op to allow training for larger model.

Copy link
Member

@sxjscience sxjscience left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can approve from the NLP side. Because the workloads in NLP are usually dynamic so it's beneficial to have round memory management.

@szha
Copy link
Member Author

szha commented Aug 27, 2020

thanks for the reviews. @zhreshold how about setting the pool strategy to be naive when shapes are static?

@zhreshold
Copy link
Member

thanks for the reviews. @zhreshold how about setting the pool strategy to be naive when shapes are static?

This is actually fine when shapes are static, my major concern is that with round enabled by default, in most use cases mxnet can be faster but consumes more memory than expected

@szha
Copy link
Member Author

szha commented Aug 29, 2020

@zhreshold we could consider adding an interface to allocate the exact size and use it in cached op for static shape only.

@szha
Copy link
Member Author

szha commented Aug 30, 2020

I reverted the round pool change first to merge the rest of the changes. I will work on a cached op path to enable exact size allocation to avoid the memory waste in the static graph case.

@szha szha force-pushed the env_defaults branch 2 times, most recently from ff5a940 to 9267d11 Compare September 6, 2020 05:40
@szha szha merged commit 04e394a into apache:master Sep 7, 2020
@szha szha deleted the env_defaults branch September 7, 2020 02:39
@szha szha restored the env_defaults branch September 7, 2020 02:39
@szha szha deleted the env_defaults branch September 7, 2020 02:39
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants