Add knob (IGNITE_DISABLE_DISTRIBUTED_METRICS=1) to disable distributed metrics reduction #2895

Open. iXce wants to merge 1 commit into master.


iXce commented Mar 16, 2023

This is useful for setups where distributed training is used, but evaluation is only performed on a single node (or independently over multiple nodes).

Description: In some setups one may want two different fleets for distributed training and (distributed) validation; however, ignite currently conflates the two world sizes. Using this PR as an RFC before adding tests and extra documentation (if needed).
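
For readers skimming the thread, here is a minimal sketch of how an environment knob like this could gate a reduction step. It is not the PR's diff (which is not shown on this page): only the variable name comes from the PR title, while `distributed_metrics_disabled`, `maybe_reduce`, and `fake_all_reduce` are hypothetical names used for illustration.

```python
import os

# Name taken from the PR title; how the PR itself reads it is not shown here.
ENV_KNOB = "IGNITE_DISABLE_DISTRIBUTED_METRICS"


def distributed_metrics_disabled(environ=None):
    """Return True when the knob is set to "1" (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return env.get(ENV_KNOB, "0") == "1"


def maybe_reduce(value, reduce_fn, environ=None):
    """Apply the cross-worker reduction unless the knob disables it.

    `reduce_fn` stands in for a collective such as an all-reduce.
    """
    if distributed_metrics_disabled(environ):
        return value  # keep the local, per-worker value
    return reduce_fn(value)


def fake_all_reduce(value):
    """Stand-in for a sum over 4 workers that all hold the same value."""
    return value * 4


print(maybe_reduce(10, fake_all_reduce, environ={}))               # 40
print(maybe_reduce(10, fake_all_reduce, environ={ENV_KNOB: "1"}))  # 10
```

With the knob set, the metric value stays local to each worker, so a validation run on a single node never waits on workers that are not evaluating.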

Check list:

  • [?] New tests are added (if a new feature is added)
  • [?] New doc strings: description and/or example code are in RST format
  • [?] Documentation is updated (if required)

github-actions bot added the labels module: contrib, module: distributed, module: metrics on Mar 16, 2023
@sadra-barikbin
Collaborator

Thanks @iXce for the suggestion. We could also add a boolean attribute called compute_per_rank to Metric, which all metrics descend from, and pass it to a metric's constructor.
But could you please explain why you need to evaluate each rank separately?
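
As a rough illustration of the constructor-flag idea, the sketch below uses a toy class rather than ignite's `Metric`. Only the name `compute_per_rank` comes from the comment above; `MeanMetric`, the `all_reduce` argument, and `hanging_all_reduce` are assumptions made for this example.

```python
class MeanMetric:
    """Toy running-mean metric; this is not ignite's Metric class."""

    def __init__(self, all_reduce, compute_per_rank=False):
        # `all_reduce` is a stand-in callable that sums a value across workers.
        self._all_reduce = all_reduce
        self.compute_per_rank = compute_per_rank
        self._total = 0.0
        self._count = 0

    def update(self, value):
        self._total += value
        self._count += 1

    def compute(self):
        total, count = self._total, self._count
        if not self.compute_per_rank:
            # Default behaviour: combine partial sums from every worker.
            total = self._all_reduce(total)
            count = self._all_reduce(count)
        return total / count


def hanging_all_reduce(_value):
    # A real collective would block forever here; raising keeps the demo finite.
    raise RuntimeError("no other worker joined the collective")


metric = MeanMetric(hanging_all_reduce, compute_per_rank=True)
metric.update(1.0)
metric.update(3.0)
print(metric.compute())  # 2.0
```

Because the flag is fixed at construction time, switching between per-rank and reduced evaluation means rebuilding the metric.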

@iXce
Author

iXce commented Mar 16, 2023

> Thanks @iXce for the suggestion. We could also add a boolean attribute called compute_per_rank to Metric, which all metrics descend from, and pass it to a metric's constructor. But could you please explain why you need to evaluate each rank separately?

The most typical use case would be that we run distributed training but only run validation on the chief worker (because the validation set is small, or uses a different procedure than training that doesn't necessarily work with distributed training).

I was a bit puzzled initially to see that the ignite metrics automatically react to distributed training (in practice we saw them hang: they were waiting on allreduce(), but none of the other workers participated).

Regarding where to place the knob for this, it feels like a tough question: in a sense, one might want to be able to switch from a single-worker configuration to distributed validation and metrics without having to reconfigure the metrics, I think? So a constructor parameter to Metric may be somewhat cumbersome?

@vfdev-5
Collaborator

vfdev-5 commented Mar 16, 2023

@iXce thanks for the RFC! We also wanted to be able to override the metrics decorators responsible for reducing data: #1288. So you could override the decorator with a no-op for your use cases, or provide a DDP subgroup and reduce over that subgroup.
What do you think?
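
The two options mentioned here, a no-op and a reduction over a subgroup, can be sketched with a pluggable reducer. This is a simulation in plain Python, not the API proposed in #1288: `reduce_with`, `identity_reducer`, and `make_subgroup_reducer` are hypothetical names, and a dict of per-rank values stands in for real workers.

```python
import functools


def identity_reducer(value):
    """No-op reducer: keep each worker's local value."""
    return value


def make_subgroup_reducer(values_by_rank, ranks):
    """Return a reducer that sums only over `ranks` (a simulated subgroup)."""
    def reducer(_local_value):
        return sum(values_by_rank[rank] for rank in ranks)
    return reducer


def reduce_with(reducer):
    """Hypothetical decorator: pass the wrapped function's result through `reducer`."""
    def decorator(compute):
        @functools.wraps(compute)
        def wrapper(*args, **kwargs):
            return reducer(compute(*args, **kwargs))
        return wrapper
    return decorator


# Simulated local values held by four workers.
values_by_rank = {0: 1.0, 1: 2.0, 2: 4.0, 3: 8.0}


@reduce_with(identity_reducer)
def local_only():
    return values_by_rank[0]


@reduce_with(make_subgroup_reducer(values_by_rank, ranks=[0, 1]))
def over_subgroup():
    return values_by_rank[0]


print(local_only())     # 1.0
print(over_subgroup())  # 3.0
```

In a real setup the subgroup reducer would presumably wrap a collective restricted to a process group, for example one created with `torch.distributed.new_group(ranks=...)` and passed as `group=` to `torch.distributed.all_reduce`.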

@iXce
Author

iXce commented Mar 17, 2023

> @iXce thanks for the RFC! We also wanted to be able to override the metrics decorators responsible for reducing data: #1288. So you could override the decorator with a no-op for your use cases, or provide a DDP subgroup and reduce over that subgroup. What do you think?

Oh yeah that looks super relevant indeed!

@vfdev-5
Collaborator

vfdev-5 commented Mar 21, 2023

@iXce would you like to implement this feature as suggested in #1288?
I'll update the issue with an API proposal and we can iterate on it. If you would like, you can join our Discord server, where we can discuss this more easily. Let us know what you think. Thanks!
