
update epoch metrics to use collections #1758

Closed · wants to merge 37 commits

Conversation

Moh-Yakoub
Contributor

Fixes #1757

Description: As per the title. The main idea is to allow EpochMetric to use collections of tensors.

This is a WIP to gather feedback about the approach; once it's approved, I will clean up the implementation and add more test cases.

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@github-actions github-actions bot added the module: metrics Metrics module label Mar 8, 2021
Collaborator

@vfdev-5 vfdev-5 left a comment


Thanks for the PR @Moh-Yakoub! The approach is good; let's simplify a few things and it will be fine.

Review threads (resolved): ignite/metrics/epoch_metric.py (x2), ignite/utils.py
@github-actions github-actions bot removed the module: utils Utils module label Mar 13, 2021
@Moh-Yakoub
Contributor Author

@vfdev-5 I've noticed that specifying the output type of the sequence/mapping caused all tests to fail with
TypeError: 'ABCMeta' object is not subscriptable. Any idea what may be causing this?

I am not able to reproduce this locally; all my tests pass locally.

@vfdev-5
Collaborator

vfdev-5 commented Mar 15, 2021

@vfdev-5 I've noticed that specifying the output type of the sequence/mapping caused all tests to fail with
TypeError: 'ABCMeta' object is not subscriptable. Any idea what may be causing this?

I am not able to reproduce this locally; all my tests pass locally.

Maybe this is the answer: #1758 (comment). Otherwise, please try to Google it.
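A likely cause (an assumption on my part, not confirmed in this thread): subscripting a collections.abc class such as Sequence is only supported from Python 3.9 on (PEP 585), so CI jobs running an older interpreter fail with exactly this error while a local 3.9+ interpreter does not. A minimal reproduction:

```python
import sys
import typing
from collections.abc import Sequence

# Subscripting the collections.abc ABC directly only works on Python >= 3.9
# (PEP 585); on older interpreters it raises:
#   TypeError: 'ABCMeta' object is not subscriptable
if sys.version_info >= (3, 9):
    print(Sequence[int])  # fine on 3.9+
else:
    try:
        Sequence[int]
    except TypeError as exc:
        print(exc)

# typing.Sequence is subscriptable on every supported version:
print(typing.Sequence[int])
```

Using the typing module aliases instead of the ABCs keeps annotations working on every interpreter the CI matrix covers.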

@vfdev-5
Collaborator

vfdev-5 commented Mar 15, 2021

@Moh-Yakoub can we fix this one as a priority, please?

@Moh-Yakoub Moh-Yakoub changed the title [WIP] update epoch metrics to use collections update epoch metrics to use collections Mar 15, 2021
ignite/metrics/epoch_metric.py Show resolved Hide resolved
ignite/metrics/epoch_metric.py Outdated Show resolved Hide resolved
@Moh-Yakoub
Contributor Author

@vfdev-5 I am getting a lot of RuntimeError: connect() timed out test failures, and on Horovod another one: Engine run is terminating due to exception: Mismatched data types: One rank had type int64, but another rank had type float32. Do you suspect it's related to the broadcasting code?

@vfdev-5
Collaborator

vfdev-5 commented Mar 16, 2021

@Moh-Yakoub yes, the failure is real; we actually got the implementation wrong when broadcasting tensors.

The issue is with data types:

        result = 0.0  # <----- every proc has result as a float scalar
        if idist.get_rank() == 0:
            # Run compute_fn on zero rank only
            result = self.compute_fn(_prediction_tensor, _target_tensor)  # <---- now result is a tensor (list of tensors, etc.) on rank 0

        # compute_fn outputs: scalars, tensors, tuple/list/mapping of tensors.
        if not _is_scalar_or_collection_of_tensor(result):
            raise TypeError(
                "output not supported: compute_fn should return scalar, tensor, tuple/list/mapping of tensors"
            )

        if ws > 1:
            # broadcast result to all processes
            # !!!! BUG: WE TRY TO BROADCAST A TENSOR INTO SCALAR PLACEHOLDERS ON THE OTHER PROCS...
            return apply_to_type(  # type: ignore
                result, (torch.Tensor, float, int), partial(idist.broadcast, src=0),
            )

I have to think about that...
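The mismatch can be sketched without a distributed backend. Below, apply_to_type and fake_broadcast are toy, hypothetical stand-ins for ignite's apply_to_type and idist.broadcast (not the real implementations); they show why the per-rank sequences of collective calls diverge:

```python
from functools import partial

def apply_to_type(obj, types, fn):
    # Minimal sketch: apply fn to leaves of a matching type,
    # recurse into lists/tuples/dicts, leave everything else untouched.
    if isinstance(obj, types):
        return fn(obj)
    if isinstance(obj, (list, tuple)):
        return type(obj)(apply_to_type(o, types, fn) for o in obj)
    if isinstance(obj, dict):
        return {k: apply_to_type(v, types, fn) for k, v in obj.items()}
    return obj

calls = []
def fake_broadcast(x, src=0):
    # Record each would-be collective call instead of communicating.
    calls.append(type(x).__name__)
    return x

# Rank 0 computed a mapping of two values (floats stand in for tensors);
# every other rank still holds the scalar placeholder 0.0.
rank0_result = {"precision": 0.5, "recall": 0.7}
other_result = 0.0

calls.clear()
apply_to_type(rank0_result, (float, int), partial(fake_broadcast, src=0))
rank0_calls = list(calls)   # two broadcast calls on rank 0

calls.clear()
apply_to_type(other_result, (float, int), partial(fake_broadcast, src=0))
other_calls = list(calls)   # only one broadcast call on the other ranks

print(rank0_calls, other_calls)
```

Rank 0 issues two collective calls while the other ranks issue one, so the processes desynchronize and the backend reports mismatched types or hangs.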

@vfdev-5 vfdev-5 mentioned this pull request Mar 22, 2021
@sdesrozis
Contributor

sdesrozis commented Mar 22, 2021

@Moh-Yakoub yes, the failure is real; we actually got the implementation wrong when broadcasting tensors.

I understand the need, but I'm quite surprised. It means that every process would not know the type being handled. That looks weird to me; maybe it's because I'm familiar with strongly typed languages.

Could the return type of self.compute_fn be a union of possible types?

@vfdev-5
Collaborator

vfdev-5 commented Mar 22, 2021

Could the return type of self.compute_fn be a union of possible types?

@sdesrozis it is a union of known types: a scalar or a sequence/mapping/tuple of tensors.

@vfdev-5
Collaborator

vfdev-5 commented Apr 20, 2021

@Moh-Yakoub I merged PR #1839 and it should unblock this one. So we can now safely write it like this:

        result = None
        if idist.get_rank() == 0:
            # Run compute_fn on zero rank only
            result = self.compute_fn(_prediction_tensor, _target_tensor)

        # compute_fn outputs: scalars, tensors, tuple/list/mapping of tensors.
        if not _is_scalar_or_collection_of_tensor(result):
            raise TypeError(
                "output not supported: compute_fn should return scalar, tensor, tuple/list/mapping of tensors"
            )

        if ws > 1:
            # broadcast result to all processes
            return apply_to_type(  # type: ignore
                result, (torch.Tensor, float, int), partial(idist.broadcast, src=0, safe_mode=True),
            )
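For illustration, here is a single-process toy model of what safe_mode buys (the real idist.broadcast is a collective across processes; values_per_rank and safe_broadcast below are hypothetical names, not ignite API): non-source ranks may hold None placeholders and still end up with the source's value.

```python
def safe_broadcast(values_per_rank, src=0):
    # Toy model of a safe-mode broadcast: the src rank's value is
    # replicated to every rank, regardless of what the other ranks
    # held locally (here, None placeholders).
    src_value = values_per_rank[src]
    return [src_value for _ in values_per_rank]

# Rank 0 computed a result; ranks 1 and 2 hold None.
ranks = {0: 3.14, 1: None, 2: None}
out = safe_broadcast([ranks[r] for r in sorted(ranks)], src=0)
print(out)  # every rank ends up with the rank-0 value
```

With safe_mode, the non-zero ranks no longer need a type-correct placeholder, which is exactly what the buggy scalar-placeholder pattern above lacked.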

@Moh-Yakoub
Contributor Author

@vfdev-5 Thanks a lot for the info and sorry for the late reply. I will work on this now.

Collaborator

@vfdev-5 vfdev-5 left a comment


@Moh-Yakoub I saw you updated the code according to #1758 (comment), but it still does not use safe_mode to broadcast... Please update the PR and remove the comments. Thanks!

@Moh-Yakoub
Contributor Author

@vfdev-5 I've already removed the comments related to broadcasts.

I'm already using

return apply_to_type(  # type: ignore
    result, (torch.Tensor, float, int), partial(idist.broadcast, src=0, safe_mode=True),
)

So safe_mode is passed. Is there anything else needed to get it through to the broadcast method?

@vfdev-5
Collaborator

vfdev-5 commented Apr 28, 2021

@Moh-Yakoub any idea why CI is still failing?

@@ -22,7 +22,8 @@ def test_no_sklearn(mock_no_sklearn):
RocCurve()


def test_roc_curve():
# TODO uncomment those once #1700 is merge
Collaborator


Please remove all these comments!

@@ -287,7 +287,7 @@ def test_distrib_gpu(distributed_context_single_node_nccl):

@pytest.mark.distributed
@pytest.mark.skipif(not idist.has_native_dist_support, reason="Skip if no native dist support")
def test_distrib_cpu(distributed_context_single_node_gloo):
def _test_distrib_cpu(distributed_context_single_node_gloo):
Collaborator


This was a temporary way to disable the test; let's re-enable these tests once CI is passing on the epoch metric distributed tests.

@@ -282,7 +282,7 @@ def test_distrib_gpu(distributed_context_single_node_nccl):

@pytest.mark.distributed
@pytest.mark.skipif(not idist.has_native_dist_support, reason="Skip if no native dist support")
def test_distrib_cpu(distributed_context_single_node_gloo):
def _test_distrib_cpu(distributed_context_single_node_gloo):
Collaborator


Same here

@@ -25,7 +25,7 @@ def test_no_sklearn(mock_no_sklearn):
pr_curve.compute()


def test_precision_recall_curve():
def _test_precision_recall_curve():
Collaborator


Same here

# compute_fn outputs: scalars, tensors, tuple/list/mapping of tensors.
if not _is_scalar_or_collection_of_tensor(result):
raise TypeError(
"output not supported: compute_fn should return scalar, tensor, tuple/list/mapping of tensors"
Collaborator


Suggested change:
-                "output not supported: compute_fn should return scalar, tensor, tuple/list/mapping of tensors"
+                "output not supported: compute_fn should return scalar, tensor, tuple/list/mapping of tensors, "
+                f"got {type(result)}"

Comment on lines +141 to +142
# compute_fn outputs: scalars, tensors, tuple/list/mapping of tensors.
if not _is_scalar_or_collection_of_tensor(result):
Collaborator


This check should be inside the if idist.get_rank() == 0: block, I think.
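A sketch of what moving the check under the rank-0 branch could look like. FakeTensor, _is_scalar_or_collection_of_tensor, and compute_on_rank_zero below are illustrative stand-ins written for this sketch, not the PR's actual code (FakeTensor replaces torch.Tensor so the example runs without torch):

```python
class FakeTensor:
    """Stand-in for torch.Tensor so this sketch runs without torch installed."""

def _is_scalar_or_collection_of_tensor(result, tensor_type=FakeTensor):
    # Hypothetical validator mirroring the helper discussed in this PR:
    # accept scalars, tensors, and non-empty tuple/list/mapping of tensors.
    if isinstance(result, (float, int, tensor_type)):
        return True
    if isinstance(result, (list, tuple)) and result:
        return all(isinstance(r, tensor_type) for r in result)
    if isinstance(result, dict) and result:
        return all(isinstance(r, tensor_type) for r in result.values())
    return False

def compute_on_rank_zero(rank, compute_fn):
    result = None
    if rank == 0:
        result = compute_fn()
        # Validate only on the rank that actually computed the result;
        # the other ranks still hold the None placeholder, which would
        # otherwise fail the check spuriously.
        if not _is_scalar_or_collection_of_tensor(result):
            raise TypeError(
                "output not supported: compute_fn should return scalar, "
                "tensor, tuple/list/mapping of tensors"
            )
    return result

print(compute_on_rank_zero(0, lambda: {"p": FakeTensor(), "r": FakeTensor()}))
print(compute_on_rank_zero(1, lambda: None))  # no validation on non-zero ranks
```

Validating inside the rank-0 branch avoids raising on ranks whose placeholder was never meant to be a valid compute_fn output.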

@vfdev-5
Collaborator

vfdev-5 commented May 8, 2021

@Moh-Yakoub do you plan to finish this PR?

@Moh-Yakoub
Contributor Author

@Moh-Yakoub do you plan to finish this PR?

@vfdev-5 this PR has lagged behind by now. Are there other PRs that have solved the issue? Otherwise, I can continue with this one.

@vfdev-5
Collaborator

vfdev-5 commented Jul 29, 2021

@Moh-Yakoub we actually figured out that it is not possible to do what we want here this way. I asked @KickItLikeShika to work on a simplified version of this feature. I propose you close this one and maybe tackle something else from the list: https://github.com/pytorch/ignite/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22

What do you think?

@Moh-Yakoub
Contributor Author

@Moh-Yakoub we actually figured out that it is not possible to do what we want here this way. I asked @KickItLikeShika to work on a simplified version of this feature. I propose you close this one and maybe tackle something else from the list: https://github.com/pytorch/ignite/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22

What do you think?

Sure, that sounds good; closing this now.

@Moh-Yakoub Moh-Yakoub closed this Jul 30, 2021
Labels: module: metrics, module: utils
Linked issue: More EpochMetric's compute_fn output types
4 participants