update epoch metrics to use collections #1758
Conversation
Thanks for the PR @Moh-Yakoub! The approach is good; let's simplify a few things and it will be OK.
Force-pushed from f9a3f45 to 15813c3 (compare).
@vfdev-5 I've noticed that specifying the output type of the sequence/mapping caused all tests to fail because of the following error, which I am not able to reproduce locally; all my tests pass locally.
Maybe this is the answer: #1758 (comment)
@Moh-Yakoub can we fix this one as a priority, please?
@vfdev-5 I am getting a lot of
@Moh-Yakoub yes, the failure is real, and actually we got the implementation wrong when using broadcast on tensors. The issue is with data types.
I have to think about that...
I understand the need, but I'm quite surprised. It means that every process would not know the handled type, which looks weird to me. Maybe that's because I'm familiar with strongly typed languages. Could the return type of
@sdesrozis it is a union of known types: a scalar or a sequence/mapping/tuple of tensors.
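For illustration, such a union check could look like the sketch below. The names and the `FakeTensor` stand-in are hypothetical (the real helper would check against `torch.Tensor`); this just shows the shape of the accepted union:

```python
from collections.abc import Mapping, Sequence
from numbers import Number


class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch."""


def is_scalar_or_collection_of_tensor(value, tensor_type=FakeTensor):
    # Accept plain Python scalars and single tensors.
    if isinstance(value, (Number, tensor_type)):
        return True
    # Accept mappings whose values are all tensors.
    if isinstance(value, Mapping):
        return all(isinstance(v, tensor_type) for v in value.values())
    # Accept sequences/tuples of tensors (but not strings/bytes).
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        return all(isinstance(v, tensor_type) for v in value)
    return False
```

Anything outside this union (e.g. a string, or a list of non-tensors) would be rejected by the type check discussed below.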
@Moh-Yakoub I merged this PR #1839 and it should unblock this PR. So, we can now write safely like that:

```python
result = None
if idist.get_rank() == 0:
    # Run compute_fn on zero rank only
    result = self.compute_fn(_prediction_tensor, _target_tensor)
    # compute_fn outputs: scalars, tensors, tuple/list/mapping of tensors.
    if not _is_scalar_or_collection_of_tensor(result):
        raise TypeError(
            "output not supported: compute_fn should return scalar, tensor, tuple/list/mapping of tensors"
        )

if ws > 1:
    # broadcast result to all processes
    return apply_to_type(  # type: ignore
        result, (torch.Tensor, float, int), partial(idist.broadcast, src=0, safe_mode=True),
    )
```
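The `apply_to_type` call above recursively walks a nested structure and applies the given function to every leaf matching the listed types, preserving the container shape. A dependency-free sketch of that idea (a hypothetical simplification, not ignite's actual implementation):

```python
def apply_to_type(value, leaf_types, fn):
    """Recursively apply fn to every element of the given leaf types,
    preserving the container structure (dict/list/tuple)."""
    if isinstance(value, leaf_types):
        return fn(value)
    if isinstance(value, dict):
        return {k: apply_to_type(v, leaf_types, fn) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(apply_to_type(v, leaf_types, fn) for v in value)
    # Leave anything else untouched.
    return value


# Example: double every int inside a nested structure.
doubled = apply_to_type({"a": 1, "b": [2, 3]}, (int,), lambda x: x * 2)
```

In the PR's case, the leaf function is `partial(idist.broadcast, src=0, safe_mode=True)`, so every tensor/scalar leaf of `result` gets broadcast from rank 0 while the surrounding dict/list/tuple structure is rebuilt on every rank.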
@vfdev-5 Thanks a lot for the info and sorry for the late reply. I will work on this now. |
@Moh-Yakoub I saw you updated the code according to #1758 (comment), but it still does not use safe_mode to broadcast... Please update the PR and remove the comments. Thanks!
@vfdev-5 I've already removed the comments. Regarding broadcast, I'm already using
So safe_mode is passed. Is there anything else needed to pass it to the broadcast method?
@Moh-Yakoub any ideas why CI is still failing?
```diff
@@ -22,7 +22,8 @@ def test_no_sklearn(mock_no_sklearn):
         RocCurve()


 def test_roc_curve():
     # TODO uncomment those once #1700 is merge
```
Please remove all these comments!
```diff
@@ -287,7 +287,7 @@ def test_distrib_gpu(distributed_context_single_node_nccl):

 @pytest.mark.distributed
 @pytest.mark.skipif(not idist.has_native_dist_support, reason="Skip if no native dist support")
-def test_distrib_cpu(distributed_context_single_node_gloo):
+def _test_distrib_cpu(distributed_context_single_node_gloo):
```
This was a temp way to disable test, let's enable those tests once the CI is passing on epoch metric distrib tests.
```diff
@@ -282,7 +282,7 @@ def test_distrib_gpu(distributed_context_single_node_nccl):

 @pytest.mark.distributed
 @pytest.mark.skipif(not idist.has_native_dist_support, reason="Skip if no native dist support")
-def test_distrib_cpu(distributed_context_single_node_gloo):
+def _test_distrib_cpu(distributed_context_single_node_gloo):
```
Same here
```diff
@@ -25,7 +25,7 @@ def test_no_sklearn(mock_no_sklearn):
     pr_curve.compute()


-def test_precision_recall_curve():
+def _test_precision_recall_curve():
```
Same here
```python
# compute_fn outputs: scalars, tensors, tuple/list/mapping of tensors.
if not _is_scalar_or_collection_of_tensor(result):
    raise TypeError(
        "output not supported: compute_fn should return scalar, tensor, tuple/list/mapping of tensors"
```
"output not supported: compute_fn should return scalar, tensor, tuple/list/mapping of tensors" | |
"output not supported: compute_fn should return scalar, tensor, tuple/list/mapping of tensors, " | |
f"got {type(result)}" |
```python
# compute_fn outputs: scalars, tensors, tuple/list/mapping of tensors.
if not _is_scalar_or_collection_of_tensor(result):
```
This check should be inside `if idist.get_rank() == 0:`, I think.
@Moh-Yakoub do you plan to finish this PR?
@vfdev-5 this PR has lagged behind now. Are there other PRs that have solved the issue? Otherwise I can continue this one.
@Moh-Yakoub actually we figured out that it is not possible to do what we want here like that. I asked @KickItLikeShika to work on a simplified version of this feature. I propose closing this one; maybe you could tackle something else from the list: https://github.com/pytorch/ignite/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22 What do you think?
Sure, that sounds good; closing this now.
Fixes #1757
Description: as titled. The main idea is to allow epoch metrics to use collections of tensors.
This is a WIP to gather feedback about the approach; once it's approved I will clean up the implementation and add more test cases.
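As a hypothetical illustration of what the PR enables (not the PR's actual test code), a `compute_fn` could then return a mapping of values instead of a single scalar:

```python
def compute_fn(y_pred, y_true):
    """Hypothetical compute_fn returning a mapping of metrics
    instead of a single scalar. y_pred/y_true stand in for the
    gathered prediction/target values."""
    tp = sum(1 for p, t in zip(y_pred, y_true) if p == t == 1)
    fp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # A mapping result is now an accepted output type.
    return {"precision": precision, "recall": recall}


metrics = compute_fn([1, 1, 0, 1], [1, 0, 0, 1])
```

With the changes in this PR, such a mapping would pass the output type check and each value would be broadcast to all processes.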
Check list: