
Updated precision_recall_curve.py #2490

Merged: 30 commits into pytorch:master, Mar 8, 2022
Conversation

@sayantan1410 (Contributor) commented Feb 24, 2022

Fixes #1284

Updated the precision_recall_curve.py file.
For testing, I ran the tests in tests/ignite/contrib/metrics/test_precision_recall_curve.py and all of them passed, but I haven't separately tested in a distributed environment, so it could still contain some issues.

@github-actions bot added the module: contrib (Contrib module) label on Feb 24, 2022
@sdesrozis (Contributor) commented Feb 24, 2022

@sayantan1410 Thanks! I left a few comments.

@sayantan1410 (Contributor, Author) commented Feb 24, 2022

@sdesrozis Corrected! Please check.
Also, to solve the DDP issue, I have to rewrite the module from scratch using the reset, update, and compute methods along with the necessary decorators, right?
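
(For reference, a minimal sketch of that pattern, following ignite's documented custom-metric API; the accumulator below is a hypothetical example, not this PR's code.)

    import torch
    from ignite.metrics.metric import Metric, reinit__is_reduced, sync_all_reduce

    class SumOfSquaredErrors(Metric):
        @reinit__is_reduced
        def reset(self) -> None:
            self._sum = torch.tensor(0.0, device=self._device)
            self._num_examples = 0

        @reinit__is_reduced
        def update(self, output) -> None:
            y_pred, y = output
            self._sum += torch.sum((y_pred - y) ** 2).to(self._device)
            self._num_examples += y.shape[0]

        @sync_all_reduce("_sum", "_num_examples")
        def compute(self) -> float:
            # after the all-reduce, _sum and _num_examples hold totals across all processes
            return self._sum.item() / self._num_examples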

@sdesrozis (Contributor) commented Feb 24, 2022

Good! However, the issue is not solved yet. The tests should be refactored like the Cohen's kappa ones; comparing the two, you can see the DDP tests are missing here. Could you try to improve the tests?

@sayantan1410 (Contributor, Author):

Yes, sure, I can do that, no problem!

@sdesrozis (Contributor):

> Yes, sure, I can do that, no problem!

Thanks! It should be very similar to tests/ignite/contrib/metrics/test_cohen_kappa.py!

@sayantan1410 (Contributor, Author):

@vfdev-5 I have updated the epoch_metric so that it can broadcast tuples. Let me know what to correct.

@sdesrozis (Contributor):

@sayantan1410 Thanks for the update. You should modify the tests to validate your improvement. See the file tests/ignite/metrics/test_epoch_metric.py.

@sayantan1410 (Contributor, Author):

> @sayantan1410 Thanks for the update. You should modify the tests to validate your improvement. See the file tests/ignite/metrics/test_epoch_metric.py.

Okay, will check it out.

@sayantan1410 (Contributor, Author):

Hey, can you tell me why all the tests are not run when autopep8 makes a commit?

@sdesrozis (Contributor) commented Mar 6, 2022

I don't know. Did you apply all the tools (black, flake8, etc.) before pushing?

Anyway, have a look here:

https://github.com/pytorch/ignite/runs/5441042415?check_suite_focus=true

It seems that a few things do not work in the code. Please format the code correctly and push.

@sayantan1410 (Contributor, Author):

It is asking me to add a blank line between lines 162 and 163 in test_precision_recall_curve.py, but that was added by the autopep8 commit just after; I cannot figure out how to run the check for that commit. The problem arose because I hadn't formatted the code with the linter before pushing.

@sayantan1410 (Contributor, Author):

@sdesrozis @vfdev-5 The Windows, TPU, and macOS tests pass in the distributed environment, but the Ubuntu and HVD tests are failing. Can you help me with this?

@sdesrozis (Contributor):

@sayantan1410 I think your tests are failing somewhere. The test test_precision_recall_curve::test_distrib_gloo_cpu_or_gpu is the first one that fails, and since it is a distributed one, it causes all the other distributed tests to fail.

Did you try to launch the parallel tests on your own machine?

@sdesrozis (Contributor) commented Mar 7, 2022

I pushed a fix (I hope). I used the safe mode of idist.broadcast(), and the NumPy arrays are converted to tensors beforehand.

Let's see.
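
(A minimal sketch of that approach, assuming ignite's idist.broadcast with its safe_mode flag; the array here is purely illustrative.)

    import numpy as np
    import torch
    import ignite.distributed as idist

    # convert the NumPy result to a tensor first; with safe_mode=True,
    # non-source ranks may pass None and still receive the broadcast value
    precision = np.array([0.5, 0.75, 1.0])
    tensor = torch.from_numpy(precision) if idist.get_rank() == 0 else None
    tensor = idist.broadcast(tensor, src=0, safe_mode=True)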

@sdesrozis (Contributor) commented Mar 7, 2022

@sayantan1410 Could you fix the tests? The results are not equal to scikit-learn's because the tensors were not converted to NumPy in the asserts. Thanks!
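
(A hypothetical helper illustrating the fix; prc is assumed to be the PrecisionRecallCurve instance and the sk_* values the scikit-learn references from the test.)

    import numpy as np

    def check_against_sklearn(prc, sk_precision, sk_recall, sk_thresholds):
        precision, recall, thresholds = prc.compute()
        # convert the metric's tensors to NumPy before comparing with scikit-learn
        assert np.allclose(precision.cpu().numpy(), sk_precision)
        assert np.allclose(recall.cpu().numpy(), sk_recall)
        assert np.allclose(thresholds.cpu().numpy(), sk_thresholds)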

@sayantan1410 (Contributor, Author):

> @sayantan1410 I think your tests are failing somewhere. The test test_precision_recall_curve::test_distrib_gloo_cpu_or_gpu is the first one that fails, and since it is a distributed one, it causes all the other distributed tests to fail.
>
> Did you try to launch the parallel tests on your own machine?

Yes, and the distributed tests have passed locally on CPU as well, but I cannot run the tests on GPU as I do not have a GPU on my machine.

@sayantan1410 (Contributor, Author):

> @sayantan1410 Could you fix the tests? The results are not equal to scikit-learn's because the tensors were not converted to NumPy in the asserts. Thanks!

Thank you so much for the help, I will surely do that.

@sdesrozis (Contributor):

> Yes, and the distributed tests have passed locally on CPU as well, but I cannot run the tests on GPU as I do not have a GPU on my machine.

You didn't run the distributed tests because they didn't work; no GPU is needed. In the tests folder, you must run run_cpu_tests.py locally instead of pytest. Explore the script to see how to run the distributed tests 😉

@sdesrozis (Contributor) left a review comment:

@sayantan1410 You did it! Thanks a lot for this work!

@sdesrozis sdesrozis merged commit ccbd6af into pytorch:master Mar 8, 2022
@vfdev-5 (Collaborator) commented Mar 8, 2022

@sayantan1410 @sdesrozis there is an issue with this PR, check this GPU CI job: https://app.circleci.com/pipelines/github/pytorch/ignite/2510/workflows/ced3dff0-05c4-4c92-b695-9c325b48af93/jobs/7720

____________________________ test_distrib_nccl_gpu _____________________________

distributed_context_single_node_nccl = {'local_rank': 0, 'rank': 0, 'world_size': 1}

    @pytest.mark.distributed
    @pytest.mark.skipif(not idist.has_native_dist_support, reason="Skip if no native dist support")
    @pytest.mark.skipif(torch.cuda.device_count() < 1, reason="Skip if no GPU")
    def test_distrib_nccl_gpu(distributed_context_single_node_nccl):
    
        device = idist.device()
>       _test_distrib_compute(device)

tests/ignite/contrib/metrics/test_precision_recall_curve.py:251: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/ignite/contrib/metrics/test_precision_recall_curve.py:191: in _test_distrib_compute
    _test(y_pred, y, batch_size, idist.device())
tests/ignite/contrib/metrics/test_precision_recall_curve.py:168: in _test
    res = prc.compute()
ignite/contrib/metrics/precision_recall_curve.py:103: in compute
    precision, recall, thresholds = self.compute_fn(_prediction_tensor, _target_tensor)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

y_preds = tensor([1, 1, 0, 1, 1, 0, 1, 1, 0, 0], device='cuda:0')
y_targets = tensor([0, 1, 0, 1, 1, 1, 0, 1, 0, 0], device='cuda:0')

    def precision_recall_curve_compute_fn(y_preds: torch.Tensor, y_targets: torch.Tensor) -> Tuple[Any, Any, Any]:
        try:
            from sklearn.metrics import precision_recall_curve
        except ImportError:
            raise RuntimeError("This contrib module requires sklearn to be installed.")
    
>       y_true = y_targets.numpy()
E       TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

ignite/contrib/metrics/precision_recall_curve.py:16: TypeError
_________________________ test_distrib_gloo_cpu_or_gpu _________________________

distributed_context_single_node_gloo = {'local_rank': 0, 'rank': 0, 'world_size': 1}

    @pytest.mark.distributed
    @pytest.mark.skipif(not idist.has_native_dist_support, reason="Skip if no native dist support")
    def test_distrib_gloo_cpu_or_gpu(distributed_context_single_node_gloo):
    
        device = idist.device()
>       _test_distrib_compute(device)

tests/ignite/contrib/metrics/test_precision_recall_curve.py:260: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/ignite/contrib/metrics/test_precision_recall_curve.py:191: in _test_distrib_compute
    _test(y_pred, y, batch_size, idist.device())
tests/ignite/contrib/metrics/test_precision_recall_curve.py:168: in _test
    res = prc.compute()
ignite/contrib/metrics/precision_recall_curve.py:103: in compute
    precision, recall, thresholds = self.compute_fn(_prediction_tensor, _target_tensor)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

y_preds = tensor([1, 1, 0, 1, 1, 0, 1, 1, 0, 0], device='cuda:0')
y_targets = tensor([0, 1, 0, 1, 1, 1, 0, 1, 0, 0], device='cuda:0')

    def precision_recall_curve_compute_fn(y_preds: torch.Tensor, y_targets: torch.Tensor) -> Tuple[Any, Any, Any]:
        try:
            from sklearn.metrics import precision_recall_curve
        except ImportError:
            raise RuntimeError("This contrib module requires sklearn to be installed.")
    
>       y_true = y_targets.numpy()
E       TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

@sdesrozis (Contributor):

I'm on it.

@vfdev-5 (Collaborator) left a review comment:

@sayantan1410 Thanks for the PR, and sorry for the delay. Please send another PR to address my review comments.


    def compute(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        if len(self._predictions) < 1 or len(self._targets) < 1:
            raise NotComputableError("EpochMetric must have at least one example before it can be computed.")
@vfdev-5 (Collaborator):

- "EpochMetric must have ..."
+ "PrecisionRecallCurve must have ..."

@sayantan1410 (Contributor, Author):

Fixing it soon.

Comment on lines +104 to +108

        precision = torch.Tensor(precision)
        recall = torch.Tensor(recall)
        # thresholds can have negative strides, not compatible with torch tensors
        # https://discuss.pytorch.org/t/negative-strides-in-tensor-error/134287/2
        thresholds = torch.Tensor(thresholds.copy())
@vfdev-5 (Collaborator):

Tensor creation should be done with torch.tensor and not torch.Tensor
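
(The difference in one example: torch.Tensor is the legacy float32 constructor, while torch.tensor infers the dtype from its data.)

    import torch

    torch.Tensor([1, 2, 3])  # tensor([1., 2., 3.]) -- always float32, ints are silently cast
    torch.tensor([1, 2, 3])  # tensor([1, 2, 3])    -- dtype inferred as int64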

@sayantan1410 (Contributor, Author):

Okay, will change it.

@sayantan1410 (Contributor, Author):

@sdesrozis Thank you for all the help, I got to learn a lot!


def _test_distrib_integration(device):

    rank = idist.get_rank()
@vfdev-5 (Collaborator) commented Mar 8, 2022

This test case does not use rank, and each distributed process generates its own random preds and true values. The PrecisionRecallCurve implementation should gather all data from all processes, but it is checked against the local process computation:

        np_y_true = y_true.cpu().numpy().ravel()
        np_y_preds = y_preds.cpu().numpy().ravel()
        sk_precision, sk_recall, sk_thresholds = precision_recall_curve(np_y_true, np_y_preds)

I would expect this to fail, but it looks like it is passing. @sayantan1410 can you check why that is?
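
(A sketch of what a matching reference computation could look like, gathering the per-process data with idist.all_gather before calling scikit-learn; names and sizes here are illustrative.)

    import torch
    import ignite.distributed as idist
    from sklearn.metrics import precision_recall_curve

    # per-process random data, as in the test
    y_true = torch.randint(0, 2, size=(50,))
    y_preds = torch.rand(50)

    # gather across all ranks so the scikit-learn reference sees the same
    # concatenated data that the metric's compute() gathered
    np_y_true = idist.all_gather(y_true).cpu().numpy().ravel()
    np_y_preds = idist.all_gather(y_preds).cpu().numpy().ravel()
    sk_precision, sk_recall, sk_thresholds = precision_recall_curve(np_y_true, np_y_preds)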

@sayantan1410 (Contributor, Author):

Yeah, sure, will check it!

@sdesrozis (Contributor) commented Mar 8, 2022

Actually, the error comes from the following test, which seems wrong (and needs a fix too).

@sayantan1410 Have a look at this correct implementation:

def _test(n_epochs, metric_device):

@sayantan1410 (Contributor, Author):

@sdesrozis Will check it soon!

@sayantan1410 (Contributor, Author):

@sdesrozis A small question:

def _test(n_epochs, metric_device):

Here we also don't have anything like idist.all_gather, so how is the data being gathered from all the processes?
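
(For context: the gathering happens inside the metric itself. ignite's EpochMetric, which PrecisionRecallCurve extends, calls idist.all_gather on its accumulated tensors in compute(); a simplified sketch of that flow, not the exact library code.)

    import torch
    import ignite.distributed as idist

    def compute(self):
        prediction_tensor = torch.cat(self._predictions, dim=0)
        target_tensor = torch.cat(self._targets, dim=0)
        if idist.get_world_size() > 1:
            # every process receives the concatenation of all processes' data
            prediction_tensor = idist.all_gather(prediction_tensor)
            target_tensor = idist.all_gather(target_tensor)
        return self.compute_fn(prediction_tensor, target_tensor)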

Labels: module: contrib (Contrib module)

Successfully merging this pull request may close these issues: Ensure that all core and contrib metrics can work in DDP

3 participants