[metrics] Accuracy Metric: Tensors must be CUDA and dense #2205

Closed
xiadingZ opened this issue Jun 16, 2020 · 25 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

@xiadingZ

I tried the new Accuracy metric, but it throws an error:

Traceback (most recent call last):
  File "main.py", line 139, in <module>
    main(hparams)
  File "main.py", line 69, in main
    trainer.fit(model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 820, in fit
    self.ddp_train(task, model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 502, in ddp_train
    self.run_pretrain_routine(model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in run_pretrain_routine
    False)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 278, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 418, in evaluation_forward
    output = model(*args)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 558, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 96, in forward
    output = self.module.validation_step(*inputs[0], **kwargs[0])
  File "/mnt/lustre/maxiao1/PVM/models/baseline.py", line 374, in validation_step
    acc = self.accuracy(labels_hat, labels)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/metrics/metric.py", line 147, in __call__
    return apply_to_collection(self._orig_call(*args, **kwargs), torch.Tensor,
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/metrics/converters.py", line 59, in new_func
    return func_to_apply(result, *dec_args, **dec_kwargs)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 26, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/metrics/converters.py", line 244, in _sync_ddp_if_available
    async_op=False)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

This is my code:

            pred = pred.view(-1, pred.shape[-1])
            labels = labels.view(-1)
            valid_index = torch.where(labels != -1)
            # select valid part to calculate
            pred = pred[valid_index].contiguous()
            labels = labels[valid_index].contiguous()
            loss = self.loss_fn(pred, labels)
            labels_hat = torch.argmax(pred, dim=1).type_as(labels)
            acc = self.accuracy(labels_hat, labels)

I also have a question: TensorMetric's default reduce_op is SUM. Does it automatically calculate the average accuracy?

xiadingZ added the bug and help wanted labels Jun 16, 2020
@justusschock
Member

1.) Which devices are labels_hat and labels on? Are you running in a DDP environment?

2.) No, it doesn't. It does what it says (calculates the sum). Unfortunately, there is no DDP reduction op that calculates the average. For averaging, you still need to divide by the size of your process group.
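
As a rough illustration of point 2, the sketch below simulates a SUM reduction across ranks with plain Python numbers and then divides by the process-group size. The function names `simulated_all_reduce_sum` and `average_over_ranks` are hypothetical and not part of PyTorch Lightning; in a real DDP run the divisor would be the world size (for example, as reported by `torch.distributed.get_world_size()`).

```python
from typing import Sequence


def simulated_all_reduce_sum(per_rank_values: Sequence[float]) -> float:
    """Stand-in for a SUM all-reduce: every rank ends up with the total."""
    return float(sum(per_rank_values))


def average_over_ranks(per_rank_values: Sequence[float]) -> float:
    """Divide the SUM-reduced value by the number of ranks to get the mean."""
    world_size = len(per_rank_values)
    if world_size == 0:
        raise ValueError("need at least one rank")
    return simulated_all_reduce_sum(per_rank_values) / world_size


# Hypothetical per-rank accuracies from a 4-process run.
per_rank_acc = [0.5, 0.75, 1.0, 0.75]
summed = simulated_all_reduce_sum(per_rank_acc)
mean_acc = average_over_ranks(per_rank_acc)
print(summed, mean_acc)
```

Note that this plain mean weights every rank equally. If ranks see different numbers of valid samples (as can happen with a `labels != -1` mask), summing the correct and total counts separately and dividing once would give the exact accuracy.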

@xiadingZ
Author

This is my code:

            imgs = batch['imgs']
            labels = batch['labels']
            result = self(imgs)

            pred = result['total']
            pred = pred.view(-1, pred.shape[-1])
            labels = labels.view(-1)
            valid_index = torch.where(labels != -1)
            # select valid part to calculate
            pred = pred[valid_index]
            labels = labels[valid_index]
            loss = self.loss_fn(pred, labels)
            labels_hat = torch.argmax(pred, dim=1).type_as(labels)
            acc = self.accuracy(labels_hat, labels)

I'm running in a DDP environment. I think labels is automatically transferred to one GPU device, and I use type_as to ensure labels_hat is on the same device as labels.

@justusschock
Member

Can you try calling .contiguous() on the tensors beforehand?

@xiadingZ
Author

I tried it on labels and labels_hat, but it doesn't work.

@justusschock
Member

Do you use sparse tensors?

@xiadingZ
Author

No
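
The two questions asked so far (device and sparsity) can be checked right before the metric call. The helper below is a hypothetical diagnostic, not part of PyTorch Lightning. It relies only on the `is_cuda` and `is_sparse` attributes that `torch.Tensor` exposes, so the demo uses a small stand-in class instead of real tensors to stay runnable without a GPU.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Stand-in exposing the two torch.Tensor attributes the check reads."""
    is_cuda: bool
    is_sparse: bool


def nccl_problems(name: str, tensor) -> list:
    """Return reasons why `tensor` would be rejected as 'not CUDA and dense'."""
    problems = []
    if not tensor.is_cuda:
        problems.append(f"{name} is on the CPU")
    if tensor.is_sparse:
        problems.append(f"{name} is sparse")
    return problems


# Assumed scenario: one result lives on the CPU, the other on the GPU.
cpu_result = FakeTensor(is_cuda=False, is_sparse=False)
gpu_result = FakeTensor(is_cuda=True, is_sparse=False)
print(nccl_problems("acc", cpu_result))
print(nccl_problems("acc", gpu_result))
```

With real tensors the call would look like `nccl_problems("labels_hat", labels_hat)`. In the traceback above, the tensor being reduced is the metric's return value, so checking that value as well as the inputs may be informative.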

@xiadingZ
Author

I also think the docs should add an example for 2). The current code example in the docs is:

# PyTorch Lightning
class MyModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.metric = Accuracy()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = ...
        acc = self.metric(y_hat, y)

It says the metric can run in DDP mode, but it doesn't say that we need to divide by the size of the process group by hand when using DDP.

@justusschock
Member

But it also does not state that it calculates the mean. I will have a look at how much work it is to integrate this.

Borda added the priority: 0 (High priority task) label Jun 16, 2020
edenlightning changed the title from "Accuracy Metric: Tensors must be CUDA and dense" to "[metrics] Accuracy Metric: Tensors must be CUDA and dense" Jun 17, 2020
@SkafteNicki
Member

@xiadingZ are you still facing the RuntimeError: Tensors must be CUDA and dense error?
Your second point, about dividing the result by the process group size, can be achieved by setting the reduce_op argument to either avg or mean (solved by PR #2568).

edenlightning added this to the 0.9.x milestone Sep 16, 2020
@edenlightning
Contributor

Closing this. Please comment if this needs to be reopened.

@wconnell

wconnell commented Oct 3, 2020

@xiadingZ are you still facing the RuntimeError: Tensors must be CUDA and dense error?

I am running into this issue using the R2Score metric, with the same traceback.

@SkafteNicki
Member

@wconnell I am not able to reproduce this on master using R2Score. Do you have a code example that can reproduce the error?

@limberc
Contributor

limberc commented Apr 5, 2021

I got the same problem.

I believe some value was assigned or computed as NaN in this case.

It was solved once no NaN values were recorded.
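
If NaN values are suspected, as in the comment above, one simple way to confirm it is to scan the values that feed the metric before updating it. This is a minimal sketch with a hypothetical helper `first_nan_index`. It assumes the values have already been converted to plain Python floats (for a 1-D tensor, `tensor.tolist()` would do that). Whether NaN is actually what triggers the error is the commenter's belief, not something the thread confirms.

```python
import math
from typing import Iterable, Optional


def first_nan_index(values: Iterable[float]) -> Optional[int]:
    """Return the position of the first NaN in `values`, or None if there is none."""
    for index, value in enumerate(values):
        if math.isnan(value):
            return index
    return None


# Hypothetical per-sample losses; the third one is NaN.
losses = [0.31, 0.27, float("nan"), 0.22]
print(first_nan_index(losses))
print(first_nan_index([0.31, 0.27]))
```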

@andrewssobral

Hello All,

I hit the same issue when running the following code:
https://gist.github.com/andrewssobral/090dcab34308bdd1ed75e5f2f6b4a1d0

For info, this code was tested with the previous version of PyTorch Lightning and had no issues.

This is the output I got:

$ python pytorch_lightning_distributed_training.py --accelerator ddp --gpus 1 --max_epochs 3
Namespace(accelerator='ddp', accumulate_grad_batches=1, amp_backend='native', amp_level='O2', auto_lr_find=False, auto_scale_batch_size=False, auto_select_gpus=False, batch_size=64, benchmark=False, check_val_every_n_epoch=1, checkpoint_callback=True, dataset_path='./', default_root_dir=None, deterministic=False, devices=None, distributed_backend=None, fast_dev_run=False, flush_logs_every_n_steps=100, gpus=1, gradient_clip_algorithm='norm', gradient_clip_val=0.0, ipus=None, learning_rate=0.0002, limit_predict_batches=1.0, limit_test_batches=1.0, limit_train_batches=1.0, limit_val_batches=1.0, log_every_n_steps=50, log_gpu_memory=None, logger=True, max_epochs=3, max_steps=None, max_time=None, min_epochs=None, min_steps=None, model_checkpoint_enabled=False, model_checkpoint_path='checkpoints/', move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', num_nodes=1, num_processes=1, num_sanity_val_steps=2, optimizer='adam', overfit_batches=0.0, plugins=None, precision=32, prepare_data_per_node=True, process_position=0, profiler=None, progress_bar_refresh_rate=None, reload_dataloaders_every_epoch=False, reload_dataloaders_every_n_epochs=0, replace_sampler_ddp=True, resume_from_checkpoint=None, stochastic_weight_avg=False, sync_batchnorm=False, tensorboard_enabled=False, tensorboard_logdir='logs/', terminate_on_nan=False, tpu_cores=None, track_grad_norm=-1, truncated_bptt_steps=None, val_check_interval=1.0, weights_save_path=None, weights_summary='top')
Using existing MNIST data set at ./MNIST
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name             | Type       | Params
------------------------------------------------
0 | train_acc_metric | Accuracy   | 0     
1 | val_acc_metric   | Accuracy   | 0     
2 | model            | Sequential | 55.1 K
------------------------------------------------
55.1 K    Trainable params
0         Non-trainable params
55.1 K    Total params
0.220     Total estimated model params size (MB)
Validation sanity check:  50%|██████████████████████████████████████████████████████████████████████▌                                                                      | 1/2 [00:00<00:00,  2.51it/s]Epoch validation acc:  tensor(0.0469, device='cuda:0')
Epoch 0:   0%|                                                                                                                                                         | 0/939 [00:00<00:00, 5722.11it/s][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 143.42it/s, loss=0.346, v_num=][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)                                                                                  | 0/79 [00:00<?, ?it/s]
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 0:  99%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 928/939 [00:06<00:00, 139.63it/s, loss=0.346, v_num=]Epoch validation acc:  tensor(0.9118, device='cuda:0')██████████████████████████████████████████████████▉                                                               | 46/79 [00:00<00:00, 109.08it/s]
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 939/939 [00:06<00:00, 139.06it/s, loss=0.346, v_num=]Epoch training acc:  tensor(0.8115, device='cuda:0')                                                                                                                                                     
Epoch 1:   0%|                                                                                                                                     | 0/939 [00:00<00:00, 5102.56it/s, loss=0.346, v_num=][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 1:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 141.58it/s, loss=0.233, v_num=][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)                                                                                  | 0/79 [00:00<?, ?it/s]
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 1:  99%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 928/939 [00:06<00:00, 137.81it/s, loss=0.233, v_num=]Epoch validation acc:  tensor(0.9348, device='cuda:0')████████████████████████████████████████████████████▊                                                             | 47/79 [00:00<00:00, 105.55it/s]
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 939/939 [00:06<00:00, 137.19it/s, loss=0.233, v_num=]Epoch training acc:  tensor(0.9109, device='cuda:0')                                                                                                                                                     
Epoch 2:   0%|                                                                                                                                     | 0/939 [00:00<00:00, 4604.07it/s, loss=0.233, v_num=][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 2:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 137.53it/s, loss=0.224, v_num=][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)                                                                                  | 0/79 [00:00<?, ?it/s]
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 2:  99%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 928/939 [00:06<00:00, 134.71it/s, loss=0.224, v_num=]Epoch validation acc:  tensor(0.9428, device='cuda:0')████████████████████████████████████████████████████████████████▎                                                 | 53/79 [00:00<00:00, 124.79it/s]
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 939/939 [00:07<00:00, 134.27it/s, loss=0.224, v_num=]Epoch training acc:  tensor(0.9289, device='cuda:0')                                                                                                                                                     
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 939/939 [00:07<00:00, 134.07it/s, loss=0.224, v_num=]
/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/prints.py:37: UserWarning: The ``compute`` method of metric Accuracy was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "pytorch_lightning_distributed_training.py", line 280, in <module>
    main(args)
  File "pytorch_lightning_distributed_training.py", line 249, in main
    print("Metrics:\n", model.get_metrics())
  File "pytorch_lightning_distributed_training.py", line 115, in get_metrics
    train_acc = self.train_acc_metric.compute()
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/metric.py", line 368, in wrapped_func
    dist_sync_fn=self.dist_sync_fn, should_sync=self._to_sync, should_unsync=self._should_unsync
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/metric.py", line 342, in sync_context
    distributed_available=distributed_available
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/metric.py", line 288, in sync
    self._sync_dist(dist_sync_fn, process_group=process_group)
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/metric.py", line 228, in _sync_dist
    group=process_group or self.process_group,
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/data.py", line 195, in apply_to_collection
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/data.py", line 195, in <dictcomp>
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/data.py", line 191, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/distributed.py", line 124, in gather_all_tensors
    return _simple_gather_all_tensors(result, group, world_size)
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/distributed.py", line 94, in _simple_gather_all_tensors
    torch.distributed.all_gather(gathered_result, result, group)
  File "/home/proactive/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense

@andrewssobral

The issue I reported disappears when I downgrade torchmetrics to version 0.2.0, as you can see below.

Does anyone know how I can fix it for torchmetrics>=0.2.0?

$ python pytorch_lightning_distributed_training.py --accelerator ddp --gpus 1 --max_epochs 3
Namespace(accelerator='ddp', accumulate_grad_batches=1, amp_backend='native', amp_level='O2', auto_lr_find=False, auto_scale_batch_size=False, auto_select_gpus=False, batch_size=64, benchmark=False, check_val_every_n_epoch=1, checkpoint_callback=True, dataset_path='./', default_root_dir=None, deterministic=False, distributed_backend=None, fast_dev_run=False, flush_logs_every_n_steps=100, gpus=1, gradient_clip_algorithm='norm', gradient_clip_val=0.0, learning_rate=0.0002, limit_predict_batches=1.0, limit_test_batches=1.0, limit_train_batches=1.0, limit_val_batches=1.0, log_every_n_steps=50, log_gpu_memory=None, logger=True, max_epochs=3, max_steps=None, max_time=None, min_epochs=None, min_steps=None, model_checkpoint_enabled=False, model_checkpoint_path='checkpoints/', move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', num_nodes=1, num_processes=1, num_sanity_val_steps=2, optimizer='adam', overfit_batches=0.0, plugins=None, precision=32, prepare_data_per_node=True, process_position=0, profiler=None, progress_bar_refresh_rate=None, reload_dataloaders_every_epoch=False, replace_sampler_ddp=True, resume_from_checkpoint=None, stochastic_weight_avg=False, sync_batchnorm=False, tensorboard_enabled=False, tensorboard_logdir='logs/', terminate_on_nan=False, tpu_cores=None, track_grad_norm=-1, truncated_bptt_steps=None, val_check_interval=1.0, weights_save_path=None, weights_summary='top')
Using existing MNIST data set at ./MNIST
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name             | Type       | Params
------------------------------------------------
0 | train_acc_metric | Accuracy   | 0     
1 | val_acc_metric   | Accuracy   | 0     
2 | model            | Sequential | 55.1 K
------------------------------------------------
55.1 K    Trainable params
0         Non-trainable params
55.1 K    Total params
0.220     Total estimated model params size (MB)
Validation sanity check:  50%|██████████████████████████████████████████████████████████████████████▌                                                                      | 1/2 [00:00<00:00,  1.02it/s]Epoch validation acc:  tensor(0.1016, device='cuda:0')
Epoch 0:   0%|                                                                                                                                                                   | 0/939 [00:00<?, ?it/s][W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 142.59it/s, loss=0.333, v_num=]Epoch training acc:  tensor(0.8189, device='cuda:0')
Epoch 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍| 935/939 [00:07<00:00, 128.85it/s, loss=0.333, v_num=]Epoch validation acc:  tensor(0.9130, device='cuda:0')██████████████████████████████████████████████████████████████████████████▌                                        | 58/79 [00:00<00:00, 96.10it/s]
Epoch 1:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 139.76it/s, loss=0.275, v_num=]Epoch training acc:  tensor(0.9116, device='cuda:0')                                                                                                                                                     
Epoch 1:  97%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   | 912/939 [00:07<00:00, 123.82it/s, loss=0.275, v_num=]Epoch validation acc:  tensor(0.9328, device='cuda:0')█████████████████████████████████████████████████████████████████████████████████▌                                | 62/79 [00:00<00:00, 100.34it/s]
Epoch 2:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 138.73it/s, loss=0.211, v_num=]Epoch training acc:  tensor(0.9299, device='cuda:0')                                                                                                                                                     
Epoch 2:  97%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   | 912/939 [00:07<00:00, 122.98it/s, loss=0.211, v_num=]Epoch validation acc:  tensor(0.9446, device='cuda:0')██████████████████████████████████████████████████████████████████▉                                                | 54/79 [00:00<00:00, 87.77it/s]
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 939/939 [00:07<00:00, 123.56it/s, loss=0.211, v_num=]
Training accuracy on all data: 0.9299454689025879                                                                                                                                                        
Validation accuracy on all data: 0.944599986076355
Metrics:
 {'train_acc': 0.9299454689025879, 'val_acc': 0.944599986076355}

@SkafteNicki
Member

@andrewssobral which version of torchmetrics did you run in the first example? (Just trying to figure out whether the change happened between v0.2 and v0.3 or between v0.3 and v0.4.)

@alex2awesome

alex2awesome commented Jul 30, 2021

FYI, I'm also seeing this error for a custom metric when running PyTorch Lightning with DeepSpeed stage 2 as the accelerator, using torchmetrics==0.4.1:

import torch
from torchmetrics import Metric


class Perplexity(Metric):
    def __init__(self, device, stride=10, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        self.device = device
        self.add_state('lls', default=torch.tensor(0).to(float).to(self.device), dist_reduce_fx="sum")
        self.stride = stride

    def calculate_ppl_for_sequence(self, input_ids, labels, n_words_in_input, model):
        for word_idx in range(1, n_words_in_input, self.stride):
            input_t = _get_first_k_words(input_ids, word_idx + 1).clone()
            output_t = _get_first_k_words(labels, word_idx + 1).clone()
            _set_first_k_words(output_t, word_idx, -100)
            # todo: use torch.no_grad to make sure there's no gradient computation
            try:  # transformers version > 4.0.0
                loss, _, _ = model.forward(input_ids=input_t, labels=output_t, return_dict=False)
            except:
                loss, _, _ = model.forward(input_ids=input_t, labels=output_t)
            loss = loss.to(self.device)
            self.lls += loss

    def update(self, input_ids, model):
        labels = input_ids.clone()
        if len(labels.size()) > 1: # then, first dim is batch
            for input_i, labels_i in zip(input_ids, labels):
                n_words_in_input = input_i.size()[0]
                self.calculate_ppl_for_sequence(input_i, labels_i, n_words_in_input, model)
        else:
            n_words_in_input = labels.size()[0]
            self.calculate_ppl_for_sequence(input_ids, labels, n_words_in_input, model)

    def reset(self):
        self.lls = torch.tensor(0).to(float).to(self.device)

    def compute(self):
        return torch.exp(self.lls)

Stack trace:

  File "/job/.local/lib/python3.7/site-packages/fine_tuning/language_models.py", line 133, in validation_epoch_end
    ppl = self.perplexity.compute()
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/metric.py", line 368, in wrapped_func
    dist_sync_fn=self.dist_sync_fn, should_sync=self._to_sync, should_unsync=self._should_unsync
  File "/opt/bb/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/metric.py", line 342, in sync_context
    distributed_available=distributed_available
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/metric.py", line 288, in sync
    self._sync_dist(dist_sync_fn, process_group=process_group)
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/metric.py", line 228, in _sync_dist
    group=process_group or self.process_group,
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/utilities/data.py", line 195, in apply_to_collection
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/utilities/data.py", line 195, in <dictcomp>
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/utilities/data.py", line 191, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/utilities/distributed.py", line 124, in gather_all_tensors
    return _simple_gather_all_tensors(result, group, world_size)
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/utilities/distributed.py", line 94, in _simple_gather_all_tensors
    torch.distributed.all_gather(gathered_result, result, group)
  File "/job/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
    work = group.allgather([tensor_list], [tensor])
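
The trace above ends in torch.distributed.all_gather, and the error in the issue title says the tensors must be CUDA and dense. A minimal sketch of a check that could be run on a metric's states before syncing, assuming the NCCL backend is in use (the backend is not stated in the report). find_unsyncable_states is a hypothetical helper, not part of torchmetrics or PyTorch:

```python
import torch


def find_unsyncable_states(states):
    """Return the names of state tensors that are not CUDA or not dense.

    Hypothetical debugging helper; `states` maps a state name to a tensor.
    """
    bad = []
    for name, tensor in states.items():
        # Dense tensors use the strided layout; sparse ones do not.
        is_dense = tensor.layout == torch.strided
        if not (tensor.is_cuda and is_dense):
            bad.append(name)
    return bad


# A state created on the CPU is flagged; this runs without a GPU.
states = {"lls": torch.tensor(0.0, dtype=torch.float64)}
flagged = find_unsyncable_states(states)
```

In the Perplexity metric above, it could be called with {"lls": self.lls} just before compute() to see whether the state has ended up on the CPU.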

@andrewssobral

Hello @SkafteNicki,
I tested each version of torchmetrics (https://pypi.org/project/torchmetrics/#history). The error first appears in v0.3.0 and persists through the latest release (v0.4.1, as noted by @alex2awesome).

@alex2awesome

Is there an update on this issue?

@SkafteNicki
Member

So I tried to run the script provided by @andrewssobral but was not able to reproduce the error.
@alex2awesome could you try the master version of torchmetrics?

@alex2awesome

Hi @SkafteNicki, I tried the master version of torchmetrics, but unfortunately I am still seeing the same error.

Shall I provide code for you to try reproducing?

@SkafteNicki
Member

@alex2awesome yes please do so :]

@Quintulius

@SkafteNicki Here is a code sample that reproduces the bug. Should I create a new issue in the torchmetrics repository?

import os

import torch
from pytorch_lightning.utilities.types import EPOCH_OUTPUT
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from torchmetrics import Accuracy


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)
        self.labels = torch.randint(0, 2, (length, 2))

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.metric = Accuracy()

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        loss = preds.sum()  # reuse the forward pass computed above
        self.log("train_loss", loss)
        self.log("accuracy", self.metric(preds, y))
        return {"loss": loss}

    def training_epoch_end(self, outputs: EPOCH_OUTPUT) -> None:
        self.log("end_accuracy", self.metric.compute())

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = self(x).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        x, y = batch
        loss = self(x).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        gpus=1,
        accelerator="ddp",
        move_metrics_to_cpu=True
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()

Environment:

Collecting environment information...
PyTorch version: 1.10.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-6ubuntu2) 7.5.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-38-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 970
Nvidia driver version: 495.29.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] pytorch-lightning==1.4.9
[pip3] torch==1.10.0+cu113
[pip3] torchaudio==0.10.0+cu113
[pip3] torchmetrics==0.5.1
[pip3] torchvision==0.11.1+cu113
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.1            py38hd3c417c_0  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] numpy                     1.21.2           py38h20f2e39_0  
[conda] numpy-base                1.21.2           py38h79a1101_0  
[conda] pytorch-lightning         1.4.9                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch                     1.10.0+cu113             pypi_0    pypi
[conda] torchaudio                0.10.0+cu113             pypi_0    pypi
[conda] torchmetrics              0.5.1                    pypi_0    pypi
[conda] torchvision               0.11.1+cu113             pypi_0    pypi

@SkafteNicki
Member

@Quintulius your code fails because move_metrics_to_cpu=True forces the metric onto the CPU, so the metric update fails when the input is still on the GPU. This has more to do with Lightning than with torchmetrics, so let's open a new issue there.
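
To illustrate the mismatch described above, here is a minimal sketch in plain PyTorch of keeping a metric's state and its inputs on the same device. RunningAccuracy is a hypothetical stand-in written for this example; it is not the torchmetrics Accuracy class and does not reflect how Lightning handles move_metrics_to_cpu internally:

```python
import torch


class RunningAccuracy:
    """Hypothetical stand-in for a metric whose state lives on one device."""

    def __init__(self, device="cpu"):
        self.correct = torch.tensor(0, device=device)
        self.total = torch.tensor(0, device=device)

    def update(self, preds, target):
        # Move the inputs to wherever the state lives, so the in-place
        # additions below never mix devices.
        device = self.correct.device
        preds, target = preds.to(device), target.to(device)
        self.correct += (preds == target).sum()
        self.total += target.numel()

    def compute(self):
        return self.correct.float() / self.total


# Inputs sit on the GPU when one is available; the state stays on the CPU.
input_device = "cuda" if torch.cuda.is_available() else "cpu"
metric = RunningAccuracy(device="cpu")
preds = torch.tensor([1, 0, 1, 1], device=input_device)
target = torch.tensor([1, 0, 0, 1], device=input_device)
metric.update(preds, target)
accuracy = metric.compute()
```

The other direction, moving the metric to the input's device, would work as well. Which side should move is a design choice this sketch does not settle.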

@Quintulius

@SkafteNicki Thanks, done in #10379
