[metrics] Accuracy Metric: Tensors must be CUDA and dense #2205

xiadingZ · 2020-06-16T02:29:09Z

I tried the new Accuracy metric, but it throws an error:

Traceback (most recent call last):
  File "main.py", line 139, in <module>
    main(hparams)
  File "main.py", line 69, in main
    trainer.fit(model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 820, in fit
    self.ddp_train(task, model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 502, in ddp_train
    self.run_pretrain_routine(model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in run_pretrain_routine
    False)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 278, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 418, in evaluation_forward
    output = model(*args)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 558, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 96, in forward
    output = self.module.validation_step(*inputs[0], **kwargs[0])
  File "/mnt/lustre/maxiao1/PVM/models/baseline.py", line 374, in validation_step
    acc = self.accuracy(labels_hat, labels)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/metrics/metric.py", line 147, in __call__
    return apply_to_collection(self._orig_call(*args, **kwargs), torch.Tensor,
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/metrics/converters.py", line 59, in new_func
    return func_to_apply(result, *dec_args, **dec_kwargs)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 26, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/metrics/converters.py", line 244, in _sync_ddp_if_available
    async_op=False)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

This is my code:

            pred = pred.view(-1, pred.shape[-1])
            labels = labels.view(-1)
            valid_index = torch.where(labels != -1)
            # select valid part to calculate
            pred = pred[valid_index].contiguous()
            labels = labels[valid_index].contiguous()
            loss = self.loss_fn(pred, labels)
            labels_hat = torch.argmax(pred, dim=1).type_as(labels)
            acc = self.accuracy(labels_hat, labels)

I also have a question: TensorMetric's default reduce_op is SUM. Does it automatically calculate the average accuracy?


justusschock · 2020-06-16T11:37:36Z

1.) What are your devices for labels_hat and labels? Are you running in a DDP environment?

2.) No, it doesn't. It does what it says (calculates the sum). Unfortunately, there is no DDP reduction op that calculates the average, so for averaging you still need to divide by the size of your process group.
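
A minimal sketch of the division described in 2.), assuming plain torch.distributed is used for the reduction. The helper name all_reduce_mean is hypothetical and not part of pytorch_lightning. With the NCCL backend, the tensor passed in would also need to live on a CUDA device.

```python
import torch
import torch.distributed as dist


def all_reduce_mean(value: torch.Tensor) -> torch.Tensor:
    """Sum `value` over all processes, then divide by the process-group size.

    Hypothetical helper for illustration only.
    """
    # Outside a distributed run there is nothing to reduce.
    if not (dist.is_available() and dist.is_initialized()):
        return value
    # all_reduce works in place, so reduce a copy and keep the input intact.
    reduced = value.clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    # SUM followed by a division by the world size gives the mean.
    return reduced / dist.get_world_size()


# In a single, non-distributed process the value is returned unchanged.
acc = all_reduce_mean(torch.tensor(0.75))
```

Averaging per-process accuracies this way only matches the accuracy over all samples when every process evaluates the same number of samples, which may not hold when invalid labels are filtered out per batch.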

xiadingZ · 2020-06-16T11:47:34Z

This is my code:

            imgs = batch['imgs']
            labels = batch['labels']
            result = self(imgs)

            pred = result['total']
            pred = pred.view(-1, pred.shape[-1])
            labels = labels.view(-1)
            valid_index = torch.where(labels != -1)
            # select valid part to calculate
            pred = pred[valid_index]
            labels = labels[valid_index]
            loss = self.loss_fn(pred, labels)
            labels_hat = torch.argmax(pred, dim=1).type_as(labels)
            acc = self.accuracy(labels_hat, labels)

I'm running in a DDP environment. I think labels are automatically transferred to one GPU device, and I use type_as to ensure labels_hat is on the same device as labels.
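
One way to check the device question directly is to print the relevant properties of both tensors just before the metric call. This is a debugging sketch only: describe is a hypothetical helper, and the tensors below are stand-ins for the real labels_hat and labels. NCCL collectives accept only dense CUDA tensors, so a cpu device or a sparse layout would be consistent with the error.

```python
import torch


def describe(name: str, t: torch.Tensor) -> str:
    """Summarise the tensor properties that matter for a collective op."""
    return (
        f"{name}: device={t.device}, layout={t.layout}, "
        f"sparse={t.is_sparse}, contiguous={t.is_contiguous()}"
    )


# Stand-in tensors; in validation_step these would be labels_hat and labels.
labels = torch.tensor([0, 1, 1])
labels_hat = torch.tensor([0.1, 0.9, 0.8]).round().type_as(labels)

print(describe("labels_hat", labels_hat))
print(describe("labels", labels))
```

The traceback fails while all-reducing the value returned by the metric, so it may also be worth checking the tensors the metric creates internally, not only its inputs.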

justusschock · 2020-06-16T11:51:39Z

Can you try calling .contiguous() on the tensors beforehand?

xiadingZ · 2020-06-16T11:53:22Z

I tried on labels and labels_hat, but it doesn't work

justusschock · 2020-06-16T11:53:53Z

do you use sparse tensors?

xiadingZ · 2020-06-16T11:54:42Z

No

xiadingZ · 2020-06-16T11:59:43Z

I also think the docs should include an example for 2). The current code example in the docs is:

# PyTorch Lightning
class MyModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.metric = Accuracy()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = ...
        acc = self.metric(y_hat, y)

It says this can run in DDP mode, but it doesn't say that we have to divide by the size of the process group by hand when using DDP.

justusschock · 2020-06-16T12:00:53Z

But it also does not state that it calculates the mean. I will have a look at how much work it would be to integrate this.

SkafteNicki · 2020-09-01T18:02:47Z

@xiadingZ are you still facing the RuntimeError: Tensors must be CUDA and dense error?
Your second point, about dividing the result by the process group size, can be achieved by setting the reduce_op argument to either avg or mean (solved by PR #2568).

edenlightning · 2020-09-16T18:31:56Z

closing this. please comment if this needs to be reopened.

wconnell · 2020-10-03T21:23:33Z

@xiadingZ are you still facing the RuntimeError: Tensors must be CUDA and dense error?

I am running into this issue, using R2Score metric. Same traceback.

SkafteNicki · 2020-10-05T13:11:47Z

@wconnell I am not able to reproduce this on master using R2Score. Do you have a code example that can reproduce the error?

limberc · 2021-04-05T16:06:45Z

Got the same problem.

I believe some value was assigned or computed as NaN in this case.

It was solved once no NaN was recorded.
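
Whether NaN values are related to the error is not confirmed in this thread. A small sketch for checking inputs before they reach a metric, assuming floating-point predictions; has_nan is a hypothetical helper and the tensors are made-up examples.

```python
import torch


def has_nan(t: torch.Tensor) -> bool:
    """Return True if any element of the tensor is NaN."""
    return bool(torch.isnan(t).any())


# Made-up values for illustration only.
preds = torch.tensor([0.2, float("nan"), 0.7])
target = torch.tensor([0.0, 1.0, 1.0])

if has_nan(preds):
    # Keep only the positions where the prediction is a real number.
    keep = ~torch.isnan(preds)
    clean_preds, clean_target = preds[keep], target[keep]
else:
    clean_preds, clean_target = preds, target
```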

andrewssobral · 2021-07-30T12:52:26Z

Hello All,

I have the same issue by running the following code:
https://gist.github.com/andrewssobral/090dcab34308bdd1ed75e5f2f6b4a1d0

For info, this code was tested with the previous version of PyTorch Lightning and had no issues.

This is the output I got:

$ python pytorch_lightning_distributed_training.py --accelerator ddp --gpus 1 --max_epochs 3
Namespace(accelerator='ddp', accumulate_grad_batches=1, amp_backend='native', amp_level='O2', auto_lr_find=False, auto_scale_batch_size=False, auto_select_gpus=False, batch_size=64, benchmark=False, check_val_every_n_epoch=1, checkpoint_callback=True, dataset_path='./', default_root_dir=None, deterministic=False, devices=None, distributed_backend=None, fast_dev_run=False, flush_logs_every_n_steps=100, gpus=1, gradient_clip_algorithm='norm', gradient_clip_val=0.0, ipus=None, learning_rate=0.0002, limit_predict_batches=1.0, limit_test_batches=1.0, limit_train_batches=1.0, limit_val_batches=1.0, log_every_n_steps=50, log_gpu_memory=None, logger=True, max_epochs=3, max_steps=None, max_time=None, min_epochs=None, min_steps=None, model_checkpoint_enabled=False, model_checkpoint_path='checkpoints/', move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', num_nodes=1, num_processes=1, num_sanity_val_steps=2, optimizer='adam', overfit_batches=0.0, plugins=None, precision=32, prepare_data_per_node=True, process_position=0, profiler=None, progress_bar_refresh_rate=None, reload_dataloaders_every_epoch=False, reload_dataloaders_every_n_epochs=0, replace_sampler_ddp=True, resume_from_checkpoint=None, stochastic_weight_avg=False, sync_batchnorm=False, tensorboard_enabled=False, tensorboard_logdir='logs/', terminate_on_nan=False, tpu_cores=None, track_grad_norm=-1, truncated_bptt_steps=None, val_check_interval=1.0, weights_save_path=None, weights_summary='top')
Using existing MNIST data set at ./MNIST
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name             | Type       | Params
------------------------------------------------
0 | train_acc_metric | Accuracy   | 0     
1 | val_acc_metric   | Accuracy   | 0     
2 | model            | Sequential | 55.1 K
------------------------------------------------
55.1 K    Trainable params
0         Non-trainable params
55.1 K    Total params
0.220     Total estimated model params size (MB)
Validation sanity check:  50%|██████████████████████████████████████████████████████████████████████▌                                                                      | 1/2 [00:00<00:00,  2.51it/s]Epoch validation acc:  tensor(0.0469, device='cuda:0')
Epoch 0:   0%|                                                                                                                                                         | 0/939 [00:00<00:00, 5722.11it/s][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 143.42it/s, loss=0.346, v_num=][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)                                                                                  | 0/79 [00:00<?, ?it/s]
Epoch 0:  99%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 928/939 [00:06<00:00, 139.63it/s, loss=0.346, v_num=]Epoch validation acc:  tensor(0.9118, device='cuda:0')██████████████████████████████████████████████████▉                                                               | 46/79 [00:00<00:00, 109.08it/s]
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 939/939 [00:06<00:00, 139.06it/s, loss=0.346, v_num=]Epoch training acc:  tensor(0.8115, device='cuda:0')                                                                                                                                                     
Epoch 1:   0%|                                                                                                                                     | 0/939 [00:00<00:00, 5102.56it/s, loss=0.346, v_num=][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 1:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 141.58it/s, loss=0.233, v_num=][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)                                                                                  | 0/79 [00:00<?, ?it/s]
Epoch 1:  99%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 928/939 [00:06<00:00, 137.81it/s, loss=0.233, v_num=]Epoch validation acc:  tensor(0.9348, device='cuda:0')████████████████████████████████████████████████████▊                                                             | 47/79 [00:00<00:00, 105.55it/s]
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 939/939 [00:06<00:00, 137.19it/s, loss=0.233, v_num=]Epoch training acc:  tensor(0.9109, device='cuda:0')                                                                                                                                                     
Epoch 2:   0%|                                                                                                                                     | 0/939 [00:00<00:00, 4604.07it/s, loss=0.233, v_num=][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Epoch 2:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 137.53it/s, loss=0.224, v_num=][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)                                                                                  | 0/79 [00:00<?, ?it/s]
Epoch 2:  99%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 928/939 [00:06<00:00, 134.71it/s, loss=0.224, v_num=]Epoch validation acc:  tensor(0.9428, device='cuda:0')████████████████████████████████████████████████████████████████▎                                                 | 53/79 [00:00<00:00, 124.79it/s]
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 939/939 [00:07<00:00, 134.27it/s, loss=0.224, v_num=]Epoch training acc:  tensor(0.9289, device='cuda:0')                                                                                                                                                     
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 939/939 [00:07<00:00, 134.07it/s, loss=0.224, v_num=]
/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/prints.py:37: UserWarning: The ``compute`` method of metric Accuracy was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "pytorch_lightning_distributed_training.py", line 280, in <module>
    main(args)
  File "pytorch_lightning_distributed_training.py", line 249, in main
    print("Metrics:\n", model.get_metrics())
  File "pytorch_lightning_distributed_training.py", line 115, in get_metrics
    train_acc = self.train_acc_metric.compute()
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/metric.py", line 368, in wrapped_func
    dist_sync_fn=self.dist_sync_fn, should_sync=self._to_sync, should_unsync=self._should_unsync
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/metric.py", line 342, in sync_context
    distributed_available=distributed_available
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/metric.py", line 288, in sync
    self._sync_dist(dist_sync_fn, process_group=process_group)
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/metric.py", line 228, in _sync_dist
    group=process_group or self.process_group,
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/data.py", line 195, in apply_to_collection
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/data.py", line 195, in <dictcomp>
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/data.py", line 191, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/distributed.py", line 124, in gather_all_tensors
    return _simple_gather_all_tensors(result, group, world_size)
  File "/home/proactive/.local/lib/python3.6/site-packages/torchmetrics/utilities/distributed.py", line 94, in _simple_gather_all_tensors
    torch.distributed.all_gather(gathered_result, result, group)
  File "/home/proactive/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense

andrewssobral · 2021-07-30T13:31:10Z

My reported issue disappears when I downgrade torchmetrics to version 0.2.0, as you can see below.

Does anyone know how I can fix it for torchmetrics > 0.2.0?

$ python pytorch_lightning_distributed_training.py --accelerator ddp --gpus 1 --max_epochs 3
Namespace(accelerator='ddp', accumulate_grad_batches=1, amp_backend='native', amp_level='O2', auto_lr_find=False, auto_scale_batch_size=False, auto_select_gpus=False, batch_size=64, benchmark=False, check_val_every_n_epoch=1, checkpoint_callback=True, dataset_path='./', default_root_dir=None, deterministic=False, distributed_backend=None, fast_dev_run=False, flush_logs_every_n_steps=100, gpus=1, gradient_clip_algorithm='norm', gradient_clip_val=0.0, learning_rate=0.0002, limit_predict_batches=1.0, limit_test_batches=1.0, limit_train_batches=1.0, limit_val_batches=1.0, log_every_n_steps=50, log_gpu_memory=None, logger=True, max_epochs=3, max_steps=None, max_time=None, min_epochs=None, min_steps=None, model_checkpoint_enabled=False, model_checkpoint_path='checkpoints/', move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', num_nodes=1, num_processes=1, num_sanity_val_steps=2, optimizer='adam', overfit_batches=0.0, plugins=None, precision=32, prepare_data_per_node=True, process_position=0, profiler=None, progress_bar_refresh_rate=None, reload_dataloaders_every_epoch=False, replace_sampler_ddp=True, resume_from_checkpoint=None, stochastic_weight_avg=False, sync_batchnorm=False, tensorboard_enabled=False, tensorboard_logdir='logs/', terminate_on_nan=False, tpu_cores=None, track_grad_norm=-1, truncated_bptt_steps=None, val_check_interval=1.0, weights_save_path=None, weights_summary='top')
Using existing MNIST data set at ./MNIST
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name             | Type       | Params
------------------------------------------------
0 | train_acc_metric | Accuracy   | 0     
1 | val_acc_metric   | Accuracy   | 0     
2 | model            | Sequential | 55.1 K
------------------------------------------------
55.1 K    Trainable params
0         Non-trainable params
55.1 K    Total params
0.220     Total estimated model params size (MB)
Validation sanity check:  50%|██████████████████████████████████████████████████████████████████████▌                                                                      | 1/2 [00:00<00:00,  1.02it/s]Epoch validation acc:  tensor(0.1016, device='cuda:0')
Epoch 0:   0%|                                                                                                                                                                   | 0/939 [00:00<?, ?it/s][W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 142.59it/s, loss=0.333, v_num=]Epoch training acc:  tensor(0.8189, device='cuda:0')
Epoch 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍| 935/939 [00:07<00:00, 128.85it/s, loss=0.333, v_num=]Epoch validation acc:  tensor(0.9130, device='cuda:0')██████████████████████████████████████████████████████████████████████████▌                                        | 58/79 [00:00<00:00, 96.10it/s]
Epoch 1:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 139.76it/s, loss=0.275, v_num=]Epoch training acc:  tensor(0.9116, device='cuda:0')                                                                                                                                                     
Epoch 1:  97%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   | 912/939 [00:07<00:00, 123.82it/s, loss=0.275, v_num=]Epoch validation acc:  tensor(0.9328, device='cuda:0')█████████████████████████████████████████████████████████████████████████████████▌                                | 62/79 [00:00<00:00, 100.34it/s]
Epoch 2:  92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉           | 860/939 [00:06<00:00, 138.73it/s, loss=0.211, v_num=]Epoch training acc:  tensor(0.9299, device='cuda:0')                                                                                                                                                     
Epoch 2:  97%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   | 912/939 [00:07<00:00, 122.98it/s, loss=0.211, v_num=]Epoch validation acc:  tensor(0.9446, device='cuda:0')██████████████████████████████████████████████████████████████████▉                                                | 54/79 [00:00<00:00, 87.77it/s]
Epoch 2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 939/939 [00:07<00:00, 123.56it/s, loss=0.211, v_num=]
Training accuracy on all data: 0.9299454689025879                                                                                                                                                        
Validation accuracy on all data: 0.944599986076355
Metrics:
 {'train_acc': 0.9299454689025879, 'val_acc': 0.944599986076355}

SkafteNicki · 2021-07-30T15:35:10Z

@andrewssobral which version of torchmetrics did you run in the first example? (Just trying to figure out whether the change happened between v0.2 and v0.3 or between v0.3 and v0.4.)

alex2awesome · 2021-07-30T17:59:04Z

FYI, I'm also seeing this error for a custom metric when running PyTorch Lightning with DeepSpeed stage 2 as the accelerator, using torchmetrics==0.4.1:

import torch
from torchmetrics import Metric


class Perplexity(Metric):
    def __init__(self, device, stride=10, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        self.device = device
        self.add_state('lls', default=torch.tensor(0).to(float).to(self.device), dist_reduce_fx="sum")
        self.stride = stride

    def calculate_ppl_for_sequence(self, input_ids, labels, n_words_in_input, model):
        for word_idx in range(1, n_words_in_input, self.stride):
            input_t = _get_first_k_words(input_ids, word_idx + 1).clone()
            output_t = _get_first_k_words(labels, word_idx + 1).clone()
            _set_first_k_words(output_t, word_idx, -100)
            # todo: use torch.no_grad to make sure there's no gradient computation
            try:  # transformers version > 4.0.0
                loss, _, _ = model.forward(input_ids=input_t, labels=output_t, return_dict=False)
            except:
                loss, _, _ = model.forward(input_ids=input_t, labels=output_t)
            loss = loss.to(self.device)
            self.lls += loss

    def update(self, input_ids, model):
        labels = input_ids.clone()
        if len(labels.size()) > 1: # then, first dim is batch
            for input_i, labels_i in zip(input_ids, labels):
                n_words_in_input = input_i.size()[0]
                self.calculate_ppl_for_sequence(input_i, labels_i, n_words_in_input, model)
        else:
            n_words_in_input = labels.size()[0]
            self.calculate_ppl_for_sequence(input_ids, labels, n_words_in_input, model)

    def reset(self):
        self.lls = torch.tensor(0).to(float).to(self.device)

    def compute(self):
        return torch.exp(self.lls)

Stack trace:

  File "/job/.local/lib/python3.7/site-packages/fine_tuning/language_models.py", line 133, in validation_epoch_end
    ppl = self.perplexity.compute()
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/metric.py", line 368, in wrapped_func
    dist_sync_fn=self.dist_sync_fn, should_sync=self._to_sync, should_unsync=self._should_unsync
  File "/opt/bb/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/metric.py", line 342, in sync_context
    distributed_available=distributed_available
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/metric.py", line 288, in sync
    self._sync_dist(dist_sync_fn, process_group=process_group)
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/metric.py", line 228, in _sync_dist
    group=process_group or self.process_group,
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/utilities/data.py", line 195, in apply_to_collection
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/utilities/data.py", line 195, in <dictcomp>
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/utilities/data.py", line 191, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/utilities/distributed.py", line 124, in gather_all_tensors
    return _simple_gather_all_tensors(result, group, world_size)
  File "/job/.local/lib/python3.7/site-packages/torchmetrics/utilities/distributed.py", line 94, in _simple_gather_all_tensors
    torch.distributed.all_gather(gathered_result, result, group)
  File "/job/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
    work = group.allgather([tensor_list], [tensor])
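
A note for anyone hitting a similar trace: the last frame is torch.distributed.all_gather, and the error in the issue title ("Tensors must be CUDA and dense") suggests that at least one metric state was not a dense CUDA tensor when the sync ran. The following is only a diagnostic sketch, assuming the NCCL backend; the helper name `non_cuda_dense` and the example state dict are hypothetical and not part of torchmetrics.

```python
import torch


def non_cuda_dense(states):
    """Return the names of tensors that are not dense CUDA tensors.

    `states` is a plain dict of name -> tensor (hypothetical; build it
    yourself from the metric attributes you registered with add_state).
    """
    bad = []
    for name, tensor in states.items():
        # Assumption: the backend only accepts strided (dense) CUDA tensors.
        if not tensor.is_cuda or tensor.layout != torch.strided:
            bad.append(name)
    return bad


if __name__ == "__main__":
    # A state created on the CPU, as it would be if the device argument were "cpu".
    states = {"lls": torch.tensor(0.0)}
    print("states that would likely fail all_gather:", non_cuda_dense(states))
```

Calling this just before `compute()` with something like `{"lls": self.perplexity.lls}` should show whether the state ended up on the CPU.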

andrewssobral · 2021-07-30T18:05:41Z

Hello @SkafteNicki ,
I tested each version of torchmetrics (https://pypi.org/project/torchmetrics/#history): the error first appears in v0.3.0 and persists through the latest release (v0.4.1, as stated by @alex2awesome).

alex2awesome · 2021-08-09T23:18:18Z

Is there an update on this issue?

SkafteNicki · 2021-08-10T09:25:06Z

So I tried to run the script provided by @andrewssobral but was not able to reproduce the error.
@alex2awesome could you try master version of torchmetrics?

alex2awesome · 2021-08-10T21:13:16Z

Hi @SkafteNicki, I tried the master version of torchmetrics, but unfortunately I am seeing the same error.

Shall I provide code for you to try reproducing?

SkafteNicki · 2021-08-11T05:30:44Z

@alex2awesome yes please do so :]

Quintulius · 2021-11-02T08:32:16Z

@SkafteNicki Please find below a code sample reproducing the bug. Should I create a new issue in the torchmetrics repository?

import os

import torch
from pytorch_lightning.utilities.types import EPOCH_OUTPUT
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from torchmetrics import Accuracy


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)
        self.labels = torch.randint(0, 2, (length, 2))

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.metric = Accuracy()

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        loss = self(x).sum()
        self.log("train_loss", loss)
        self.log("accuracy", self.metric(preds, y))
        return {"loss": loss}

    def training_epoch_end(self, outputs: EPOCH_OUTPUT) -> None:
        self.log("end_accuracy", self.metric.compute())

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = self(x).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        x, y = batch
        loss = self(x).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        gpus=1,
        accelerator="ddp",
        move_metrics_to_cpu=True
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()

Environment:

Collecting environment information...
PyTorch version: 1.10.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-6ubuntu2) 7.5.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-38-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 970
Nvidia driver version: 495.29.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] pytorch-lightning==1.4.9
[pip3] torch==1.10.0+cu113
[pip3] torchaudio==0.10.0+cu113
[pip3] torchmetrics==0.5.1
[pip3] torchvision==0.11.1+cu113
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.1            py38hd3c417c_0  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] numpy                     1.21.2           py38h20f2e39_0  
[conda] numpy-base                1.21.2           py38h79a1101_0  
[conda] pytorch-lightning         1.4.9                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch                     1.10.0+cu113             pypi_0    pypi
[conda] torchaudio                0.10.0+cu113             pypi_0    pypi
[conda] torchmetrics              0.5.1                    pypi_0    pypi
[conda] torchvision               0.11.1+cu113             pypi_0    pypi

SkafteNicki · 2021-11-04T10:54:07Z

@Quintulius your code fails because of move_metrics_to_cpu=True, which forces the metrics onto the CPU. The metric update then fails because the input is still on the GPU. This has more to do with Lightning than with torchmetrics, so let's open a new issue there.
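
To illustrate the mismatch described above: if the metric state lives on the CPU while the batch is on the GPU, one possible workaround is to move the inputs to the state's device before updating. This is a rough sketch and not the fix adopted in Lightning; `to_state_device` is a hypothetical helper, and a plain tensor stands in for the metric state.

```python
import torch


def to_state_device(state, *tensors):
    """Move every input tensor to the device where `state` lives."""
    return tuple(t.to(state.device) for t in tensors)


# Stand-in for a metric state that has been forced onto the CPU.
state = torch.zeros(1)

# Inputs live on the GPU when one is available, otherwise on the CPU.
input_device = "cuda" if torch.cuda.is_available() else "cpu"
preds = torch.tensor([0.2, 0.8], device=input_device)
target = torch.tensor([0, 1], device=input_device)

# After this call the inputs and the state share a device,
# so an in-place accumulation into the state no longer mixes devices.
preds, target = to_state_device(state, preds, target)
state += (preds.round().long() == target).sum()
```

The opposite direction (moving the metric to the input's device) may be preferable when the metric is synced across processes, since the sync in the traces above expects CUDA tensors.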

Quintulius · 2021-11-05T20:16:06Z

@SkafteNicki Thanks, done in #10379

xiadingZ added bug Something isn't working help wanted Open to be worked on labels Jun 16, 2020

Borda added the priority: 0 High priority task label Jun 16, 2020

Borda assigned SkafteNicki Jun 16, 2020

edenlightning changed the title ~~Accuracy Metric: Tensors must be CUDA and dense~~ [metrics] Accuracy Metric: Tensors must be CUDA and dense Jun 17, 2020

SkafteNicki mentioned this issue Jul 6, 2020

New modular metric interface #2528

Merged

7 tasks

edenlightning added the Metrics label Aug 3, 2020

edenlightning added this to the 0.9.x milestone Sep 16, 2020

edenlightning closed this as completed Sep 16, 2020
