[metrics] Accuracy Metric: Tensors must be CUDA and dense #2205
Comments
1.) What devices are labels_hat and labels on? Are you running in a DDP environment? 2.) No, it doesn't. It does what it says (calculates the sum). Unfortunately, there is no DDP reduction op that calculates the average, so for averaging you still need to divide by the size of your process group. |
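To illustrate point 2, here is a minimal pure-Python sketch of a SUM reduction followed by a manual division by the process-group size. The helpers `all_reduce_sum`, `mean_from_sum`, and `global_accuracy` are hypothetical names made up for this example. They only simulate what `torch.distributed.all_reduce` with `ReduceOp.SUM` would leave on each rank and are not part of Lightning's metrics API.

```python
# Hypothetical stand-in for a SUM all-reduce: each list entry is one rank's value.
def all_reduce_sum(per_rank_values):
    """Simulate ReduceOp.SUM: every rank ends up holding the total."""
    total = sum(per_rank_values)
    return [total] * len(per_rank_values)


def mean_from_sum(per_rank_values):
    """SUM-reduce, then divide by the world size by hand to get the average."""
    world_size = len(per_rank_values)
    reduced = all_reduce_sum(per_rank_values)
    return [value / world_size for value in reduced]


def global_accuracy(correct_per_rank, total_per_rank):
    """SUM-reduce the counts of correct and seen samples, then divide once."""
    correct = all_reduce_sum(correct_per_rank)[0]
    total = all_reduce_sum(total_per_rank)[0]
    return correct / total


per_rank_acc = [0.5, 1.0]               # accuracy computed locally on two ranks
summed = all_reduce_sum(per_rank_acc)   # what a plain SUM reduction yields
averaged = mean_from_sum(per_rank_acc)  # after dividing by the world size
```

Averaging per-rank accuracies only matches the true accuracy when every rank saw the same number of samples. Summing the correct and total counts, as in `global_accuracy`, avoids that assumption.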
This is my code:
I think I'm running in a DDP environment. |
can you try to call |
I tried on |
do you use sparse tensors? |
No |
Also, I think the docs should include an example for 2). The current code example in the docs is
and it says it can run in DDP mode, but it doesn't say that we have to divide by the size of the process group by hand when using DDP. |
But it also does not state that it calculates the mean. I will have a look at how much work it is to integrate this. |
closing this. please comment if this needs to be reopened. |
I am running into this issue, using |
@wconnell I am not able to reproduce this on master using |
Got the same problem. I believe some value had been assigned or computed as NaN in this case. It was solved once no NaN was recorded. |
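If you want to rule out the NaN explanation, a check along these lines before the metric update is one option. This is only a sketch on plain Python floats: `nan_positions` and `assert_no_nan` are hypothetical helpers, not torchmetrics functions. On real tensors the equivalent test would be `torch.isnan(t).any()`.

```python
import math


def nan_positions(values):
    """Return the indices of NaN entries in a flat sequence of floats."""
    return [i for i, v in enumerate(values) if math.isnan(v)]


def assert_no_nan(name, values):
    """Raise a descriptive error instead of passing NaN values on to a metric."""
    bad = nan_positions(values)
    if bad:
        raise ValueError(f"{name} contains NaN at indices {bad}")


preds = [0.2, float("nan"), 0.9]   # example batch with one bad value
bad_indices = nan_positions(preds)
```

Calling `assert_no_nan("preds", preds)` on this batch raises a `ValueError` naming index 1, which points at the offending sample before the metric state is touched.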
Hello all, I have the same issue when running the following code: For info, this code was tested with the previous version of PyTorch Lightning and had no issues. This is the output I got:
|
My reported issue disappears when I downgrade torchmetrics to version 0.2.0, as you can see below: Does anyone know how I can fix it for torchmetrics>=0.2.0?
|
@andrewssobral which version of torchmetrics did you run in the first example? (Just trying to figure out whether the change happened between v0.2 and v0.3 or between v0.3 and v0.4.) |
FYI, I'm also seeing this error for a custom metric when running PyTorch Lightning with DeepSpeed stage 2 as an accelerator, using
Stack trace:
|
Hello @SkafteNicki , |
Is there an update on this issue? |
So I tried to run the script provided by @andrewssobral but was not able to reproduce the error. |
Hi @SkafteNicki, I tried the master version of torchmetrics; unfortunately, I am seeing the same error. Shall I provide code for you to try reproducing it? |
@alex2awesome yes please do so :] |
@SkafteNicki Please find below a code sample reproducing the bug. Should I create a new issue in the torchmetrics repository?

import os

import torch
from pytorch_lightning.utilities.types import EPOCH_OUTPUT
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer
from torchmetrics import Accuracy


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)
        self.labels = torch.randint(0, 2, (length, 2))

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.metric = Accuracy()

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        loss = self(x).sum()
        self.log("train_loss", loss)
        self.log("accuracy", self.metric(preds, y))
        return {"loss": loss}

    def training_epoch_end(self, outputs: EPOCH_OUTPUT) -> None:
        self.log("end_accuracy", self.metric.compute())

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = self(x).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        x, y = batch
        loss = self(x).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        gpus=1,
        accelerator="ddp",
        move_metrics_to_cpu=True
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()

Environment:
|
@Quintulius your code fails due to |
@SkafteNicki Thanks, done in #10379 |
I tried the new Accuracy metric, but it throws an error:
This is my code:
I also have a question: TensorMetric's default reduce_op is SUM. Does it automatically calculate the average accuracy?