
[MNT] windows compatibility #1623

Open
fkiraly opened this issue Aug 25, 2024 · 7 comments
Labels
maintenance Continuous integration, unit testing & package distribution

Comments

fkiraly (Collaborator) commented Aug 25, 2024

Tests currently fail on Windows (windows-latest).

fkiraly added the maintenance (Continuous integration, unit testing & package distribution) label Aug 25, 2024
benHeid (Collaborator) commented Aug 29, 2024

The libuv issue seems to be introduced by torch 2.4.0:

Recently, we have rolled out a new TCPStore server backend using libuv, a third-party library for asynchronous I/O. This new server backend aims to address scalability and robustness challenges in large-scale distributed training jobs, such as those with more than 1024 ranks. We ran a series of benchmarks to compare the libuv backend against the old one, and the experiment results demonstrated significant improvements in store initialization time and maintained a comparable performance in store I/O operations.

As a result of these findings, the libuv backend has been set as the default TCPStore server backend in PyTorch 2.4. This change is expected to enhance the performance and scalability of distributed training jobs.

Source: https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html

Let me try to figure out how to configure it correctly with libuv...
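
For illustration (this sketch is not from the thread), here is what the quoted default change means in code: torch 2.4's TCPStore takes a use_libuv keyword that now defaults to True. The host and ports below are placeholders; on a Windows build without libuv support, the first call is exactly the kind of thing that fails.

from datetime import timedelta
from torch.distributed import TCPStore

# torch 2.4 default: libuv server backend
# (raises RuntimeError if PyTorch was built without libuv support)
store = TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                 timeout=timedelta(seconds=30))

# explicit opt-out: the pre-2.4 server backend
legacy_store = TCPStore("127.0.0.1", 29501, world_size=1, is_master=True,
                        timeout=timedelta(seconds=30), use_libuv=False)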

benHeid (Collaborator) commented Aug 29, 2024

So, according to the tutorial, it should be possible to switch libuv off by (see the sketch below):

  • setting use_libuv=False when creating dist.TCPStore -> not applicable, since we do not create the store directly;
  • setting init_method=f"tcp://{addr}:{port}?use_libuv=0" in dist.init_process_group -> unfortunately, we have no direct control over this, since it happens inside PyTorch Lightning;
  • setting os.environ["USE_LIBUV"] = "0" -> I do not want to do something like that... :/

Another option would be not to test with the DDP strategy at all, or to downgrade PyTorch. Unfortunately, I have no Windows system right now, so I cannot produce a minimal example to open an issue at pytorch-lightning asking them to expose the relevant parameters.
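
A rough sketch of the three opt-out routes from the list above; the first two are shown commented out because Lightning constructs the store internally, and addr/port are placeholders:

import os
import torch.distributed as dist

# (1) direct store creation -- not applicable, Lightning does this for us:
# store = dist.TCPStore(addr, port, world_size=1, is_master=True, use_libuv=False)

# (2) query parameter on init_method -- also buried inside Lightning:
# dist.init_process_group("gloo", rank=0, world_size=1,
#                         init_method=f"tcp://{addr}:{port}?use_libuv=0")

# (3) environment variable that torch 2.4 reads when creating the TCPStore;
# must be set before any distributed initialization happens
os.environ["USE_LIBUV"] = "0"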

fkiraly (Collaborator, Author) commented Aug 29, 2024

I do have a Windows system. Can you be specific about what we'd need: just an MRE for the failure, or something more specific?

benHeid (Collaborator) commented Aug 29, 2024

You might check whether the following fails with PyTorch 2.4.0:

import pytorch_lightning as pl
import numpy as np
import torch
from torch.nn import MSELoss
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
import torch.nn as nn


class SimpleDataset(Dataset):
    def __init__(self):
        X = np.arange(10000)
        y = X * 2
        X = [[_] for _ in X]
        y = [[_] for _ in y]
        self.X = torch.Tensor(X)
        self.y = torch.Tensor(y)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return {"X": self.X[idx], "y": self.y[idx]}


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 1)
        self.criterion = MSELoss()

    def forward(self, inputs_id, labels=None):
        outputs = self.fc(inputs_id)
        loss = 0
        if labels is not None:
            loss = self.criterion(outputs, labels)
        return loss, outputs

    def train_dataloader(self):
        dataset = SimpleDataset()
        return DataLoader(dataset, batch_size=1000)

    def training_step(self, batch, batch_idx):
        input_ids = batch["X"]
        labels = batch["y"]
        loss, outputs = self(input_ids, labels)
        return {"loss": loss}

    def configure_optimizers(self):
        optimizer = Adam(self.parameters())
        return optimizer


if __name__ == '__main__':
    model = MyModel()
    trainer = pl.Trainer(
        max_epochs=1, 
        accelerator="cpu",
        strategy="ddp")
    trainer.fit(model)

    X = torch.Tensor([[1.0], [51.0], [89.0]])
    _, y = model(X)
    print(y)
    

Hopefully this reproduces the issue with the ddp strategy.

fkiraly (Collaborator, Author) commented Aug 30, 2024

I can reproduce the error on Windows 11, torch 2.4.0, Python 3.10. Same failure; last lines of the traceback:

    return TCPStore(
RuntimeError: use_libuv was requested but PyTorch was build without libuv support

fkiraly added a commit that referenced this issue Aug 30, 2024
This PR skips tests involved in the failures on windows listed in #1623 until the underlying issues are resolved, see #1632 and #1632
benHeid (Collaborator) commented Aug 30, 2024

Ok, I would propose to open an issue at PyTorch Lightning. And perhaps remove the ddp strategy from testing, at least on Windows, or set the environment variable so that the old store is used instead of the libuv one. (Both mitigations are sketched below.)
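
A rough sketch of the two mitigations above, assuming a pytest-based suite; the marker name is hypothetical:

import os
import sys
import pytest

# Mitigation 1: skip ddp-strategy tests on Windows until this is resolved upstream
skip_ddp_on_windows = pytest.mark.skipif(
    sys.platform == "win32",
    reason="ddp strategy fails on Windows with torch 2.4 libuv TCPStore, see #1623",
)

# Mitigation 2: force the pre-2.4 (non-libuv) TCPStore backend;
# must run before Lightning initializes torch.distributed
os.environ.setdefault("USE_LIBUV", "0")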

fkiraly (Collaborator, Author) commented Aug 30, 2024

I've added a skip in #1631, but haven't closed this issue, as the skip of course does not causally solve the problem...
