
[MNT] windows compatibility #1623

Open
fkiraly opened this issue Aug 25, 2024 · 7 comments
Labels
maintenance Continuous integration, unit testing & package distribution

Comments

fkiraly (Collaborator) commented Aug 25, 2024

Tests currently fail on Windows (windows-latest).

fkiraly added the maintenance (Continuous integration, unit testing & package distribution) label Aug 25, 2024
benHeid (Collaborator) commented Aug 29, 2024

The libuv issue seems to be introduced by torch 2.4.0:

Recently, we have rolled out a new TCPStore server backend using libuv, a third-party library for asynchronous I/O. This new server backend aims to address scalability and robustness challenges in large-scale distributed training jobs, such as those with more than 1024 ranks. We ran a series of benchmarks to compare the libuv backend against the old one, and the experiment results demonstrated significant improvements in store initialization time and maintained a comparable performance in store I/O operations.

As a result of these findings, the libuv backend has been set as the default TCPStore server backend in PyTorch 2.4. This change is expected to enhance the performance and scalability of distributed training jobs.

Source: https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html

Let me try to figure out how to configure it correctly with libuv...
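
For illustration (this sketch is not from the thread), here is what the quoted default change means in code: torch 2.4's TCPStore takes a use_libuv keyword that now defaults to True. The host and ports below are placeholders; on a Windows build without libuv support, the first call is exactly the kind of thing that fails.

from datetime import timedelta
from torch.distributed import TCPStore

# torch 2.4 default: libuv server backend
# (raises RuntimeError if PyTorch was built without libuv support)
store = TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                 timeout=timedelta(seconds=30))

# explicit opt-out: the pre-2.4 server backend
legacy_store = TCPStore("127.0.0.1", 29501, world_size=1, is_master=True,
                        timeout=timedelta(seconds=30), use_libuv=False)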

benHeid (Collaborator) commented Aug 29, 2024

So, according to the tutorial, it should be possible to switch libuv off by (see the sketch below):

  • setting use_libuv=False when creating dist.TCPStore -> not applicable, since we do not create the store directly;
  • setting init_method=f"tcp://{addr}:{port}?use_libuv=0" in dist.init_process_group -> unfortunately, we have no direct control over this, since it happens inside PyTorch Lightning;
  • setting os.environ["USE_LIBUV"] = "0" -> I do not want to do something like that... :/

Another option would be not to test with the DDP strategy at all, or to downgrade PyTorch. Unfortunately, I have no Windows system right now, so I cannot produce a minimal example to open an issue at pytorch-lightning asking them to expose the relevant parameters.
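
A rough sketch of the three opt-out routes from the list above; the first two are shown commented out because Lightning constructs the store internally, and addr/port are placeholders:

import os
import torch.distributed as dist

# (1) direct store creation -- not applicable, Lightning does this for us:
# store = dist.TCPStore(addr, port, world_size=1, is_master=True, use_libuv=False)

# (2) query parameter on init_method -- also buried inside Lightning:
# dist.init_process_group("gloo", rank=0, world_size=1,
#                         init_method=f"tcp://{addr}:{port}?use_libuv=0")

# (3) environment variable that torch 2.4 reads when creating the TCPStore;
# must be set before any distributed initialization happens
os.environ["USE_LIBUV"] = "0"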

fkiraly (Collaborator, Author) commented Aug 29, 2024

I do have a Windows system. Can you be specific about what we'd need: just an MRE for the failure, or something more specific?

benHeid (Collaborator) commented Aug 29, 2024

You might check whether the following fails with PyTorch 2.4.0:

import pytorch_lightning as pl
import numpy as np
import torch
from torch.nn import MSELoss
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
import torch.nn as nn


class SimpleDataset(Dataset):
    def __init__(self):
        X = np.arange(10000)
        y = X * 2
        X = [[_] for _ in X]
        y = [[_] for _ in y]
        self.X = torch.Tensor(X)
        self.y = torch.Tensor(y)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return {"X": self.X[idx], "y": self.y[idx]}


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 1)
        self.criterion = MSELoss()

    def forward(self, inputs_id, labels=None):
        outputs = self.fc(inputs_id)
        loss = 0
        if labels is not None:
            loss = self.criterion(outputs, labels)
        return loss, outputs

    def train_dataloader(self):
        dataset = SimpleDataset()
        return DataLoader(dataset, batch_size=1000)

    def training_step(self, batch, batch_idx):
        input_ids = batch["X"]
        labels = batch["y"]
        loss, outputs = self(input_ids, labels)
        return {"loss": loss}

    def configure_optimizers(self):
        optimizer = Adam(self.parameters())
        return optimizer


if __name__ == '__main__':
    model = MyModel()
    trainer = pl.Trainer(
        max_epochs=1, 
        accelerator="cpu",
        strategy="ddp")
    trainer.fit(model)

    X = torch.Tensor([[1.0], [51.0], [89.0]])
    _, y = model(X)
    print(y)
    

Hopefully this reproduces the issue with the ddp strategy.

fkiraly (Collaborator, Author) commented Aug 30, 2024

I can reproduce the error on Windows 11, torch 2.4.0, Python 3.10. Same failure; last lines of the traceback:

    return TCPStore(
RuntimeError: use_libuv was requested but PyTorch was build without libuv support

fkiraly added a commit that referenced this issue Aug 30, 2024
This PR skips tests involved in the failures on windows listed in #1623 until the underlying issues are resolved, see #1632 and #1632
benHeid (Collaborator) commented Aug 30, 2024

Ok, I would propose to open an issue at PyTorch Lightning. And perhaps remove the ddp strategy from testing, at least on Windows, or set the environment variable so that the old store is used instead of the libuv one. (Both mitigations are sketched below.)
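
A rough sketch of the two mitigations above, assuming a pytest-based suite; the marker name is hypothetical:

import os
import sys
import pytest

# Mitigation 1: skip ddp-strategy tests on Windows until this is resolved upstream
skip_ddp_on_windows = pytest.mark.skipif(
    sys.platform == "win32",
    reason="ddp strategy fails on Windows with torch 2.4 libuv TCPStore, see #1623",
)

# Mitigation 2: force the pre-2.4 (non-libuv) TCPStore backend;
# must run before Lightning initializes torch.distributed
os.environ.setdefault("USE_LIBUV", "0")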

fkiraly (Collaborator, Author) commented Aug 30, 2024

I've added a skip in #1631, but haven't closed this issue, as the skip of course does not causally solve the problem...
