
Port Ray example to Ignite #1735

Open · vfdev-5 opened this issue Mar 3, 2021 · 14 comments

@vfdev-5 (Collaborator) commented Mar 3, 2021

@Devanshu24 (Contributor)

Hi!
I do not have much experience with distributed algorithms, but I really like them and am learning them. It would be great if I could work on this, since it would give me good exposure to both ML and distributed workflows (both of which I really like :D). However, I am not sure I will be able to work at a very fast pace, so if this is urgent (or not doable by beginners), someone else can take it up; otherwise I'd love to work on it :)

@sdesrozis (Contributor) commented Mar 6, 2021

@vfdev-5 Your idea is to use ray.tune as in the doc you mentioned? I mean the experiment tool?

@Devanshu24 if so, the baseline for this should be our CIFAR distributed training use case. Please see https://github.com/pytorch/ignite/tree/master/examples/contrib/cifar10

If you are motivated to learn about distributed training, why not have a look at the link above? Before going further, it would be important to be comfortable with this. What do you think?

@Devanshu24 (Contributor)

Thanks for the reply @sdesrozis!
To confirm that I am getting it correctly: we want to use ray.tune and the other distributed utilities provided by Ray and see how it performs in comparison to the CIFAR example already in Ignite (https://github.com/pytorch/ignite/tree/master/examples/contrib/cifar10). Correct?
If so, then sure, I completely agree. I'll start by going through the Ignite example and hopefully make some headway before starting on the Ray implementation! :D

@vfdev-5 (Collaborator, Author) commented Mar 7, 2021

The idea is to port this example: https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/cifar10_pytorch.py

  • use Ignite for training and validation
  • use Ray Tune for hyperparameter tuning

as a simple script file under examples/contrib/cifar10_ray_tune (a rough sketch is included below).

A great addition would also be a PR to the Ray docs with the example.
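
For reference, a minimal sketch of what such a script could look like, assuming Ray Tune's function API (tune.report) and using placeholder Net and get_data_loaders helpers that stand in for the model and data code of the real example; this is an illustration of the idea, not the final implementation.

import torch
import torch.nn as nn
from torch.optim import SGD

from ray import tune

from ignite.engine import Events, create_supervised_evaluator, create_supervised_trainer
from ignite.metrics import Accuracy, Loss


def train_cifar(config):
    # Net and get_data_loaders are placeholders for the model/data code of the real example.
    model = Net()
    train_loader, val_loader = get_data_loaders(config["batch_size"])

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    optimizer = SGD(model.parameters(), lr=config["lr"], momentum=config["momentum"])
    criterion = nn.CrossEntropyLoss()

    # Ignite handles training and validation
    trainer = create_supervised_trainer(model, optimizer, criterion, device=device)
    evaluator = create_supervised_evaluator(
        model, metrics={"accuracy": Accuracy(), "loss": Loss(criterion)}, device=device
    )

    @trainer.on(Events.EPOCH_COMPLETED)
    def report_to_tune(engine):
        # Run validation and report the metrics to Ray Tune after every epoch
        evaluator.run(val_loader)
        metrics = evaluator.state.metrics
        tune.report(loss=metrics["loss"], mean_accuracy=metrics["accuracy"])

    trainer.run(train_loader, max_epochs=config["max_epochs"])


# Ray Tune handles the hyperparameter search over the Ignite training function
analysis = tune.run(
    train_cifar,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),
        "momentum": 0.9,
        "batch_size": tune.choice([32, 64, 128]),
        "max_epochs": 10,
    },
    metric="loss",
    mode="min",
    num_samples=10,
)
print("Best hyperparameters found were:", analysis.best_config)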

@Rajathbharadwaj

Can't we do it the same way Ray is integrated with PyTorch Lightning (PL)?
By creating callbacks?

@vfdev-5 (Collaborator, Author) commented Mar 8, 2021

Can't we do it the same way Ray is integrated with PyTorch Lightning (PL)?
By creating callbacks?

Could you please detail your idea?

@Rajathbharadwaj

https://docs.ray.io/en/master/tune/tutorials/tune-pytorch-lightning.html#training-with-gpus
Something similar to the above. An abstract implementation:

import os
import shutil
import sys
import tempfile

import torch
import torch.nn as nn
from torch.optim import SGD

from ignite.engine import create_supervised_evaluator, create_supervised_trainer
from ignite.metrics import Accuracy, Loss
from ignite.utils import setup_logger

from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

# Proposed integration module -- this does not exist in Ray yet, adding it is the point of this issue:
from ray.tune.integration.pytorch_ignite import TuneReportCallback


def run(train_batch_size, val_batch_size, epochs, lr, momentum, log_dir):
    # Net and get_data_loaders as defined in Ignite's MNIST example
    train_loader, val_loader = get_data_loaders(train_batch_size, val_batch_size)
    model = Net()

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda"

    model.to(device)  # Move model before creating optimizer
    optimizer = SGD(model.parameters(), lr=lr, momentum=momentum)
    criterion = nn.CrossEntropyLoss()
    trainer = create_supervised_trainer(model, optimizer, criterion, device=device)
    trainer.logger = setup_logger("Trainer")

    if sys.version_info > (3,):
        from ignite.contrib.metrics.gpu_info import GpuInfo

        try:
            GpuInfo().attach(trainer)
        except RuntimeError:
            print(
                "INFO: By default, in this example it is possible to log GPU information (used memory, utilization). "
                "As there is no pynvml python package installed, GPU information won't be logged. Otherwise, please "
                "install it : `pip install pynvml`"
            )

    metrics = {"accuracy": Accuracy(), "loss": Loss(criterion)}

    train_evaluator = create_supervised_evaluator(model, metrics=metrics, device=device)
    train_evaluator.logger = setup_logger("Train Evaluator")
    validation_evaluator = create_supervised_evaluator(model, metrics=metrics, device=device)
    validation_evaluator.logger = setup_logger("Val Evaluator")

    # Where the Lightning example passes callbacks=[TuneReportCallback(...)] to its Trainer,
    # the Ignite version would attach the proposed callback to the validation evaluator,
    # reporting the listed metrics back to Tune after each validation run
    # (the attach() call below is a sketch of the hypothetical API):
    TuneReportCallback({"loss": "loss", "mean_accuracy": "accuracy"}).attach(validation_evaluator)


def tune_mnist_asha(num_samples=10, num_epochs=10, gpus_per_trial=0):
    data_dir = os.path.join(tempfile.gettempdir(), "mnist_data_")

    config = {
        "layer_1_size": tune.choice([32, 64, 128]),
        "layer_2_size": tune.choice([64, 128, 256]),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
    }

    scheduler = ASHAScheduler(
        max_t=num_epochs,
        grace_period=1,
        reduction_factor=2)

    reporter = CLIReporter(
        parameter_columns=["layer_1_size", "layer_2_size", "lr", "batch_size"],
        metric_columns=["loss", "mean_accuracy", "training_iteration"])

    analysis = tune.run(
        # train_mnist_tune would be the Ignite training function (like run() above)
        # wrapped as a Tune trainable
        tune.with_parameters(
            train_mnist_tune,
            data_dir=data_dir,
            num_epochs=num_epochs,
            num_gpus=gpus_per_trial),
        resources_per_trial={
            "cpu": 1,
            "gpu": gpus_per_trial
        },
        metric="loss",
        mode="min",
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter,
        name="tune_mnist_asha")

    print("Best hyperparameters found were: ", analysis.best_config)

    shutil.rmtree(data_dir)

Since most of the heavy lifting is done by Ray, what I was thinking is that we could extrapolate by adding a pytorch_ignite module to the ray.tune.integration namespace and implementing it in Ignite's particular way of calling handlers.
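
For illustration, a rough sketch of what such an integration could look like, assuming the callback is written as a plain Ignite event handler; the class name TuneReportHandler and its attach() signature are hypothetical here, only tune.report() and Ignite's handler API are real.

from ray import tune
from ignite.engine import Engine, Events


class TuneReportHandler:
    """Reports Ignite metrics to Ray Tune whenever the attached event fires (hypothetical sketch)."""

    def __init__(self, metrics):
        # Maps the name reported to Tune -> the key in engine.state.metrics
        self.metrics = metrics

    def attach(self, engine: Engine, event=Events.COMPLETED):
        engine.add_event_handler(event, self)

    def __call__(self, engine: Engine):
        report = {
            tune_name: engine.state.metrics[ignite_name]
            for tune_name, ignite_name in self.metrics.items()
        }
        tune.report(**report)


# Usage: attach to the validation evaluator so metrics are reported to Tune
# each time evaluator.run(val_loader) completes, e.g.
# TuneReportHandler({"loss": "loss", "mean_accuracy": "accuracy"}).attach(validation_evaluator)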

@vfdev-5 (Collaborator, Author) commented Mar 8, 2021

@Rajathbharadwaj thanks for the details. Yes, this would be great!

@Rajathbharadwaj

Awesome, I'll work on the integration.
Any tips would be appreciated!

https://github.com/ray-project/ray/blob/master/python/ray/tune/integration/pytorch_lightning.py

I will convert this to PyTorch Ignite's way of implementing it.

@vfdev-5 (Collaborator, Author) commented Mar 22, 2021

@Rajathbharadwaj any updates on this porting?

@Rajathbharadwaj

Hey @vfdev-5, I got a bit held up. But I'm working on it. Will ping you.

@vfdev-5 (Collaborator, Author) commented May 15, 2021

@Rajathbharadwaj are you still working on this issue?

@gucifer (Contributor) commented Feb 12, 2022

Hey @vfdev-5, if no one else is working on this, can I pick it up?

@vfdev-5 (Collaborator, Author) commented Feb 12, 2022

Hey @vfdev-5, if no one else is working on this, can I pick it up?

Sure, go ahead. Thanks!
