
[AIR] <Part 4> Lightning Trainer Release tests + docstring sample test #33323

Merged · 23 commits · Mar 24, 2023

Changes from 7 commits

Commits
30dab8e
make docstring sample code testable
woshiyyya Mar 15, 2023
7fbf44b
add release tests
woshiyyya Mar 15, 2023
5a65129
change testcode tag
woshiyyya Mar 15, 2023
f53031f
clean yaml files
woshiyyya Mar 15, 2023
45ebad0
fixing release tests config
woshiyyya Mar 15, 2023
16a33d4
Merge remote-tracking branch 'upstream/master' into air/lightning_rel…
woshiyyya Mar 15, 2023
b796f84
Merge remote-tracking branch 'upstream/master' into air/lightning_rel…
woshiyyya Mar 16, 2023
1c6ed91
downgrade torchmetrics version
woshiyyya Mar 17, 2023
c4fc217
Merge remote-tracking branch 'upstream/master' into air/lightning_rel…
woshiyyya Mar 20, 2023
b9e723a
debug release test
woshiyyya Mar 20, 2023
bfa2e1a
remove all pip packages, simply use base image
woshiyyya Mar 21, 2023
6e51852
Merge remote-tracking branch 'upstream/master' into air/lightning_rel…
woshiyyya Mar 21, 2023
8d567f3
add tuner test
woshiyyya Mar 21, 2023
f7de8a7
Merge remote-tracking branch 'upstream/master' into air/lightning_rel…
woshiyyya Mar 22, 2023
4997e16
address comments
woshiyyya Mar 23, 2023
99b595e
Merge remote-tracking branch 'upstream/master' into air/lightning_rel…
woshiyyya Mar 23, 2023
2a45a6f
fix lint
woshiyyya Mar 23, 2023
5cc8839
Merge remote-tracking branch 'upstream/master' into air/lightning_rel…
woshiyyya Mar 23, 2023
8ea7bee
fix testname
woshiyyya Mar 23, 2023
d765324
fix release test submission
woshiyyya Mar 23, 2023
3dfbc59
try to include the util file
woshiyyya Mar 23, 2023
1681009
Merge remote-tracking branch 'upstream/master' into air/lightning_rel…
woshiyyya Mar 24, 2023
afc510d
fix mutation config with feature dim
woshiyyya Mar 24, 2023
25 changes: 15 additions & 10 deletions python/ray/train/lightning/lightning_trainer.py
@@ -178,10 +178,8 @@ class LightningTrainer(TorchTrainer):
using the arguments provided in ``trainer_init_config`` and then run
``pytorch_lightning.Trainer.fit``.

- TODO(yunxuanx): make this example testable
-
Example:
- .. code-block:: python
+ .. testcode::

import torch
import torch.nn.functional as F
@@ -191,7 +189,8 @@ class LightningTrainer(TorchTrainer):
from torchvision import transforms
import pytorch_lightning as pl
from ray.air.config import ScalingConfig
- from ray.train.lightning import LightningTrainer, LightningConfig
+ from ray.train.lightning import LightningTrainer, LightningConfigBuilder


class MNISTClassifier(pl.LightningModule):
def __init__(self, lr, feature_dim):
@@ -211,7 +210,7 @@ def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self(x)
loss = torch.nn.functional.cross_entropy(y_hat, y)
- self.log('train_loss', loss)
+ self.log("train_loss", loss)
return loss

def validation_step(self, val_batch, batch_idx):
@@ -232,9 +231,8 @@ def configure_optimizers(self):
return optimizer

# Prepare MNIST Datasets
- transform = transforms.Compose([
- transforms.ToTensor(),
- transforms.Normalize((0.1307,), (0.3081,))]
+ transform = transforms.Compose(
+ [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)
mnist_train = MNIST(
'./data', train=True, download=True, transform=transform
@@ -248,8 +246,8 @@ def configure_optimizers(self):
- mnist_train = Subset(mnist_train, range(1000))
+ mnist_train = Subset(mnist_train, range(500))

- train_loader = DataLoader(mnist_train, batch_size=32, shuffle=True)
- val_loader = DataLoader(mnist_val, batch_size=32, shuffle=False)
+ train_loader = DataLoader(mnist_train, batch_size=128, shuffle=True)
+ val_loader = DataLoader(mnist_val, batch_size=128, shuffle=False)

lightning_config = (
LightningConfigBuilder()
@@ -267,6 +265,13 @@ def configure_optimizers(self):
scaling_config=scaling_config,
)
results = trainer.fit()
print(results)

.. testoutput::
:hide:
:options: +ELLIPSIS

...

Args:
lightning_config: Configuration for setting up the Pytorch Lightning Trainer.
18 changes: 18 additions & 0 deletions release/lightning_tests/app_config.yaml
@@ -0,0 +1,18 @@
base_image: {{ env["RAY_IMAGE_ML_NIGHTLY_GPU"] | default("anyscale/ray-ml:nightly-py37-gpu") }}

debian_packages:
- curl

python:
pip_packages:
- pytorch-lightning
- torch==1.11.0
conda_packages: []

post_build_cmds:
- echo {{ env["TIMESTAMP"] }}
- pip3 install -U --force-reinstall pytorch-lightning
- pip3 install --force-reinstall torch==1.11.0
- pip3 install --force-reinstall torchvision==0.12.0
- pip uninstall -y ray || true && pip3 install -U {{ env["RAY_WHEELS"] | default("ray") }}
- {{ env["RAY_WHEELS_SANITY_CHECK"] | default("echo No Ray wheels sanity check") }}
22 changes: 22 additions & 0 deletions release/lightning_tests/compute_tpl.yaml
@@ -0,0 +1,22 @@
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west-2

max_workers: 2

head_node_type:
name: head_node
instance_type: g3.8xlarge

worker_node_types:
- name: worker_node
instance_type: g3.8xlarge
min_workers: 2
max_workers: 2
use_spot: false

aws:
TagSpecifications:
- ResourceType: "instance"
Tags:
- Key: ttl-hours
Value: '24'
3 changes: 3 additions & 0 deletions release/lightning_tests/driver_requirements.txt
@@ -0,0 +1,3 @@
torch==1.11.0
torchvision==0.12.0
pytorch-lightning
3 changes: 3 additions & 0 deletions release/lightning_tests/driver_setup.sh
@@ -0,0 +1,3 @@
#!/bin/bash

pip install -U -r ./driver_requirements.txt
138 changes: 138 additions & 0 deletions release/lightning_tests/workloads/test_trainer.py
@@ -0,0 +1,138 @@
import os
import time
import json

import pytorch_lightning as pl

import torch
import torch.nn.functional as F
from torchmetrics import Accuracy
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision import transforms
from pytorch_lightning.loggers.csv_logs import CSVLogger
from ray.air.config import ScalingConfig
from ray.train.lightning import LightningTrainer, LightningConfigBuilder


class MNISTClassifier(pl.LightningModule):
def __init__(self, lr, feature_dim):
super(MNISTClassifier, self).__init__()
self.fc1 = torch.nn.Linear(28 * 28, feature_dim)
self.fc2 = torch.nn.Linear(feature_dim, 10)
self.lr = lr
self.accuracy = Accuracy()

def forward(self, x):
x = x.view(-1, 28 * 28)
x = torch.relu(self.fc1(x))
x = self.fc2(x)
return x

def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self(x)
loss = torch.nn.functional.cross_entropy(y_hat, y)
self.log("train_loss", loss, on_step=True)
return loss

def validation_step(self, val_batch, batch_idx):
x, y = val_batch
logits = self.forward(x)
loss = F.nll_loss(logits, y)
acc = self.accuracy(logits, y)
return {"val_loss": loss, "val_accuracy": acc}

def validation_epoch_end(self, outputs):
avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
avg_acc = torch.stack([x["val_accuracy"] for x in outputs]).mean()
self.log("ptl/val_loss", avg_loss, sync_dist=True)
self.log("ptl/val_accuracy", avg_acc, sync_dist=True)

def configure_optimizers(self):
optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
return optimizer


class MNISTDataModule(pl.LightningDataModule):
def __init__(self, batch_size=100):
super().__init__()
self.data_dir = os.getcwd()
self.batch_size = batch_size
self.transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)

def setup(self, stage=None):
# split data into train and val sets
if stage == "fit" or stage is None:
mnist = MNIST(
self.data_dir, train=True, download=True, transform=self.transform
)
self.mnist_train, self.mnist_val = random_split(mnist, [55000, 5000])

# assign test set for use in dataloader(s)
if stage == "test" or stage is None:
self.mnist_test = MNIST(
self.data_dir, train=False, download=True, transform=self.transform
)

def train_dataloader(self):
return DataLoader(self.mnist_train, batch_size=self.batch_size, num_workers=4)

def val_dataloader(self):
return DataLoader(self.mnist_val, batch_size=self.batch_size, num_workers=4)

def test_dataloader(self):
return DataLoader(self.mnist_test, batch_size=self.batch_size, num_workers=4)


def get_mnist_dataloaders():
transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)
mnist_train = MNIST("/tmp/data", train=True, download=True, transform=transform)
mnist_val = MNIST("/tmp/data", train=False, download=True, transform=transform)
train_loader = DataLoader(mnist_train, batch_size=32, shuffle=True)
val_loader = DataLoader(mnist_val, batch_size=32, shuffle=False)
return train_loader, val_loader


if __name__ == "__main__":
start = time.time()

lightning_config = (
LightningConfigBuilder()
.module(MNISTClassifier, feature_dim=128, lr=0.001)
.trainer(
max_epochs=3,
accelerator="gpu",
logger=CSVLogger("logs", name="my_exp_name"),
)
.fit_params(datamodule=MNISTDataModule(batch_size=128))
.checkpointing(monitor="ptl/val_accuracy", mode="max", save_last=True)
.build()
)

scaling_config = ScalingConfig(
num_workers=6, use_gpu=True, resources_per_worker={"CPU": 1, "GPU": 1}
)

trainer = LightningTrainer(
lightning_config=lightning_config,
scaling_config=scaling_config,
)

trainer.fit()

taken = time.time() - start
result = {
"time_taken": taken,
}
test_output_json = os.environ.get(
"TEST_OUTPUT_JSON", "/tmp/ray_lightning_user_test.json"
)
with open(test_output_json, "wt") as f:
json.dump(result, f)

print("Test Successful!")
24 changes: 24 additions & 0 deletions release/release_tests.yaml
@@ -955,6 +955,30 @@

alert: default

#######################
# Lightning tests
#######################

- name: lightning_trainer_test_latest
group: Lightning tests
working_dir: lightning_tests

frequency: nightly-3x
team: ml

cluster:
cluster_env: app_config.yaml
cluster_compute: compute_tpl.yaml

driver_setup: driver_setup.sh
run:
timeout: 1200
script: python workloads/test_trainer.py
wait_for_nodes:
num_nodes: 3

alert: default


#######################
# ML user tests