
MultiGPU training + changes to Checkpointing logic #218

Merged: 68 commits into dev, May 25, 2023

Conversation

prabhuteja12 (Contributor):

This is a WIP to rework checkpointing and multi-GPU training in Renate.

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -16,7 +16,7 @@ def __init__(self, num_hidden: int) -> None:
         # Model hyperparameters as well as the loss function need to registered via RenateModule's
         # constructor, see documentation. Otherwise, this is a standard torch model.
         super().__init__(
-            constructor_arguments={"num_hidden": num_hidden}, loss_fn=torch.nn.CrossEntropyLoss()
+            constructor_arguments={"num_hidden": num_hidden}
Contributor:
Should fit on single line?

examples/nlp_finetuning/renate_config.py (outdated, resolved)
requirements.txt (outdated, resolved)
src/renate/benchmark/models/base.py (outdated, resolved)
src/renate/cli/parsing_functions.py (outdated, resolved)


def int_or_str(x: str) -> Union[str, int]:
"""Function to cast to int or str. This is used to tackle precision
Contributor:
Single first line of doc string.

src/renate/utils/misc.py (outdated, resolved)
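
For context, a minimal sketch of what such a helper could look like, assuming it is meant to parse values like PyTorch Lightning's precision setting, which accepts both integers (16, 32) and strings such as "bf16". The body below is an illustration, not the code under review.

from typing import Union


def int_or_str(x: str) -> Union[str, int]:
    """Cast a string to int if possible, otherwise return it unchanged."""
    try:
        return int(x)
    except ValueError:
        return x
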
@@ -254,6 +270,11 @@ def get_renate_module_mlp(
)


@pytest.helpers.register
Contributor:
Seems overkill for a single line ;)

Contributor Author:
It is reused subsequently.

test/conftest.py (resolved)
doc/getting_started/how_to_renate_config.rst (resolved)
examples/getting_started/renate_config.py (outdated, resolved)
src/renate/updaters/experimental/er.py (outdated, resolved)
src/renate/updaters/learner.py (outdated, resolved)
"""Returns the state of the learner."""
return {
def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
learner_state_dict = {
"learner_class_name": self.__class__.__name__,
Contributor:
It seems like this is not used anymore. I don't mind leaving it in, but we could remove it. Your call.

Contributor Author:
It isn't, but I left it in as a sanity check.
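
As an aside on the hook used in this hunk, the toy illustration below (ToyLearner is a made-up class, not the PR's Learner) shows the general pattern: on_save_checkpoint mutates the checkpoint dict in place, and anything added there ends up in the file Lightning writes.

from typing import Any, Dict

from pytorch_lightning import LightningModule


class ToyLearner(LightningModule):
    def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
        # Record which learner class produced this checkpoint so loading
        # code can later verify it restores the right class.
        checkpoint["learner_class_name"] = self.__class__.__name__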

@@ -411,32 +379,42 @@ def __init__(
        **kwargs,
    ) -> None:
        super().__init__(seed=seed, **kwargs)
        self.save_hyperparameters(
Contributor:
I am not sure about this. Based on playing with a toy example, I think the save_hyperparameters call in the base class might be enough. Let's talk about it offline.
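
To make that point concrete, here is a toy sketch (not Renate code; Base and Child are made up) of the behaviour being discussed: in recent PyTorch Lightning versions, a save_hyperparameters() call in the base class also collects the subclass's __init__ arguments, which is why the extra call may be redundant.

import pytorch_lightning as pl


class Base(pl.LightningModule):
    def __init__(self, seed: int = 0) -> None:
        super().__init__()
        # Collects init args from the whole __init__ call chain.
        self.save_hyperparameters()


class Child(Base):
    def __init__(self, lr: float = 1e-3, seed: int = 0) -> None:
        super().__init__(seed=seed)


print(Child(lr=0.01).hparams)  # expected to contain both "lr" and "seed"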

-        pl_module.load_state_dict(self._model, torch.load(learner_state_path)["state_dict"])
+        loaded_state = trainer.strategy.load_checkpoint(learner_state_path)
+        pl_module.on_load_checkpoint(loaded_state)
+        # This loads the state dict only if its not Deepspeed.
Contributor:
Unsure what the comment refers to. Can you be more specific about which part is needed for Deepspeed?
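
For reference, one possible reading of that comment, sketched under the assumption (not verified against the PR) that DeepSpeed produces sharded checkpoint directories that must be read through the strategy, while other strategies yield a single file whose "state_dict" can be loaded onto the module directly; load_learner_state is a hypothetical helper:

from pytorch_lightning.strategies import DeepSpeedStrategy


def load_learner_state(trainer, pl_module, learner_state_path):
    # The strategy knows how to read its own checkpoint format.
    loaded_state = trainer.strategy.load_checkpoint(learner_state_path)
    pl_module.on_load_checkpoint(loaded_state)
    if not isinstance(trainer.strategy, DeepSpeedStrategy):
        # Non-sharded checkpoint: the weights can be restored directly.
        pl_module.load_state_dict(loaded_state["state_dict"])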

        # Finalize model update.
        pl_module.on_model_update_end()
        # Save permanently.
        pl_module.save(self._output_state_folder)
        # Overwrite checkpoint.
        self._save_checkpoint(trainer, learner_state_path)

    def teardown(self, trainer: Trainer, pl_module: LightningModule, stage: str) -> None:
Contributor:
A doc string would be helpful here. I don't exactly follow what we do here. Is the goal to have a separate file that just contains the model weights?

Contributor Author:
Yes, I added a detailed description.
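
Based on that answer, a hypothetical sketch of the idea (WeightsOnlyTeardown and the model_weights.pt filename are assumptions, not the PR's implementation): teardown writes a plain weights-only file next to the full trainer checkpoint so the model can later be loaded without any trainer state.

import os

import torch
from pytorch_lightning import LightningModule, Trainer


class WeightsOnlyTeardown:
    def __init__(self, output_folder: str) -> None:
        self._output_folder = output_folder

    def teardown(self, trainer: Trainer, pl_module: LightningModule, stage: str) -> None:
        """Persist only the model weights, separate from the full checkpoint."""
        if trainer.is_global_zero:
            torch.save(
                pl_module.state_dict(),
                os.path.join(self._output_folder, "model_weights.pt"),
            )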

@lballes (Contributor) commented May 24, 2023:

We need to check the doc strings. I saw that the RenateModule doc string still contains the loss_fn argument. Can you make sure that all doc strings reflect the change? I.e., remove it from RenateModule and its subclasses and add it to Learner and its subclasses.

lballes previously approved these changes May 25, 2023
Signed-off-by: Prabhu Teja S <prabhuteja12@gmail.com>
@wistuba wistuba merged commit 19a2271 into dev May 25, 2023
@wistuba wistuba deleted the checkpointing branch May 25, 2023 14:01