Improve Wandb experience #660

Merged: 80 commits merged on Apr 15, 2024

Commits (80)
a3df205
enable W&B
Mar 25, 2024
3f41234
add default project
Mar 25, 2024
914a901
Delete old checkpoint code (#601)
kartikayk Mar 27, 2024
54a5e2a
fix typo (#606)
yechenzhi Mar 28, 2024
de155dc
Chat dataset + SlimOrca refactor + more templates (#576)
RdoubleA Mar 28, 2024
97b994d
Add Acknowledgements (#613)
kartikayk Mar 29, 2024
08fae5e
Full finetune < 16GB (#527)
rohan-varma Mar 29, 2024
ae600b2
Small fix to README for full finetune (#615)
rohan-varma Mar 29, 2024
290beb5
Add `tune run` and refactor CLI (#586)
joecummings Mar 29, 2024
dc6e54d
Fix typos in Acknowledgements section (#617)
joecummings Mar 29, 2024
a804c23
HuggingFace --> Hugging Face (#618)
joecummings Mar 29, 2024
0217bfa
Configure max_seq_len in InstructDataset (#620)
RdoubleA Mar 30, 2024
98ae830
Inference (#619)
kartikayk Mar 31, 2024
f60ebb2
Print out "Ignoring patterns" for download (#625)
joecummings Mar 31, 2024
ba93269
[Fix] Update the tune command to kick off training in yaml files (#628)
SLR722 Apr 1, 2024
ee2f82b
Remove conversion script (#629)
joecummings Apr 1, 2024
83660cd
Add Mistral models to recipe registry (#631)
joecummings Apr 1, 2024
97381a7
Fix first_finetune tutorial (#634)
kartikayk Apr 2, 2024
86c6ee4
update license (#635)
kartikayk Apr 2, 2024
ec3d93e
Remove _copy_tensor from usage (#633)
rohan-varma Apr 2, 2024
4c6460f
Add fp32 support for QLoRA (#595)
rohan-varma Apr 2, 2024
72dd372
Add link to docs from README (#636)
NicolasHug Apr 2, 2024
d876889
Refactor datasets and tokenizer (#624)
ebsmothers Apr 2, 2024
0770781
Split alpaca_dataset to alpaca + alpaca_cleaned (#639)
RdoubleA Apr 2, 2024
f085a77
Add weights_only flag to torchtune checkpointer (#642)
kartikayk Apr 2, 2024
34accd9
add missing tokenize_messages docstring (#643)
ebsmothers Apr 2, 2024
07d3813
Add string to InstructTemplate, ChatFormat getters (#641)
RdoubleA Apr 3, 2024
7fab51f
Add ``include_package_data`` to setuptools (#649)
joecummings Apr 3, 2024
96ecf28
Add verification of llama model access in first_finetune_tutorial.rst…
iseeyuan Apr 4, 2024
77eb695
grad accum in LoRA distributed recipe (#644)
ebsmothers Apr 4, 2024
76c21b7
Gemma (#630)
solitude-alive Apr 4, 2024
cba0560
Validate messages (#647)
RdoubleA Apr 4, 2024
98f82e5
[Perf Tools] Torch profiler component (#627)
SLR722 Apr 4, 2024
e97720a
Adding quantization support in torchtune (#632)
jerryzh168 Apr 5, 2024
1162295
tiny llama
Apr 5, 2024
99283ae
fix tiny config
Apr 5, 2024
6e1bfcc
updated recipe tiny
Apr 5, 2024
2ff4db7
Merge remote-tracking branch 'upstream/main' into wandb
tcapelle Apr 8, 2024
f73b4d7
grab yaml used on input
tcapelle Apr 9, 2024
6a888ac
update the wandb.config with the re-loaded yaml
tcapelle Apr 9, 2024
90ebe53
undo changes
tcapelle Apr 9, 2024
c0f81ae
some clean up logic
tcapelle Apr 9, 2024
c80c7ab
walrus :)
tcapelle Apr 9, 2024
fedfd9c
Merge remote-tracking branch 'upstream/main' into wandb
tcapelle Apr 11, 2024
6880cba
remove nb
tcapelle Apr 11, 2024
1ddbe7a
add docs
tcapelle Apr 11, 2024
ad42a35
refactor memory so it's loggable
tcapelle Apr 11, 2024
2af532a
put checkpoint logic as a tutorial
tcapelle Apr 11, 2024
c8c771e
refactor memory logging
tcapelle Apr 12, 2024
8518e26
dump omegaconf as ground true
tcapelle Apr 12, 2024
b2ab64c
revert to return a dict
tcapelle Apr 12, 2024
b45edeb
Merge remote-tracking branch 'upstream/main' into wandb
tcapelle Apr 12, 2024
c251b59
rename to log_config
tcapelle Apr 12, 2024
a9f43fe
undo lora stuff
tcapelle Apr 12, 2024
7b66488
remove tiny llama
tcapelle Apr 12, 2024
4985758
add distributed training logics
tcapelle Apr 12, 2024
3ed657c
compute number of params
tcapelle Apr 12, 2024
5cc577a
use provided tiny llama
tcapelle Apr 12, 2024
b64300f
add logging
tcapelle Apr 12, 2024
3d18e69
typos
tcapelle Apr 12, 2024
0b7c13a
remove big yaml
tcapelle Apr 12, 2024
80bd373
add tok
tcapelle Apr 12, 2024
29a5252
update memory logging logic
tcapelle Apr 12, 2024
c040735
fix output class
tcapelle Apr 12, 2024
7069498
remove unused imports
tcapelle Apr 12, 2024
f319fbf
better docstrings
tcapelle Apr 12, 2024
8202bc9
integrate logging to other recipes
tcapelle Apr 12, 2024
09f59e8
remove tiny llama
tcapelle Apr 12, 2024
ce42cc8
destroy on rank zero
tcapelle Apr 12, 2024
cf8a948
missing tab
tcapelle Apr 12, 2024
bfb8e98
missing tab
tcapelle Apr 12, 2024
820bef1
literal typing
tcapelle Apr 12, 2024
8fee701
add wandb workspace screenshot
tcapelle Apr 12, 2024
50cc6a8
Update docs/source/examples/wandb_logging.rst
tcapelle Apr 12, 2024
f3fe9e5
Update torchtune/utils/metric_logging.py
tcapelle Apr 12, 2024
fffba10
remove multi-node logi
tcapelle Apr 12, 2024
52c2207
Merge remote-tracking branch 'origin/wandb' into wandb
tcapelle Apr 12, 2024
1256087
run linter, only log memory stats with cuda
RdoubleA Apr 15, 2024
97e8aa3
fix docs, link deep dive in index
RdoubleA Apr 15, 2024
55513f7
address comments
RdoubleA Apr 15, 2024
Binary file added docs/source/_static/img/torchtune_workspace.png
79 changes: 79 additions & 0 deletions docs/source/examples/wandb_logging.rst
@@ -0,0 +1,79 @@
.. _wandb_logging:

===========================
Logging to Weights & Biases
===========================

.. customcarditem::
:header: Logging to Weights & Biases
:card_description: Log metrics and model checkpoints to W&B
:image: _static/img/torchtune_workspace.png
:link: examples/wandb_logging.html
:tags: logging,wandb


Torchtune supports logging your training runs to `Weights & Biases <https://wandb.ai>`_.

.. note::

    You will need to install the ``wandb`` package to use this feature.
    You can install it via pip:

    .. code-block:: bash

        pip install wandb

Contributor (review comment):
Can you add a "tip" to run wandb login before running?

Contributor (reply):
good catch
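
A minimal sketch of the tip being requested, assuming the standard ``wandb`` CLI, might read:

.. code-block:: bash

    # authenticate once before launching a run; the API key is cached locally
    wandb login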


Metric Logger
-------------

The only change you need to make is to add the metric logger to your config. Weights & Biases will then capture the metrics you log during training; logging model checkpoints is covered separately below.

.. code-block:: yaml

    # enable logging to the built-in WandBLogger
    metric_logger:
      _component_: torchtune.utils.metric_logging.WandBLogger
      # the W&B project to log to
      project: torchtune
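
The same logger can also be selected without editing the config file, by passing overrides to ``tune run`` on the command line. The recipe and config names below are placeholders; substitute your own:

.. code-block:: bash

    # hypothetical invocation: override the metric logger from the CLI
    tune run full_finetune_single_device --config <your_config> \
      metric_logger._component_=torchtune.utils.metric_logging.WandBLogger \
      metric_logger.project=torchtune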


We automatically grab the config from the recipe you are running and log it to W&B. You can find it in the W&B Overview tab, and the actual file in the ``Files`` tab.
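
Under the hood, a ``log_config``-style hook on the metric logger can resolve the config and hand it to W&B. A minimal sketch of the idea (the exact torchtune implementation may differ) is:

.. code-block:: python

    from omegaconf import DictConfig, OmegaConf
    import wandb

    def log_config(config: DictConfig) -> None:
        # resolve interpolations and convert to a plain dict so that
        # CLI overrides show up in the W&B run's config panel
        resolved = OmegaConf.to_container(config, resolve=True)
        wandb.config.update(resolved)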

.. note::

    Click on this sample `project to see the W&B workspace <https://wandb.ai/capecape/torchtune>`_.
    The config used to train the models can be found `here <https://wandb.ai/capecape/torchtune/runs/6053ofw0/files/torchtune_config_j67sb73v.yaml>`_.

Logging Model Checkpoints to W&B
--------------------------------

You can also log the model checkpoints to W&B by modifying the ``save_checkpoint`` method of the recipe you are running.

A suggested approach would be something like this:

.. code-block:: python

    def save_checkpoint(self, epoch: int) -> None:
        ...
        ## Let's save the checkpoint to W&B
        ## depending on the Checkpointer class the file will be named differently
        ## Here is an example for the full_finetune case
        checkpoint_file = Path.joinpath(
            self._checkpointer._output_dir, f"torchtune_model_{epoch}"
        ).with_suffix(".pt")
        wandb_at = wandb.Artifact(
            name=f"torchtune_model_{epoch}",
            type="model",
            # description of the model checkpoint
            description="Model checkpoint",
            # you can add whatever metadata you want as a dict
            metadata={
                utils.SEED_KEY: self.seed,
                utils.EPOCHS_KEY: self.epochs_run,
                utils.TOTAL_EPOCHS_KEY: self.total_epochs,
                utils.MAX_STEPS_KEY: self.max_steps_per_epoch,
            },
        )
        wandb_at.add_file(checkpoint_file)
        wandb.log_artifact(wandb_at)
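
If you adapt this in a distributed recipe, keep the ``wandb`` calls on rank zero only, mirroring how the recipes in this PR instantiate the metric logger and call ``log_config`` behind an ``if self._is_rank_zero`` check.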
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -107,6 +107,7 @@ TorchTune tutorials.
examples/checkpointer
examples/configs
examples/recipe_deepdive
examples/wandb_logging

.. toctree::
:glob:
25 changes: 15 additions & 10 deletions recipes/full_finetune_distributed.py
@@ -147,7 +147,11 @@ def setup(self, cfg: DictConfig) -> None:
Sets up the recipe state correctly. This includes setting recipe attributes based
on the ``resume_from_checkpoint`` flag.
"""
self._metric_logger = config.instantiate(cfg.metric_logger)
if self._is_rank_zero:
self._metric_logger = config.instantiate(cfg.metric_logger)

# log config with parameter override
self._metric_logger.log_config(cfg)

ckpt_dict = self.load_checkpoint(cfg.checkpointer)

@@ -266,12 +270,9 @@ def _setup_model(
utils.set_activation_checkpointing(
model, auto_wrap_policy={modules.TransformerDecoderLayer}
)
if self._is_rank_zero:
log.info(
utils.memory_stats_log(
"Memory Stats after model init", device=self._device
)
)
if self._is_rank_zero and self._device == torch.device("cuda"):
memory_stats = utils.memory_stats_log(device=self._device)
log.info(f"Memory Stats after model init:\n{memory_stats}")

# synchronize before training begins
torch.distributed.barrier()
@@ -450,16 +451,20 @@ def train(self) -> None:
if (
self.total_training_steps % self._log_peak_memory_every_n_steps == 0
and self._is_rank_zero
and self._device == torch.device("cuda")
):
log.info(
utils.memory_stats_log("Memory Stats", device=self._device)
# Log peak memory for iteration
memory_stats = utils.memory_stats_log(device=self._device)
self._metric_logger.log_dict(
memory_stats, step=self.total_training_steps
)

self.epochs_run += 1
self.save_checkpoint(epoch=curr_epoch)

def cleanup(self) -> None:
self._metric_logger.close()
if self._is_rank_zero:
self._metric_logger.close()
torch.distributed.destroy_process_group()


21 changes: 13 additions & 8 deletions recipes/full_finetune_single_device.py
@@ -150,6 +150,9 @@ def setup(self, cfg: DictConfig) -> None:
"""
self._metric_logger = config.instantiate(cfg.metric_logger)

# log config with parameter override
self._metric_logger.log_config(cfg)

ckpt_dict = self.load_checkpoint(cfg.checkpointer)

# ``_setup_model`` handles initialization and loading the state dict. This method
@@ -231,11 +234,9 @@ def _setup_model(
if compile_model:
log.info("Compiling model with torch.compile...")
model = utils.wrap_compile(model)
log.info(
utils.memory_stats_log(
"Memory Stats after model init:", device=self._device
)
)
if self._device == torch.device("cuda"):
memory_stats = utils.memory_stats_log(device=self._device)
log.info(f"Memory Stats after model init:\n{memory_stats}")
return model

def _setup_optimizer(
@@ -414,9 +415,13 @@ def train(self) -> None:
self.total_training_steps += 1

# Log peak memory for iteration
if self.total_training_steps % self._log_peak_memory_every_n_steps == 0:
log.info(
utils.memory_stats_log("Memory Stats:", device=self._device)
if (
self.total_training_steps % self._log_peak_memory_every_n_steps == 0
and self._device == torch.device("cuda")
):
memory_stats = utils.memory_stats_log(device=self._device)
self._metric_logger.log_dict(
memory_stats, step=self.total_training_steps
)
self.epochs_run += 1
self.save_checkpoint(epoch=curr_epoch)
26 changes: 15 additions & 11 deletions recipes/gemma_full_finetune_distributed.py
@@ -146,7 +146,11 @@ def setup(self, cfg: DictConfig) -> None:
Sets up the recipe state correctly. This includes setting recipe attributes based
on the ``resume_from_checkpoint`` flag.
"""
self._metric_logger = config.instantiate(cfg.metric_logger)
if self._is_rank_zero:
self._metric_logger = config.instantiate(cfg.metric_logger)

# log config with parameter override
self._metric_logger.log_config(cfg)

ckpt_dict = self.load_checkpoint(cfg.checkpointer)

@@ -262,13 +266,9 @@ def _setup_model(
utils.set_activation_checkpointing(
model, auto_wrap_policy={modules.TransformerDecoderLayer}
)
if self._is_rank_zero:
log.info(
utils.memory_stats_log(
"Memory Stats after model init", device=self._device
)
)

if self._is_rank_zero and self._device == torch.device("cuda"):
memory_stats = utils.memory_stats_log(device=self._device)
log.info(f"Memory Stats after model init:\n{memory_stats}")
# synchronize before training begins
torch.distributed.barrier()

@@ -457,16 +457,20 @@ def train(self) -> None:
if (
self.total_training_steps % self._log_peak_memory_every_n_steps == 0
and self._is_rank_zero
and self._device == torch.device("cuda")
):
log.info(
utils.memory_stats_log("Memory Stats", device=self._device)
# Log peak memory for iteration
memory_stats = utils.memory_stats_log(device=self._device)
self._metric_logger.log_dict(
memory_stats, step=self.total_training_steps
)

self.epochs_run += 1
self.save_checkpoint(epoch=curr_epoch)

def cleanup(self) -> None:
self._metric_logger.close()
if self._is_rank_zero:
self._metric_logger.close()
torch.distributed.destroy_process_group()


22 changes: 14 additions & 8 deletions recipes/lora_dpo_single_device.py
@@ -148,6 +148,9 @@ def setup(self, cfg: DictConfig) -> None:
"""
self._metric_logger = config.instantiate(cfg.metric_logger)

# log config with parameter override
self._metric_logger.log_config(cfg)

checkpoint_dict = self.load_checkpoint(cfg_checkpointer=cfg.checkpointer)

self._model = self._setup_model(
@@ -252,11 +255,9 @@ def _setup_model(
)

log.info(f"Model is initialized with precision {self._dtype}.")
log.info(
utils.memory_stats_log(
"Memory Stats after model init:", device=self._device
)
)
if self._device == torch.device("cuda"):
memory_stats = utils.memory_stats_log(device=self._device)
log.info(f"Memory Stats after model init:\n{memory_stats}")
return model

def _setup_optimizer(
@@ -490,9 +491,14 @@ def train(self) -> None:
# Update the number of steps when the weights are updated
self.total_training_steps += 1
# Log peak memory for iteration
if self.total_training_steps % self._log_peak_memory_every_n_steps == 0:
log.info(
utils.memory_stats_log("Memory Stats:", device=self._device)
if (
self.total_training_steps % self._log_peak_memory_every_n_steps == 0
and self._device == torch.device("cuda")
):
# Log peak memory for iteration
memory_stats = utils.memory_stats_log(device=self._device)
self._metric_logger.log_dict(
memory_stats, step=self.total_training_steps
)
self.epochs_run += 1
self.save_checkpoint(epoch=curr_epoch)
19 changes: 11 additions & 8 deletions recipes/lora_finetune_distributed.py
@@ -168,6 +168,9 @@ def setup(self, cfg: DictConfig) -> None:
if self._is_rank_zero:
self._metric_logger = config.instantiate(cfg.metric_logger)

# log config with parameter override
self._metric_logger.log_config(cfg)

checkpoint_dict = self.load_checkpoint(cfg_checkpointer=cfg.checkpointer)

self._model = self._setup_model(
@@ -323,12 +326,9 @@ def _setup_model(
utils.set_activation_checkpointing(
model, auto_wrap_policy={modules.TransformerDecoderLayer}
)
if self._is_rank_zero:
log.info(
utils.memory_stats_log(
"Memory Stats after model init:", device=self._device
)
)
if self._is_rank_zero and self._device == torch.device("cuda"):
memory_stats = utils.memory_stats_log(device=self._device)
log.info(f"Memory Stats after model init:\n{memory_stats}")

# synchronize before training begins
torch.distributed.barrier()
@@ -541,9 +541,12 @@ def train(self) -> None:
if (
self.total_training_steps % self._log_peak_memory_every_n_steps == 0
and self._is_rank_zero
and self._device == torch.device("cuda")
Contributor (review comment):
Do we really need this one too? For distributed tests they should only run on GPU

Contributor (reply):
mm yeah I can remove this check for distributed recipes

):
log.info(
utils.memory_stats_log("Memory Stats:", device=self._device)
# Log peak memory for iteration
memory_stats = utils.memory_stats_log(device=self._device)
self._metric_logger.log_dict(
memory_stats, step=self.total_training_steps
)

self.epochs_run += 1
19 changes: 12 additions & 7 deletions recipes/lora_finetune_single_device.py
@@ -146,6 +146,10 @@ def setup(self, cfg: DictConfig) -> None:
model, tokenizer, loss, optimizer, learning rate scheduler, sampler, and dataloader.
"""
self._metric_logger = config.instantiate(cfg.metric_logger)

# log config with parameter override
self._metric_logger.log_config(cfg)

self._model_compile = cfg.compile
checkpoint_dict = self.load_checkpoint(cfg_checkpointer=cfg.checkpointer)

@@ -263,11 +267,9 @@ def _setup_model(
if compile_model:
log.info("Compiling model with torch.compile...")
model = utils.wrap_compile(model)
log.info(
utils.memory_stats_log(
"Memory Stats after model init:", device=self._device
)
)
if self._device == torch.device("cuda"):
Contributor (review comment):
This is for the CPU recipe tests?

Contributor (reply):
yes, this actually won't throw; the log just doesn't print anything

memory_stats = utils.memory_stats_log(device=self._device)
log.info(f"Memory Stats after model init:\n{memory_stats}")
return model

def _setup_optimizer(
@@ -446,9 +448,12 @@ def train(self) -> None:
if (
self.total_training_steps % self._log_peak_memory_every_n_steps
== 0
and self._device == torch.device("cuda")
):
log.info(
utils.memory_stats_log("Memory Stats:", device=self._device)
# Log peak memory for iteration
memory_stats = utils.memory_stats_log(device=self._device)
self._metric_logger.log_dict(
memory_stats, step=self.total_training_steps
)
self.epochs_run += 1
self.save_checkpoint(epoch=curr_epoch)