-
Notifications
You must be signed in to change notification settings - Fork 530
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CometLogger for metrics logging via comet.ml #1221
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1221
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 27c90e3 with merge base 0057fe7 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Hi @dzheng256! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
@dzheng256 Would you mind signing the CLA agreement? |
Hey @joecummings I actually already had an email exchange with the meta CLA folks about this and I think I'm on the agreement already but it might be my corporate email (danz@netflix.com) rather than the account I'm making this PR from (dzheng256@gmail.com). I've asked them to add my personal email to the corporate CLA as well but haven't heard back. |
It seems like the tests ran and passed, @joecummings does that mean the CLA issue is resolved? |
@dzheng256 It seems the CLA check is still failing. Have you confirmed that your personal email was added? Are you still able to sign a second time with your personal email? |
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
@joecummings sorry about that, I signed again with my personal email and it seems like CLA issues are resolved now. |
Thank you so much for adding this @dzheng256. It's really cool to hear that you're using torchtune in your company. Could you possibly attach an example comet logging output to the PR to confirm everything looks GOOD? |
Could you also please shout out integration in the README.md here, please? Line 35 in 288ff44 |
Thanks for adding this. There was also an earlier PR to add CometLogger that including a quick tutorial on how to use it. Would you be able to add that in this PR as well? #910 |
Oh sorry, I hadn't realized there was an open PR implementing pretty much the same thing as this one. Should we stick with #910 then, since it came first? It also seems to be written by the a Comet dev so it might utilize the Comet API better. cc @Lothiraldan |
@dzheng256 I've pinged the other PR to see if he's still willing to work on it. If not, maybe you can incorporate the differences here? |
Yes sounds good, I will pull in the tutorial from the other PR if they no longer have the bandwidth to work on it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far, I will give it a try in the next days to generate an example public project.
@dzheng256 Feel free to pull the tutorial from my PR https://github.com/pytorch/torchtune/pull/910/files#diff-5c3c14a3a87e5106e63f046499c113cd5b5c0771ac548bcb5b097fc40342bd44, I don't think I have the permission to push to your fork. Once I have a new example project, I will share the links to update the tutorial
There is one open question, whether to use our new way of initializing Comet Experiments. This is a new way but it's about to be more globally released, we have it stable for a while now and we already updated several of our integrations to use it, last one was Transformers: huggingface/transformers#31366
torchtune/utils/metric_logging.py
Outdated
self.experiment = Experiment( | ||
project_name=project_name, | ||
workspace=workspace, | ||
log_code=log_code, | ||
**kwargs, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are moving towards a new way of initializing Comet Experiment with a new top-level function: comet_ml.start
. This has several benefits:
- It support online and offline logging
- It support logging a new experiment or appending to an existing Experiment
- You can set name and tags directly at initialization time
I'm working right now on finalizing the documentation but I can share the reference doc for that new function already: https://www.comet.com/docs/v2/api-and-sdk/python-sdk/reference/Experiment-Creation/
If there is a chance that users might want to append data to an existing experiment or create log data to an offline Experiment (to upload later), I would recommend using that instead. Please let me know, happy to provide the code corresponding code change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Lothiraldan, I swapped it over to the new comet_ml.start() and verified the latest code works for me. Could you use it to make an example project which I can then link in the deep dive tutorial?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lothiraldan I also added you as a collaborator to my fork, feel free to directly push changes if you would like!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've sent an update and I'm working on running a new example once I get my hands on a EC2 GPU instance
Co-authored-by: Boris Feld <lothiraldan@gmail.com>
torchtune/utils/metric_logging.py
Outdated
api_key (Optional[str]): Comet API key. It's recommended to configure the API Key from the environment. | ||
workspace (Optional[str]): Comet workspace name. If not provided, uses the default workspace. | ||
project (Optional[str]): Comet project name. Defaults to Uncategorized. | ||
experiment_name (Optional[str]): The name for comet experiment to be used for logging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
experiment_name (Optional[str]): The name for comet experiment to be used for logging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's still unhappy about this @Lothiraldan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm working on understanding why is it complaining. It only is complaining with the latest version of pydocstring.
torchtune/utils/metric_logging.py
Outdated
online (Optional[bool]): If True, the data will be logged to Comet server, otherwise it will be stored locally | ||
in offline experiment. Default is `True`. | ||
mode (Optional[str]): Control how the Comet experiment is started. "get": Continue logging to an existing | ||
experiment identified by the `experiment_key` value. "create": Always creates of a new experiment, useful | ||
for HPO sweeps. "get_or_create" (default): Starts a fresh experiment if required, or persists logging to | ||
an existing one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
online (Optional[bool]): If True, the data will be logged to Comet server, otherwise it will be stored locally | |
in offline experiment. Default is `True`. | |
mode (Optional[str]): Control how the Comet experiment is started. "get": Continue logging to an existing | |
experiment identified by the `experiment_key` value. "create": Always creates of a new experiment, useful | |
for HPO sweeps. "get_or_create" (default): Starts a fresh experiment if required, or persists logging to | |
an existing one. | |
mode (Optional[str]): Control how the Comet experiment is started. "get": Continue logging to an existing | |
experiment identified by the `experiment_key` value. "create": Always creates of a new experiment, useful | |
for HPO sweeps. "get_or_create" (default): Starts a fresh experiment if required, or persists logging to | |
an existing one. | |
online (Optional[bool]): If True, the data will be logged to Comet server, otherwise it will be stored locally | |
in offline experiment. Default is `True`. |
nit: ordering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I was wondering why the linting was failing initially.
@SalmanMohammadi I got back access to the EC2 instance and I've got the example running: https://www.comet.com/examples/comet-example-torchtune-mistral/fffb3036880e41b5af2df932db4d3578?compareXAxis=step&experiment-tab=panels&showOutliers=true&smoothing=0&xAxis=step. I took a screenshot while it's running, it will probably looks better after it finishes but we can merge with the current screenshot. I made some small changes and added back logging the configuration as a file, this way it's pretty easy to grab it from Comet and reproduce a previous run. I made several small commits, sorry about that. Let me know if you want me to squash them if you are not gonna squash the PR. |
Amazing!! Thank you. Feel free to send me an updated screenshot on Discord and I'm happy to add. |
torchtune/utils/metric_logging.py
Outdated
experiment_name (Optional[str]): Name of the experiment. If not provided, Comet will auto-generate a name. | ||
tags (Optional[List[str]]): Tags to associate with the experiment. | ||
log_code (bool): Whether to log the source code. Defaults to True. | ||
**kwargs: additional arguments to pass to Comet.start. See |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might need to type this (annoyingly)
**kwargs: additional arguments to pass to Comet.start. See | |
**kwargs: additional arguments to pass to Comet.start. See |
torchtune/utils/metric_logging.py
Outdated
experiment_name: Optional[str] = None, | ||
tags: Optional[List[str]] = None, | ||
log_code: bool = True, | ||
**kwargs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great overall. I have a lot of nit comments around RST docs rendering stuff, hope you don't mind :)
* How to use the :class:`~torchtune.utils.metric_logging.CometLogger` | ||
* How to log configs, metrics, and model checkpoints to Comet | ||
|
||
Torchtune supports logging your training runs to `Comet <https://www.comet.com/site/?utm_source=torchtune&utm_medium=docs&utm_content=docs>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we lowercase torchtune everywhere, even at the start of a sentence :)
|
||
# enable logging to the built-in CometLogger | ||
metric_logger: | ||
_component_: torchtune.utils.metric_logging.CometLogger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
|
||
.. note:: | ||
|
||
Click on this sample `Comet project to see how it will looks like after fine-tuning <https://www.comet.com/examples/comet-example-torchtune-mistral/>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Click on this sample `Comet project to see how it will looks like after fine-tuning <https://www.comet.com/examples/comet-example-torchtune-mistral/>`_. | |
Click on this sample `Comet project to see the logged metrics after fine-tuning <https://www.comet.com/examples/comet-example-torchtune-mistral/>`_. |
torchtune/utils/metric_logging.py
Outdated
experiment_key (Optional[str]): The key for comet experiment to be used for logging. Must be an alphanumeric | ||
string whose length is between 32 and 50 characters. | ||
mode (Optional[str]): Control how the Comet experiment is started. "get": Continue logging to an existing | ||
experiment identified by the `experiment_key` value. "create": Always creates of a new experiment, useful |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
experiment identified by the `experiment_key` value. "create": Always creates of a new experiment, useful | |
experiment identified by the ``experiment_key`` value. "create": Always creates of a new experiment, useful |
torchtune/utils/metric_logging.py
Outdated
for HPO sweeps. "get_or_create" (default): Starts a fresh experiment if required, or persists logging to | ||
an existing one. | ||
online (Optional[bool]): If True, the data will be logged to Comet server, otherwise it will be stored locally | ||
in offline experiment. Default is `True`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in offline experiment. Default is `True`. | |
in an offline experiment. Default is ``True``. |
torchtune/utils/metric_logging.py
Outdated
experiment_name (Optional[str]): Name of the experiment. If not provided, Comet will auto-generate a name. | ||
tags (Optional[List[str]]): Tags to associate with the experiment. | ||
log_code (bool): Whether to log the source code. Defaults to True. | ||
**kwargs: additional arguments to pass to Comet.start. See |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
**kwargs: additional arguments to pass to Comet.start. See | |
**kwargs: additional arguments to pass to ``Comet.start``. See |
torchtune/utils/metric_logging.py
Outdated
>>> logger.close() | ||
|
||
Raises: | ||
ImportError: If `comet_ml` package is not installed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ImportError: If `comet_ml` package is not installed. | |
ImportError: If ``comet_ml`` package is not installed. |
torchtune/utils/metric_logging.py
Outdated
You can install it with ``pip install comet_ml``. | ||
You need to set up your Comet.ml API key before using this logger. | ||
You can do this by setting the COMET_API_KEY environment variable | ||
or by calling ``comet_ml.login()`` with your API key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
personally I find comet login the easiest to do and is more comfortable for folks who don't know much code. What are your thoughts on just always recommending to setup via comet login so folks do not get confused with the different options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comet login
works great in interactive situations but won't help in automated setups. I'm happy to highlight the comet login
option and links to other options in the Comet docs if that works for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as in #1221 (comment), if we expect users to be setting up a config and doing tune run ...
how are they supposed to use comet.login()
? We would have to call that inside the logger, and incorporate some API key via a secrets manager solution which would be different for every user.
As far as I can tell torchtune seems to heavily favor the CLI tune run
style usage, rather than importing it and calling tune run programmatically. If this is indeed the case, I think we should actually remove the mention to comet.login()
and have people export the api key, what do you guys think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I was under the impression you can run comet login
from CLI, is this not the case?
Either way I agree with @dzheng256, since CLI is the primary entry point for most users via tune run
I would lean towards suggesting whatever fits that workflow best whether that's setting environment variables or a CLI command. I suppose for automated setups environment variable is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RdoubleA @dzheng256 Sorry about the confusion, there is bot a comet login
command-line script and comet_ml.login
that one can call in their Python script. I will clean up and double-check everything to point people to use the CLI comet login
.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1221 +/- ##
===========================================
- Coverage 67.81% 27.13% -40.68%
===========================================
Files 219 255 +36
Lines 9908 11799 +1891
===========================================
- Hits 6719 3202 -3517
- Misses 3189 8597 +5408 ☔ View full report in Codecov by Sentry. |
class TestCometLogger: | ||
def test_log(self) -> None: | ||
with patch("comet_ml.Experiment") as mock_experiment: | ||
logger = CometLogger(project_name="test_project") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger = CometLogger(project_name="test_project") | |
logger = CometLogger(project="test_project") |
|
||
def test_log_dict(self) -> None: | ||
with patch("comet_ml.Experiment") as mock_experiment: | ||
logger = CometLogger(project_name="test_project") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger = CometLogger(project_name="test_project") | |
logger = CometLogger(project="test_project") |
|
||
def test_log_config(self) -> None: | ||
with patch("comet_ml.Experiment") as mock_experiment: | ||
logger = CometLogger(project_name="test_project") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger = CometLogger(project_name="test_project") | |
logger = CometLogger(project="test_project") |
Also update the last uppercase references to Torchtune
Hi @SalmanMohammadi, I've pushed fixes for the unit tests, updated the links and screenshots, tried to clarify the docstring about I think that we are all good now but please let me know if I missed anything |
README.md
Outdated
@@ -30,6 +30,7 @@ torchtune focuses on integrating with popular tools and libraries from the ecosy | |||
- [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html) for distributed training | |||
- [torchao](https://github.com/pytorch-labs/ao) for lower precision dtypes and [post-training quantization](recipes/quantize.py) techniques | |||
- [Weights & Biases](https://wandb.ai/site) for [logging](https://pytorch.org/torchtune/main/deep_dives/wandb_logging.html) metrics and checkpoints, and tracking training progress | |||
- [CometML](https://www.comet.com/site/) as another option for [logging](https://pytorch.org/torchtune/stable/api_ref_utilities.html#metric-logging) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- [CometML](https://www.comet.com/site/) as another option for [logging](https://pytorch.org/torchtune/stable/api_ref_utilities.html#metric-logging) | |
- [CometML](https://www.comet.com/site/) as another option for [logging](https://pytorch.org/torchtune/main/deep_dives/comet_logging.html) |
one last minor nit, I promise
Thank you so much for your patience in addressing all our comments, and for your contributions @dzheng256, @Lothiraldan. This looks fantastic to me. One last tiny nit and feel free to merge once CI is green. |
@SalmanMohammadi Good catch, thanks you and @RdoubleA for all of your time and reviews. I hope that my millions of small commits were not spamming your inbox too much 😅 |
Thanks everyone! Looking forward to deleting my internal CometLogger and switching to |
Context
What is the purpose of this PR? Is it to
My company uses torchtune for LLM finetuning but we use comet.ml for experiment tracking, not tensorboard or weights and biases. Adding a
CometLogger
class to make it easier for other Comet users to use torchtune with Comet.Changelog
CometLogger
classCometLogger
TestCometLogger
testTest plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)
pre-commit install
)pytest tests
pytest tests -m integration_test