
[Ray component: Tune.logger] Better integration with MLflow through the MLflowLoggerCallback #47903

Open
guillaumeguy opened this issue Oct 4, 2024 · 2 comments

guillaumeguy commented Oct 4, 2024

Description

We are using the MLflow callback in the way prescribed by the documentation.

A few things are missing:
1. MLflow supports system tags such as the run note (e.g. "this trial changes this param for reason X,Y,Z"). However, Ray doesn't expose a way to set them; the tag is passed through as a regular, user-level tag instead.
Code:

    callbacks=[
        MLflowLoggerCallback(
            tracking_uri=os.environ["MLFLOW_TRACKING_URI"],
            experiment_name=base_config["experiment"]["name"],
            save_artifact=True,
            tags={"note.content": "this trial changes this param for reason X,Y,Z"},
        )
    ]

2. Could Ray also pick up the user name to populate the MLflow UI? Right now it shows as "unknown", and once again it can't be overridden.

3. Is it possible to make the checkpoint an MLflow artifact as well? Today, the user is responsible for calling the MLflow backend, getting the trial name, and figuring out a way to get to the artifact (we save it in S3), which is non-obvious because Ray appends a few random numbers to the trial name (TorchTrainer_b6f96_00000 may become TorchTrainer_b6f96_00000_5439_4343, for instance). A possible workaround is sketched after the checkpointing code below.

Checkpointing code:

    # Save checkpoint
    import pickle

    from ray.train import Checkpoint

    with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
        checkpoint = None
        if global_rank == 0 and (
            (epoch + 1) % config["training"]["save_every"] == 0
            or epoch == EPOCHS - 1
        ):
            # This saves the model to the trial directory
            torch.save(
                model.state_dict(), os.path.join(temp_checkpoint_dir, "model.pth")
            )

            # Pickle the vocab alongside the model weights
            with open(
                os.path.join(temp_checkpoint_dir, "multi_vocab.pkl"), "wb"
            ) as f:
                pickle.dump(multi_vocab, f)

            checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

        print("sending checkpoint data")
        # Send the current training result back to Tune
        ray.train.report(eval_perf, checkpoint=checkpoint)
        print("Done sending")

Use case

Better integration with MLflow.

@guillaumeguy guillaumeguy added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 4, 2024
guillaumeguy (author) commented:

Ah, found the solution for items 1 and 2. The tags need the mlflow. prefix:

    MLflowLoggerCallback(
        tracking_uri=os.environ["MLFLOW_TRACKING_URI"],
        experiment_name=base_config["experiment"]["name"],
        save_artifact=True,
        tags={
            "mlflow.note.content": "this trial changes this param for reason X,Y,Z",
            "mlflow.user": "XXX",
        },
    )
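
For reference, this works because MLflow reserves the mlflow. namespace for system tags. Outside of Ray, the same note and user can be set on a run directly (a standalone sketch, assuming a tracking server is already configured via MLFLOW_TRACKING_URI):

    import mlflow

    with mlflow.start_run() as run:
        # mlflow.note.content backs the run description box in the MLflow UI
        mlflow.set_tag("mlflow.note.content", "this trial changes this param for reason X,Y,Z")
        # mlflow.user is what the UI shows as the run's user
        mlflow.set_tag("mlflow.user", "XXX")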

@anyscalesam anyscalesam added the tune Tune-related issues label Oct 7, 2024
@justinvyu justinvyu added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 10, 2024
@justinvyu justinvyu self-assigned this Oct 10, 2024

CtrlMj commented Nov 30, 2024

I am facing the same issue.
It seems it used to be possible to find the desired Ray checkpoint directly from the MLflow run_id in the now-deprecated ray[air] API, like so:

    import mlflow
    from urllib.parse import urlparse

    from ray.air import Result

    # Get the local path backing the run's artifacts from MLflow
    artifact_dir = urlparse(mlflow.get_run(run_id).info.artifact_uri).path
    results = Result.from_path(artifact_dir)
    results.best_checkpoints[0][0]
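
Without the deprecated ray.air Result API, the closest I can sketch is going through the checkpoint directories directly. This assumes, as in the snippet above, that the run's artifact URI points at the trial directory, and that Ray's checkpoint_* directory naming still applies:

    import glob
    import os
    from urllib.parse import urlparse

    import mlflow
    from ray.train import Checkpoint

    # Resolve the trial directory from MLflow, then pick the newest checkpoint folder
    artifact_dir = urlparse(mlflow.get_run(run_id).info.artifact_uri).path
    checkpoint_dirs = sorted(glob.glob(os.path.join(artifact_dir, "checkpoint_*")))
    checkpoint = Checkpoint.from_directory(checkpoint_dirs[-1])

Note that this grabs the latest checkpoint rather than the best one; recovering best_checkpoints would still require the reported metrics.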
