
[Ray component: Tune.logger] Better integration with MLflow through the MLflowLoggerCallback #47903

Open
guillaumeguy opened this issue Oct 4, 2024 · 2 comments

guillaumeguy commented Oct 4, 2024

Description

We are using the MLflow callback in the way prescribed by the documentation.

A few things are missing:
1. MLflow supports system tags such as the run note (e.g. "this trial changes this param for reason X,Y,Z"). However, Ray doesn't expose a way to set them; the tag is passed through as a regular, user-level tag instead.
Code:

    callbacks=[
        MLflowLoggerCallback(
            tracking_uri=os.environ["MLFLOW_TRACKING_URI"],
            experiment_name=base_config["experiment"]["name"],
            save_artifact=True,
            tags={"note.content": "this trial changes this param for reason X,Y,Z"},
        )
    ]

2. Could Ray also pick up the user name to populate the MLflow UI? Right now it shows as "unknown", and once again it can't be overridden.

3. Is it possible to make the checkpoint an MLflow artifact as well? Today, the user is responsible for calling the MLflow backend, getting the trial name, and figuring out a way to get to the artifact (we save it in S3), which is non-obvious because Ray appends a few random numbers to the trial name (TorchTrainer_b6f96_00000 may become TorchTrainer_b6f96_00000_5439_4343, for instance). A possible workaround is sketched after the checkpointing code below.

Checkpointing code:

    # Save checkpoint
    import pickle

    from ray.train import Checkpoint

    with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
        checkpoint = None
        if global_rank == 0 and (
            (epoch + 1) % config["training"]["save_every"] == 0
            or epoch == EPOCHS - 1
        ):
            # This saves the model to the trial directory
            torch.save(
                model.state_dict(), os.path.join(temp_checkpoint_dir, "model.pth")
            )

            # Pickle the vocab alongside the model weights
            with open(
                os.path.join(temp_checkpoint_dir, "multi_vocab.pkl"), "wb"
            ) as f:
                pickle.dump(multi_vocab, f)

            checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

        print("sending checkpoint data")
        # Send the current training result back to Tune
        ray.train.report(eval_perf, checkpoint=checkpoint)
        print("Done sending")

Use case

Better integration with MLflow.

@guillaumeguy guillaumeguy added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 4, 2024
guillaumeguy (author) commented:

Ah, found the solution for items 1 and 2. The tags need the mlflow. prefix:

    MLflowLoggerCallback(
        tracking_uri=os.environ["MLFLOW_TRACKING_URI"],
        experiment_name=base_config["experiment"]["name"],
        save_artifact=True,
        tags={
            "mlflow.note.content": "this trial changes this param for reason X,Y,Z",
            "mlflow.user": "XXX",
        },
    )
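
For reference, this works because MLflow reserves the mlflow. namespace for system tags. Outside of Ray, the same note and user can be set on a run directly (a standalone sketch, assuming a tracking server is already configured via MLFLOW_TRACKING_URI):

    import mlflow

    with mlflow.start_run() as run:
        # mlflow.note.content backs the run description box in the MLflow UI
        mlflow.set_tag("mlflow.note.content", "this trial changes this param for reason X,Y,Z")
        # mlflow.user is what the UI shows as the run's user
        mlflow.set_tag("mlflow.user", "XXX")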

@anyscalesam anyscalesam added the tune Tune-related issues label Oct 7, 2024
@justinvyu justinvyu added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 10, 2024
@justinvyu justinvyu self-assigned this Oct 10, 2024

CtrlMj commented Nov 30, 2024

I am facing the same issue.
It seems it used to be possible to find the desired Ray checkpoint directly from the MLflow run_id in the now-deprecated ray[air] API, like so:

    import mlflow
    from urllib.parse import urlparse

    from ray.air import Result

    # Get the local path backing the run's artifacts from MLflow
    artifact_dir = urlparse(mlflow.get_run(run_id).info.artifact_uri).path
    results = Result.from_path(artifact_dir)
    results.best_checkpoints[0][0]
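
Without the deprecated ray.air Result API, the closest I can sketch is going through the checkpoint directories directly. This assumes, as in the snippet above, that the run's artifact URI points at the trial directory, and that Ray's checkpoint_* directory naming still applies:

    import glob
    import os
    from urllib.parse import urlparse

    import mlflow
    from ray.train import Checkpoint

    # Resolve the trial directory from MLflow, then pick the newest checkpoint folder
    artifact_dir = urlparse(mlflow.get_run(run_id).info.artifact_uri).path
    checkpoint_dirs = sorted(glob.glob(os.path.join(artifact_dir, "checkpoint_*")))
    checkpoint = Checkpoint.from_directory(checkpoint_dirs[-1])

Note that this grabs the latest checkpoint rather than the best one; recovering best_checkpoints would still require the reported metrics.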
