Refactor codes and tests to match kedro==0.19 conventions #467

Merged · 4 commits · Oct 25, 2023 (changes from all commits)
19 changes: 12 additions & 7 deletions CHANGELOG.md
@@ -2,19 +2,24 @@

## [Unreleased]

### Added

- :sparkles: Add support for python 3.11 ([#450, rxm7706](https://github.com/Galileo-Galilei/kedro-mlflow/pull/450))

### Changed

+- :boom: :sparkles: Change default ``copy_mode`` to ``"assign"`` in ``KedroPipelineModel`` because this is the most efficient setup (and usually the desired one) when serving a Kedro ``Pipeline`` as a Mlflow model. This is different from Kedro's default which is to deepcopy the dataset ([#463, ShubhamZoro](https://github.com/Galileo-Galilei/kedro-mlflow/pull/463)).
+- :boom: :recycle: ``MlflowArtifactDataset.__init__`` method ``data_set`` argument is renamed ``dataset`` to match new Kedro conventions ([#391](https://github.com/Galileo-Galilei/kedro-mlflow/pull/391)).
- :boom: :recycle: Rename the following ``DataSets`` with the ``Dataset`` suffix (without capitalized ``S``) to match new kedro conventions from ``kedro>=0.19`` and onwards ([#439, ShubhamZoro](https://github.com/Galileo-Galilei/kedro-mlflow/pull/439)):
- ``MlflowArtifactDataSet``->``MlflowArtifactDataset``
- ``MlflowAbstractModelDataSet``->``MlflowAbstractModelDataset``
- ``MlflowModelRegistryDataSet``->``MlflowModelRegistryDataset``
- ``MlflowMetricDataSet``->``MlflowMetricDataset``
- ``MlflowMetricHistoryDataSet``->``MlflowMetricHistoryDataset``
- ``MlflowMetricsDataSet``->``MlflowMetricsDataset``
+- :boom: :recycle: Rename the following ``DataSets`` to make their use more explicit, and use the ``Dataset`` suffix ([#465, ShubhamZoro](https://github.com/Galileo-Galilei/kedro-mlflow/pull/465)):
+- ``MlflowModelLoggerDataSet``->``MlflowModelTrackingDataset``
+- ``MlflowModelSaverDataSet``->``MlflowModelLocalFileSystemDataset``

-- :boom: :sparkles: Change default ``copy_mode`` to ``"assign"`` in ``KedroPipelineModel`` because this is the most efficient setup (and usually the desired one) when serving a Kedro ``Pipeline`` as a Mlflow model. This is different from Kedro's default which is to deepcopy the dataset ([#463, ShubhamZoro](https://github.com/Galileo-Galilei/kedro-mlflow/pull/463)).
-- :boom: :recycle: Rename the following ``DataSets`` to make their use more explicit, and use the ``Dataset`` suffix:
-- ``MlflowModelLoggerDataSet``->``MlflowModelTrackingDataset`` ([#391](https://github.com/Galileo-Galilei/kedro-mlflow/pull/391))
-- ``MlflowModelSaverDataSet``->``MlflowModelLocalFileSystemDataset`` ([#391](https://github.com/Galileo-Galilei/kedro-mlflow/pull/391))
- ``MlflowMetricsDataSet``->``MlflowMetricsHistoryDataset`` ([#440](https://github.com/Galileo-Galilei/kedro-mlflow/pull/440))

## [0.11.10] - 2023-10-03

@@ -334,7 +339,7 @@
- `get_mlflow_config` now works in interactive mode if `load_context` is called with a path different from the working directory ([#30](https://github.com/Galileo-Galilei/kedro-mlflow/issues/30))
- ``kedro_mlflow`` now works fine with ``kedro jupyter notebook`` independently of the working directory ([#64](https://github.com/Galileo-Galilei/kedro-mlflow/issues/64))
- You can use global variables in `mlflow.yml` which is now properly parsed if you use a `TemplatedConfigLoader` ([#72](https://github.com/Galileo-Galilei/kedro-mlflow/issues/72))
-- :bug: `MlflowMetricsDataset` now saves in the specified `run_id` instead of the current one when the prefix is not specified ([#62](https://github.com/Galileo-Galilei/kedro-mlflow/issues/62))
+- :bug: `MlflowMetricsHistoryDataset` now saves in the specified `run_id` instead of the current one when the prefix is not specified ([#62](https://github.com/Galileo-Galilei/kedro-mlflow/issues/62))
- :memo: Other bug fixes and documentation improvements ([#6](https://github.com/Galileo-Galilei/kedro-mlflow/issues/6), [#99](https://github.com/Galileo-Galilei/kedro-mlflow/issues/99))

### Changed
2 changes: 1 addition & 1 deletion README.md
@@ -54,7 +54,7 @@ The documentation contains:

Some frequently asked questions on more advanced features:

-- You want to log additional metrics to the run? -> [Try ``MlflowMetricsDataset``](https://kedro-mlflow.readthedocs.io/en/stable/source/04_experimentation_tracking/05_version_metrics.html) !
+- You want to log additional metrics to the run? -> [Try ``MlflowMetricsHistoryDataset``](https://kedro-mlflow.readthedocs.io/en/stable/source/04_experimentation_tracking/05_version_metrics.html) !
- You want to log nice dataviz of your pipeline that you register with ``MatplotlibWriter``? -> [Try ``MlflowArtifactDataset`` to log any local files (.png, .pkl, .csv...) *automagically*](https://kedro-mlflow.readthedocs.io/en/stable/source/04_experimentation_tracking/03_version_datasets.html)!
- You want to easily create an API to share your awesome model with anyone? -> [See if ``pipeline_ml_factory`` can fit your needs](https://github.com/Galileo-Galilei/kedro-mlflow/issues/16)
- You want to do something that is not straightforward with the current implementation? Open an issue, and let's see what happens!
2 changes: 1 addition & 1 deletion docs/source/01_introduction/02_motivation.md
@@ -41,7 +41,7 @@ Above implementations have the advantage of being very straightforward and *mlfl
| Logging parameters | ``mlflow.yml`` | ``MlflowHook`` |
| Logging artifacts | ``catalog.yml`` | ``MlflowArtifactDataset`` |
| Logging models | ``catalog.yml`` | `MlflowModelTrackingDataset` and `MlflowModelLocalFileSystemDataset` |
-| Logging metrics | ``catalog.yml`` | ``MlflowMetricsDataset`` |
+| Logging metrics | ``catalog.yml`` | ``MlflowMetricsHistoryDataset`` |
| Logging Pipeline as model | ``hooks.py`` | ``KedroPipelineModel`` and ``pipeline_ml_factory`` |

`kedro-mlflow` does not currently provide an interface to set tags outside a Kedro ``Pipeline``. Some of the above decisions are subject to debate and design trade-offs (for instance, metrics are often updated in a loop during each epoch / training iteration, and it does not always make sense to register the metric between computation steps, e.g. as an I/O operation after a node run).
2 changes: 1 addition & 1 deletion docs/source/02_installation/03_migration_guide.md
@@ -101,7 +101,7 @@ Replace the following entries:
| old | new |
| :-------------------------------------- | :------------------------------------------------ |
| `kedro_mlflow.io.MlflowArtifactDataset` | `kedro_mlflow.io.artifacts.MlflowArtifactDataset` |
-| `kedro_mlflow.io.MlflowMetricsDataset` | `kedro_mlflow.io.metrics.MlflowMetricsDataset` |
+| `kedro_mlflow.io.MlflowMetricsHistoryDataset` | `kedro_mlflow.io.metrics.MlflowMetricsHistoryDataset` |

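In practice this migration is usually a pure import-path update. A minimal sketch (hypothetical project code; the ``training`` prefix is illustrative):

```python
# old import path (kedro-mlflow before the module split):
# from kedro_mlflow.io import MlflowMetricsDataset
# new import path:
from kedro_mlflow.io.metrics import MlflowMetricsHistoryDataset

# behaves as before, only the module path and class name changed
metrics_ds = MlflowMetricsHistoryDataset(prefix="training")
```
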
### Hooks

2 changes: 1 addition & 1 deletion docs/source/03_getting_started/02_first_steps.md
@@ -141,7 +141,7 @@ example_iris_data:

```yaml
example_model:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
-    data_set:
+    dataset:
        type: pickle.PickleDataset
        filepath: data/06_models/trained_model.pkl
```
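
For interactive exploration, a hedged Python equivalent of the ``example_model`` entry above (session and run handling are left to the caller; paths mirror the yaml):

```python
from kedro_datasets.pickle import PickleDataset
from kedro_mlflow.io.artifacts import MlflowArtifactDataset

example_model = MlflowArtifactDataset(
    dataset={"type": PickleDataset, "filepath": "data/06_models/trained_model.pkl"}
)
# example_model.save(trained_model) writes the pickle locally,
# then logs it as an artifact of the active mlflow run.
```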
16 changes: 8 additions & 8 deletions docs/source/04_experimentation_tracking/03_version_datasets.md
@@ -29,7 +29,7 @@ You can change it to:
```yaml
my_dataset_to_version:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
-    data_set:
+    dataset:
        type: pandas.CSVDataset # or any valid kedro DataSet
        filepath: /path/to/a/LOCAL/destination/file.csv # must be a local file, wherever you want to log the data in the end
```
@@ -40,19 +40,19 @@ and this dataset will be automatically versioned in each pipeline execution.

### Can I pass extra parameters to the ``MlflowArtifactDataset`` for finer control?

-The ``MlflowArtifactDataset`` takes a ``data_set`` argument which is a python dictionary passed to the ``__init__`` method of the dataset declared in ``type``. It means that you can pass any argument accepted by the underlying dataset in this dictionary. If you want to pass ``load_args`` and ``save_args`` in the previous example, add them in the ``data_set`` argument:
+The ``MlflowArtifactDataset`` takes a ``dataset`` argument which is a python dictionary passed to the ``__init__`` method of the dataset declared in ``type``. It means that you can pass any argument accepted by the underlying dataset in this dictionary. If you want to pass ``load_args`` and ``save_args`` in the previous example, add them in the ``dataset`` argument:

```yaml
my_dataset_to_version:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
-    data_set:
+    dataset:
        type: pandas.CSVDataset # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv
        load_args:
            sep: ;
        save_args:
            sep: ;
-        # ... any other valid arguments for data_set
+        # ... any other valid arguments for dataset
```

### Can I use the ``MlflowArtifactDataset`` in interactive mode?
@@ -64,7 +64,7 @@

```python
from kedro_mlflow.io.artifacts import MlflowArtifactDataset
from kedro_datasets.pandas import CSVDataset

csv_dataset = MlflowArtifactDataset(
-    data_set={
+    dataset={
"type": CSVDataset, # either a string "pandas.CSVDataset" or the class
"filepath": r"/path/to/a/local/destination/file.csv",
}
)
```

@@ -90,7 +90,7 @@ The ``MlflowArtifactDataset`` has an extra attribute ``run_id`` which specifies
```yaml
my_dataset_to_version:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
-    data_set:
+    dataset:
        type: pandas.CSVDataset # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv # must be a local filepath, no matter what is your actual mlflow storage (S3 or other)
    run_id: 13245678910111213 # a valid mlflow run to log in. If None, default to active run
```

@@ -105,7 +105,7 @@ You may want to reuse the artifact of a previous run in another one,
```yaml
my_dataset_to_reload:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
-    data_set:
+    dataset:
        type: pandas.CSVDataset # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv # must be a local filepath, no matter what is your actual mlflow storage (S3 or other)
    run_id: 13245678910111213 # a valid mlflow run with the existing artifact. It must be named "file.csv"
```

@@ -120,7 +120,7 @@ With the below example, the artifact will be logged in mlflow within a `reporting` f
```yaml
my_dataset_to_version:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
-    data_set:
+    dataset:
        type: pandas.CSVDataset # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv
    artifact_path: reporting # relative path where the remote artifact must be stored. if None, saved in root folder.
```
@@ -59,7 +59,7 @@ If you want to save your model both locally and remotely within the same run, yo
```yaml
sklearn_model:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
-    data_set:
+    dataset:
        type: kedro_mlflow.io.models.MlflowModelLocalFileSystemDataset
        flavor: mlflow.sklearn
        filepath: data/06_models/sklearn_model
```
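
A hedged Python equivalent of the entry above, assuming a fitted scikit-learn model and an active run (the data and model are illustrative):

```python
import mlflow
from sklearn.linear_model import LogisticRegression
from kedro_mlflow.io.artifacts import MlflowArtifactDataset

sklearn_model = MlflowArtifactDataset(
    dataset={
        "type": "kedro_mlflow.io.models.MlflowModelLocalFileSystemDataset",
        "flavor": "mlflow.sklearn",
        "filepath": "data/06_models/sklearn_model",
    }
)
with mlflow.start_run():
    # saves the model locally under data/06_models/, then logs it as an artifact
    sklearn_model.save(LogisticRegression().fit([[0.0], [1.0]], [0, 1]))
```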
14 changes: 7 additions & 7 deletions docs/source/04_experimentation_tracking/05_version_metrics.md
@@ -9,7 +9,7 @@ MLflow defines a metric as "a (key, value) pair, where the value is numeric". Ea
`kedro-mlflow` introduces 3 ``AbstractDataset`` implementations to manage metrics (a minimal interactive sketch follows this list):
- ``MlflowMetricDataset`` which can log a float as a metric
- ``MlflowMetricHistoryDataset`` which can log the evolution over time of a given metric, e.g. a list or a dict of floats.
-- ``MlflowMetricsDataset``. It is a wrapper around a dictionary with metrics which is returned by node and log metrics in MLflow.
+- ``MlflowMetricsHistoryDataset``. It is a wrapper around a dictionary of metrics returned by a node, and it logs them in MLflow.
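
A minimal interactive sketch of the simplest case, ``MlflowMetricDataset`` (the run handling and the ``accuracy`` key are illustrative, not prescribed by the API):

```python
import mlflow
from kedro_mlflow.io.metrics import MlflowMetricDataset

metric_ds = MlflowMetricDataset(key="accuracy")
with mlflow.start_run():
    metric_ds.save(0.93)  # logs the float 0.93 under the metric key "accuracy"
```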

### Saving a single float as a metric with ``MlflowMetricDataset``

@@ -148,13 +148,13 @@

```yaml
my_model_metric:
    # ... keys elided in the diff ...
    save_args:
        mode: ... # OPTIONAL: "list" by default, one of {"list", "dict", "history"}
```
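
A matching interactive sketch for the history case, relying only on the default ``"list"`` mode (run handling and values are illustrative):

```python
import mlflow
from kedro_mlflow.io.metrics import MlflowMetricHistoryDataset

history_ds = MlflowMetricHistoryDataset(key="loss")
with mlflow.start_run():
    history_ds.save([0.9, 0.7, 0.6])  # one logged value per step of the history
```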

-### Saving several metrics with their entire history with ``MlflowMetricsDataset``
+### Saving several metrics with their entire history with ``MlflowMetricsHistoryDataset``

Since it is an ``AbstractDataset``, it can be used with the YAML API. You can define it in your ``catalog.yml`` as:

```yaml
my_model_metrics:
-    type: kedro_mlflow.io.metrics.MlflowMetricsDataset
+    type: kedro_mlflow.io.metrics.MlflowMetricsHistoryDataset
```

You can provide a prefix key, which is useful in situations like when you have multiple nodes producing metrics with the same names which you want to distinguish. If you are using the ``MlflowHook``, it will handle that automatically for you by using the dataset name in the catalog as the prefix. In the example above, the prefix would be ``my_model_metrics``.
@@ -163,7 +163,7 @@ Let's look at an example with a custom prefix:

```yaml
my_model_metrics:
-    type: kedro_mlflow.io.metrics.MlflowMetricsDataset
+    type: kedro_mlflow.io.metrics.MlflowMetricsHistoryDataset
    prefix: foo
```

@@ -179,7 +179,7 @@ def metrics_node() -> Dict[str, Union[float, List[float]]]:
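
The body of ``metrics_node`` is elided in the diff; here is a self-contained sketch of such a node (metric names and values are illustrative):

```python
from typing import Dict, List, Union


def metrics_node() -> Dict[str, Union[float, List[float]]]:
    # a float is logged as a single value; a list is logged as a history
    return {
        "metric1": 0.1,
        "metric2": [0.1, 0.2, 0.3],
    }
```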

-As you can see above, ``kedro_mlflow.io.metrics.MlflowMetricsDataset`` can take metrics as:
+As you can see above, ``kedro_mlflow.io.metrics.MlflowMetricsHistoryDataset`` can take metrics as:

- ``Dict[str, key]``
- ``List[Dict[str, key]]``
@@ -188,7 +188,7 @@ To store metrics, we need to define a metrics dataset in the Kedro catalog:

```yaml
my_model_metrics:
-    type: kedro_mlflow.io.metrics.MlflowMetricsDataset
+    type: kedro_mlflow.io.metrics.MlflowMetricsHistoryDataset
```

Within a kedro run, the ``MlflowHook`` will automatically prefix the metrics datasets with their name in the catalog. In our example, the metrics will be stored in Mlflow with the following keys: ``my_model_metrics.metric1``, ``my_model_metrics.metric2``.
@@ -197,7 +197,7 @@ It is also possible to provide a prefix manually:

```yaml
my_model_metrics:
-    type: kedro_mlflow.io.metrics.MlflowMetricsDataset
+    type: kedro_mlflow.io.metrics.MlflowMetricsHistoryDataset
    prefix: foo
```
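
With this entry, the metrics of the example node are logged in mlflow as ``foo.metric1`` and ``foo.metric2``, the manual prefix replacing the catalog name.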

8 changes: 4 additions & 4 deletions docs/source/07_python_objects/01_DataSets.md
@@ -7,7 +7,7 @@
```yaml
my_dataset_to_version:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
-    data_set:
+    dataset:
        type: pandas.CSVDataset # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv
```
@@ -17,14 +17,14 @@ or with additional parameters:
```yaml
my_dataset_to_version:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
-    data_set:
+    dataset:
        type: pandas.CSVDataset # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv
        load_args:
            sep: ;
        save_args:
            sep: ;
-        # ... any other valid arguments for data_set
+        # ... any other valid arguments for dataset
    run_id: 13245678910111213 # a valid mlflow run to log in. If None, default to active run
    artifact_path: reporting # relative path where the artifact must be stored. if None, saved in root folder.
```
@@ -36,7 +36,7 @@

```python
import pandas as pd  # needed for the save() call below

from kedro_mlflow.io.artifacts import MlflowArtifactDataset
from kedro_datasets.pandas import CSVDataset

csv_dataset = MlflowArtifactDataset(
data_set={"type": CSVDataset, "filepath": r"/path/to/a/local/destination/file.csv"}
dataset={"type": CSVDataset, "filepath": r"/path/to/a/local/destination/file.csv"}
)
csv_dataset.save(data=pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
```
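
On ``save``, the csv is first written to the local ``filepath``, then logged as an artifact of the active mlflow run.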
23 changes: 13 additions & 10 deletions kedro_mlflow/framework/hooks/mlflow_hook.py
@@ -27,7 +27,7 @@
from kedro_mlflow.io.metrics import (
    MlflowMetricDataset,
    MlflowMetricHistoryDataset,
-    MlflowMetricsDataset,
+    MlflowMetricsHistoryDataset,
)
from kedro_mlflow.mlflow import KedroPipelineModel
from kedro_mlflow.pipeline.pipeline_ml import PipelineML
@@ -113,7 +113,7 @@ def after_context_created(
            experiment_name = context._package_name
            if experiment_name is None:
                # context._package_name may be None if the session is created interactively
-                metadata = _get_project_metadata(context._project_path)
+                metadata = _get_project_metadata(context.project_path)
                experiment_name = metadata.package_name
            mlflow_config.tracking.experiment.name = experiment_name

Expand All @@ -140,40 +140,43 @@ def after_catalog_created(
):
# we use this hooks to modif "MlflowmetricsDataset" to ensure consistency
# of the metric name with the catalog name
for name, dataset in catalog._data_sets.items():
if isinstance(dataset, MlflowMetricsDataset) and dataset._prefix is None:
for name, dataset in catalog._datasets.items():
if (
isinstance(dataset, MlflowMetricsHistoryDataset)
and dataset._prefix is None
):
if dataset._run_id is not None:
catalog._data_sets[name] = MlflowMetricsDataset(
catalog._datasets[name] = MlflowMetricsHistoryDataset(
run_id=dataset._run_id, prefix=name
)
else:
catalog._data_sets[name] = MlflowMetricsDataset(prefix=name)
catalog._datasets[name] = MlflowMetricsHistoryDataset(prefix=name)

if isinstance(dataset, MlflowMetricDataset) and dataset.key is None:
if dataset._run_id is not None:
catalog._data_sets[name] = MlflowMetricDataset(
catalog._datasets[name] = MlflowMetricDataset(
run_id=dataset._run_id,
key=name,
load_args=dataset._load_args,
save_args=dataset._save_args,
)
else:
catalog._data_sets[name] = MlflowMetricDataset(
catalog._datasets[name] = MlflowMetricDataset(
key=name,
load_args=dataset._load_args,
save_args=dataset._save_args,
)

if isinstance(dataset, MlflowMetricHistoryDataset) and dataset.key is None:
if dataset._run_id is not None:
catalog._data_sets[name] = MlflowMetricHistoryDataset(
catalog._datasets[name] = MlflowMetricHistoryDataset(
run_id=dataset._run_id,
key=name,
load_args=dataset._load_args,
save_args=dataset._save_args,
)
else:
catalog._data_sets[name] = MlflowMetricHistoryDataset(
catalog._datasets[name] = MlflowMetricHistoryDataset(
key=name,
load_args=dataset._load_args,
save_args=dataset._save_args,
Expand Down
12 changes: 6 additions & 6 deletions kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py
@@ -15,20 +15,20 @@ class MlflowArtifactDataset(AbstractVersionedDataset):

    def __new__(
        cls,
-        data_set: Union[str, Dict],
+        dataset: Union[str, Dict],
        run_id: str = None,
        artifact_path: str = None,
        credentials: Dict[str, Any] = None,
    ):
-        data_set, data_set_args = parse_dataset_definition(config=data_set)
+        dataset, dataset_args = parse_dataset_definition(config=dataset)

        # fake inheritance : this mlflow class should be a mother class which wraps
        # all dataset (i.e. it should replace AbstractVersionedDataset)
        # instead and since we can't modify the core package,
-        # we create a subclass which inherits dynamically from the data_set class
-        class MlflowArtifactDatasetChildren(data_set):
+        # we create a subclass which inherits dynamically from the dataset class
+        class MlflowArtifactDatasetChildren(dataset):
            def __init__(self, run_id, artifact_path):
-                super().__init__(**data_set_args)
+                super().__init__(**dataset_args)
                self.run_id = run_id
                self.artifact_path = artifact_path
                self._logging_activated = True
@@ -134,7 +134,7 @@ def _load(self) -> Any:  # pragma: no cover
                return super()._load()

        # rename the class
-        parent_name = data_set.__name__
+        parent_name = dataset.__name__
        MlflowArtifactDatasetChildren.__name__ = f"Mlflow{parent_name}"
        MlflowArtifactDatasetChildren.__qualname__ = (
            f"{parent_name}.Mlflow{parent_name}"
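
For readers unfamiliar with the trick used in ``__new__`` above, here is a standalone sketch of the dynamic subclassing pattern (the names and the ``save`` hook are illustrative, not the library's actual implementation):

```python
def make_mlflow_wrapper(base: type) -> type:
    """Build a subclass of a dataset class chosen at runtime."""

    class Wrapper(base):
        def save(self, data):
            super().save(data)  # delegate to the wrapped dataset first
            # ... then perform the extra mlflow logging here ...

    # rename the class so reprs and logs show e.g. "MlflowCSVDataset"
    Wrapper.__name__ = f"Mlflow{base.__name__}"
    Wrapper.__qualname__ = f"{base.__name__}.Mlflow{base.__name__}"
    return Wrapper
```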