Galileo-Galilei · Galileo-Galilei · Nov 11, 2021 · Nov 9, 2021
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,7 @@
 
 ### Changed
 
+- :sparkles: :boom: The ``pipeline_ml_factory`` accepts 2 new arguments: ``log_model_kwargs`` (which will be passed *as is* to ``mlflow.pyfunc.log_model``) and ``kpm_kwargs`` (which will be passed *as is* to ``KedroPipelineModel``). This ensures perfect consistency with the mlflow API and offers new possibilities like saving the project source code alongside the model ([#67](https://github.com/Galileo-Galilei/kedro-mlflow/issues/67)). Note that the ``model_signature``, ``conda_env`` and ``model_name`` arguments are removed, and replaced respectively by ``log_model_kwargs["signature"]``, ``log_model_kwargs["conda_env"]`` and ``log_model_kwargs["artifact_path"]``.
 - :sparkles: :boom: The ``KedroPipelineModel`` custom mlflow model now accepts any kedro `Pipeline` as input (provided it has a single DataFrame input and a single output, because this is an mlflow limitation) instead of only ``PipelineML`` objects. This simplifies the API for users who want to customise the model logging ([#171](https://github.com/Galileo-Galilei/kedro-mlflow/issues/171)). `KedroPipelineModel.__init__` argument `pipeline_ml` is renamed `pipeline` to reflect this change.
 - :wastebasket: `kedro_mlflow.io.metrics.MlflowMetricsDataSet` is no longer deprecated because there is no alternative for now to log many metrics at the same time.
 

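For anyone updating call sites, the renaming described in the changelog entry can be sketched as a plain dictionary transformation. This is a minimal sketch, not part of the PR: `migrate_factory_kwargs` and `_RENAMED` are hypothetical names, and only the three old-to-new key pairs come from the changelog text above.

```python
from typing import Any, Dict

# Old keyword -> key expected inside ``log_model_kwargs`` (from the changelog entry).
_RENAMED = {
    "model_name": "artifact_path",
    "model_signature": "signature",
    "conda_env": "conda_env",
}


def migrate_factory_kwargs(old_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: move the removed arguments under ``log_model_kwargs``."""
    # Keep every argument that was not removed (e.g. ``input_name``).
    new_kwargs = {k: v for k, v in old_kwargs.items() if k not in _RENAMED}
    # Start from any ``log_model_kwargs`` the caller already provides.
    log_model_kwargs = dict(old_kwargs.get("log_model_kwargs", {}))
    for old_key, new_key in _RENAMED.items():
        if old_key in old_kwargs:
            log_model_kwargs[new_key] = old_kwargs[old_key]
    if log_model_kwargs:
        new_kwargs["log_model_kwargs"] = log_model_kwargs
    return new_kwargs


old_call = {
    "input_name": "instances",
    "model_name": "kedro_mlflow_tutorial",
    "model_signature": "auto",
}
new_call = migrate_factory_kwargs(old_call)
print(new_call)
```

The resulting dictionary is what would be unpacked into ``pipeline_ml_factory(training=..., inference=..., **new_call)`` after this change.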
diff --git a/docs/source/05_framework_ml/03_framework_solutions.md b/docs/source/05_framework_ml/03_framework_solutions.md
@@ -21,6 +21,7 @@ The use of ``pipeline_ml_factory`` is very straightforward, especially if you ha
 # hooks.py
 from kedro_mlflow_tutorial.pipelines.ml_app.pipeline import create_ml_pipeline
 
+
 class ProjectHooks:
     @hook_impl
     def register_pipelines(self) -> Dict[str, Pipeline]:
@@ -31,13 +32,10 @@ class ProjectHooks:
         training_pipeline_ml = pipeline_ml_factory(
             training=ml_pipeline.only_nodes_with_tags("training"),
             inference=ml_pipeline.only_nodes_with_tags("inference"),
-            input_name="instances"
+            input_name="instances",
         )
 
-        return {
-            "__default__": training_pipeline_ml
-        }
-
+        return {"__default__": training_pipeline_ml}
 ```
 
 > So, what? We have created a link between our two pipelines, but the gain is not obvious at first glance. The following 2 sections demonstrate that such a construction makes it possible to package and serve the inference pipeline automatically when executing the training one.
@@ -66,17 +64,14 @@ artifacts = pipeline_training.extract_pipeline_artifacts(catalog)
 input_data = catalog.load(pipeline_training.input_name)
 model_signature = infer_signature(model_input=input_data)
 
-kedro_model = KedroPipelineModel(
-    pipeline=pipeline_training,
-    catalog=catalog
-)
+kedro_model = KedroPipelineModel(pipeline=pipeline_training, catalog=catalog)
 
 mlflow.pyfunc.log_model(
     artifact_path="model",
     python_model=kedro_model,
     artifacts=artifacts,
     conda_env={"python": "3.7.0", "dependencies": ["kedro==0.16.5"]},
-    model_signature=model_signature
+    signature=model_signature,
 )
 ```
 

diff --git a/docs/source/05_framework_ml/04_example_project.md b/docs/source/05_framework_ml/04_example_project.md
@@ -11,33 +11,33 @@ If you don't want to read the entire explanations, here is a summary:
     # hooks.py
     from kedro_mlflow_tutorial.pipelines.ml_app.pipeline import create_ml_pipeline
 
-    ...
+    do_something()
+
 
     class ProjectHooks:
         @hook_impl
         def register_pipelines(self) -> Dict[str, Pipeline]:
 
-            ...
+            do_something()
 
             ml_pipeline = create_ml_pipeline()
             training_pipeline_ml = pipeline_ml_factory(
                 training=ml_pipeline.only_nodes_with_tags("training"),
                 inference=ml_pipeline.only_nodes_with_tags("inference"),
                 input_name="instances",
-                model_name="kedro_mlflow_tutorial",
-                conda_env={
-                    "python": 3.7,
-                    "dependencies": [f"kedro_mlflow_tutorial=={PROJECT_VERSION}"],
-                },
-                model_signature="auto",
+                log_model_kwargs=dict(
+                    artifact_path="kedro_mlflow_tutorial",
+                    conda_env={
+                        "python": 3.7,
+                        "dependencies": [f"kedro_mlflow_tutorial=={PROJECT_VERSION}"],
+                    },
+                    signature="auto",
+                ),
             )
 
-            ...
+            do_something()
 
-            return {
-                "training": training_pipeline_ml,
-                ...
-            }
+            return {"training": training_pipeline_ml}
     ```
 
 3. Persist your artifacts locally in the ``catalog.yml``

diff --git a/docs/source/07_python_objects/01_DataSets.md b/docs/source/07_python_objects/01_DataSets.md
@@ -14,7 +14,7 @@ my_dataset_to_version:
 
 or with additional parameters:
 
-```python
+```yaml
 my_dataset_to_version:
     type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
     data_set:
@@ -34,9 +34,11 @@ or with the python API:
 ```python
 from kedro_mlflow.io.artifacts import MlflowArtifactDataSet
 from kedro.extras.datasets.pandas import CSVDataSet
-csv_dataset = MlflowArtifactDataSet(data_set={"type": CSVDataSet,
-                                      "filepath": r"/path/to/a/local/destination/file.csv"})
-csv_dataset.save(data=pd.DataFrame({"a":[1,2], "b": [3,4]}))
+
+csv_dataset = MlflowArtifactDataSet(
+    data_set={"type": CSVDataSet, "filepath": r"/path/to/a/local/destination/file.csv"}
+)
+csv_dataset.save(data=pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
 ```
 
 ## Metrics `DataSets`
@@ -69,29 +71,33 @@ You can either only specify the flavor:
 from kedro_mlflow.io.models import MlflowModelLoggerDataSet
 from sklearn.linear_model import LinearRegression
 
-mlflow_model_logger=MlflowModelLoggerDataSet(flavor="mlflow.sklearn")
+mlflow_model_logger = MlflowModelLoggerDataSet(flavor="mlflow.sklearn")
 mlflow_model_logger.save(LinearRegression())
 ```
 
 Let's assume that this first model has been saved once, and you want to retrieve it (for prediction, for instance):
 
 ```python
-mlflow_model_logger=MlflowModelLoggerDataSet(flavor="mlflow.sklearn", run_id=<the-model-run-id>)
-my_linear_regression=mlflow_model_logger.load()
-my_linear_regression.predict(<data>) # will obviously fail if you have not fitted your model object first :)
+mlflow_model_logger = MlflowModelLoggerDataSet(
+    flavor="mlflow.sklearn", run_id="<the-model-run-id>"
+)
+my_linear_regression = mlflow_model_logger.load()
+my_linear_regression.predict(
+    data
+)  # will obviously fail if you have not fitted your model object first :)
 ```
 
 You can also specify some [logging parameters](https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.log_model):
 
 ```python
-mlflow_model_logger=MlflowModelLoggerDataSet(
+mlflow_model_logger = MlflowModelLoggerDataSet(
     flavor="mlflow.sklearn",
-     run_id=<the-model-run-id>,
-     save_args={
-         "conda_env": {"python": "3.7.0", , "dependencies": ["kedro==0.16.5"]},
-          "input_example": data.iloc[0:5,:]
-          }
-    )
+    run_id="<the-model-run-id>",
+    save_args={
+        "conda_env": {"python": "3.7.0", "dependencies": ["kedro==0.16.5"]},
+        "input_example": data.iloc[0:5, :],
+    },
+)
 mlflow_model_logger.save(LinearRegression().fit(data))
 ```
 
@@ -126,18 +132,21 @@ The use is very similar to MlflowModelLoggerDataSet, but that you specify a file
 from kedro_mlflow.io.models import MlflowModelSaverDataSet
 from sklearn.linear_model import LinearRegression
 
-mlflow_model_logger=MlflowModelSaverDataSet(flavor="mlflow.sklearn", filepath="path/to/where/you/want/model")
+mlflow_model_logger = MlflowModelSaverDataSet(
+    flavor="mlflow.sklearn", filepath="path/to/where/you/want/model"
+)
 mlflow_model_logger.save(LinearRegression().fit(data))
 ```
 
 The same arguments are available, plus an additional [`version` common to usual `AbstractVersionedDataSet`](https://kedro.readthedocs.io/en/stable/kedro.io.AbstractVersionedDataSet.html)
 
 ```python
-mlflow_model_logger=MlflowModelSaverDataSet(
+mlflow_model_logger = MlflowModelSaverDataSet(
     flavor="mlflow.sklearn",
     filepath="path/to/where/you/want/model",
-    version="<valid-kedro-version>")
-my_model= mlflow_model_logger.load()
+    version="<valid-kedro-version>",
+)
+my_model = mlflow_model_logger.load()
 ```
 
 and with the YAML API in the `catalog.yml`:

diff --git a/docs/source/07_python_objects/03_Pipelines.md b/docs/source/07_python_objects/03_Pipelines.md
@@ -13,17 +13,21 @@ Example within kedro template:
 
 from PYTHON_PACKAGE.pipelines import data_science as ds
 
+
 def create_pipelines(**kwargs) -> Dict[str, Pipeline]:
     data_science_pipeline = ds.create_pipeline()
-    training_pipeline = pipeline_ml_factory(training=data_science_pipeline.only_nodes_with_tags("training"), # or whatever your logic is for filtering
-                                            inference=data_science_pipeline.only_nodes_with_tags("inference"))
+    training_pipeline = pipeline_ml_factory(
+        training=data_science_pipeline.only_nodes_with_tags(
+            "training"
+        ),  # or whatever your logic is for filtering
+        inference=data_science_pipeline.only_nodes_with_tags("inference"),
+    )
 
     return {
         "ds": data_science_pipeline,
         "training": training_pipeline,
         "__default__": data_engineering_pipeline + data_science_pipeline,
     }
-
 ```
 
 Now each time you run ``kedro run --pipeline=training`` (provided you registered ``MlflowPipelineHook`` in your ``run.py``), the full inference pipeline will be registered as an mlflow model (with all the outputs produced by training as artifacts: the machine learning model, but also the *scaler*, *vectorizer*, *imputer*, or whatever object fitted on data you create in ``training`` and that is used in ``inference``).
@@ -55,24 +59,17 @@ model_signature = infer_signature(model_input=input_data)
 
 mlflow.pyfunc.log_model(
     artifact_path="model",
-    python_model=KedroPipelineModel(
-            pipeline=pipeline_training,
-            catalog=catalog
-        ),
+    python_model=KedroPipelineModel(pipeline=pipeline_training, catalog=catalog),
     artifacts=artifacts,
-    conda_env={"python": "3.7.0", , "dependencies": ["kedro==0.16.5"]},
-    model_signature=model_signature
+    conda_env={"python": "3.7.0", "dependencies": ["kedro==0.16.5"]},
+    signature=model_signature,
 )
 ```
 
 It is also possible to pass arguments to `KedroPipelineModel` to specify the runner or the copy_mode of MemoryDataSet for the inference Pipeline. This may be faster, especially for compiled models (e.g. keras, tensorflow), and more suitable for an API serving pattern.
 
 ```python
-KedroPipelineModel(
-            pipeline=pipeline_training,
-            catalog=catalog,
-            copy_mode="assign"
-        )
+KedroPipelineModel(pipeline=pipeline_training, catalog=catalog, copy_mode="assign")
 ```
 
 Available `copy_mode` values are "assign", "copy" and "deepcopy". It is possible to pass a dictionary to specify a different copy mode for each dataset.
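The practical difference between the three `copy_mode` values can be illustrated with the standard `copy` module. This is only a sketch of the semantics the names suggest: `copy_with_mode` is a hypothetical stand-in, not the function kedro or kedro-mlflow actually calls.

```python
import copy
from typing import Any


def copy_with_mode(data: Any, copy_mode: str) -> Any:
    """Illustrative stand-in for how a memory dataset could hand data to a node."""
    if copy_mode == "assign":
        return data  # same object, no copy: fastest, but mutations are shared
    if copy_mode == "copy":
        return copy.copy(data)  # shallow copy: new container, shared elements
    if copy_mode == "deepcopy":
        return copy.deepcopy(data)  # fully independent copy: safest, slowest
    raise ValueError(f"Invalid copy_mode: {copy_mode!r}")


weights = {"layers": [1, 2, 3]}
assigned = copy_with_mode(weights, "assign")
shallow = copy_with_mode(weights, "copy")
deep = copy_with_mode(weights, "deepcopy")

print(assigned is weights)
print(shallow["layers"] is weights["layers"])
print(deep["layers"] is weights["layers"])
```

This is why "assign" tends to suit large or compiled model objects: nothing is duplicated between nodes, at the cost of any in-place mutation being visible everywhere.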
diff --git a/kedro_mlflow/framework/hooks/pipeline_hook.py b/kedro_mlflow/framework/hooks/pipeline_hook.py
@@ -1,11 +1,9 @@
 import logging
-import sys
 from pathlib import Path
 from tempfile import TemporaryDirectory
-from typing import Any, Dict, Union
+from typing import Any, Dict
 
 import mlflow
-import yaml
 from kedro.framework.hooks import hook_impl
 from kedro.io import DataCatalog
 from kedro.pipeline import Pipeline
@@ -23,7 +21,6 @@
 )
 from kedro_mlflow.mlflow import KedroPipelineModel
 from kedro_mlflow.pipeline.pipeline_ml import PipelineML
-from kedro_mlflow.utils import _parse_requirements
 
 
 class MlflowPipelineHook:
@@ -179,29 +176,29 @@ def after_pipeline_run(
             if isinstance(pipeline, PipelineML):
                 with TemporaryDirectory() as tmp_dir:
                     # This will be removed at the end of the context manager,
-                    # but we need to log in mlflow beforeremoving the folder
+                    # but we need to log in mlflow before removing the folder
                     kedro_pipeline_model = KedroPipelineModel(
                         pipeline=pipeline.inference,
                         catalog=catalog,
                         input_name=pipeline.input_name,
-                        **pipeline.kwargs,
+                        **pipeline.kpm_kwargs,
                     )
                     artifacts = kedro_pipeline_model.extract_pipeline_artifacts(
                         parameters_saving_folder=Path(tmp_dir)
                     )
 
-                    if pipeline.model_signature == "auto":
-                        input_data = catalog.load(pipeline.input_name)
-                        model_signature = infer_signature(model_input=input_data)
-                    else:
-                        model_signature = pipeline.model_signature
+                    log_model_kwargs = pipeline.log_model_kwargs.copy()
+                    model_signature = log_model_kwargs.pop("signature", None)
+                    if isinstance(model_signature, str):
+                        if model_signature == "auto":
+                            input_data = catalog.load(pipeline.input_name)
+                            model_signature = infer_signature(model_input=input_data)
 
                     mlflow.pyfunc.log_model(
-                        artifact_path=pipeline.model_name,
                         python_model=kedro_pipeline_model,
                         artifacts=artifacts,
-                        conda_env=_format_conda_env(pipeline.conda_env),
                         signature=model_signature,
+                        **log_model_kwargs,
                     )
             # Close the mlflow active run at the end of the pipeline to avoid interactions with further runs
             mlflow.end_run()
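The new signature handling in this hunk can be read in isolation as follows. This is a sketch under stated assumptions: `split_signature`, `load_input` and `infer` are hypothetical names standing in for the hook body, `catalog.load(pipeline.input_name)` and `infer_signature` respectively, so that the logic runs without kedro or mlflow installed.

```python
from typing import Any, Callable, Dict, Tuple


def split_signature(
    log_model_kwargs: Dict[str, Any],
    load_input: Callable[[], Any],
    infer: Callable[[Any], Any],
) -> Tuple[Any, Dict[str, Any]]:
    """Return (signature, remaining kwargs) without mutating the caller's dict."""
    remaining = dict(log_model_kwargs)  # shallow copy, like ``.copy()`` in the hook
    signature = remaining.pop("signature", None)
    if isinstance(signature, str) and signature == "auto":
        # Only the literal string "auto" triggers inference from the input data.
        signature = infer(load_input())
    return signature, remaining


kwargs = {"artifact_path": "model", "signature": "auto"}
signature, rest = split_signature(
    kwargs,
    load_input=lambda: [1, 2],
    infer=lambda data: f"inferred from {len(data)} rows",
)
print(signature, rest)
```

Popping `"signature"` from a copy matters because the remaining dictionary is unpacked into `mlflow.pyfunc.log_model` next to an explicit `signature=` argument; leaving the key in place would pass the same keyword twice. Note that any string other than `"auto"` is passed through unchanged.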
@@ -278,59 +275,3 @@ def _generate_kedro_command(
 
     kedro_cmd = " ".join(cmd_list)
     return kedro_cmd
-
-
-def _format_conda_env(
-    conda_env: Union[str, Path, Dict[str, Any]] = None
-) -> Dict[str, Any]:
-    """Best effort to get dependecies of the project.
-
-    Keyword Arguments:
-        conda_env {Union[str, Path, Dict[str, Any]]} -- It can be either :
-            - a path to a "requirements.txt": In this case
-            the packages are parsed and a conda env with
-            your current python_version and these dependencies is returned
-            - a path to an "environment.yml" : data is loaded and used as they are
-            - a Dict : used as the environment
-            - None: a base conda environment with your current python version and your project version at training time.
-            Defaults to None.
-
-    Returns:
-        Dict[str, Any] -- A dictionary which contains all informations to dump it to a conda environment.yml file.
-    """
-    python_version = ".".join(
-        [
-            str(sys.version_info.major),
-            str(sys.version_info.minor),
-            str(sys.version_info.micro),
-        ]
-    )
-    if isinstance(conda_env, str):
-        conda_env = Path(conda_env)
-
-    if isinstance(conda_env, Path):
-        if conda_env.suffix in (".yml", ".yaml"):
-            with open(conda_env, mode="r") as file_handler:
-                conda_env = yaml.safe_load(file_handler)
-        elif conda_env.suffix in (".txt"):
-            conda_env = {
-                "python": python_version,
-                "dependencies": _parse_requirements(conda_env),
-            }
-    elif conda_env is None:
-        conda_env = {"python": python_version}
-    elif isinstance(conda_env, dict):
-        return conda_env
-    else:
-        raise ValueError(
-            """Invalid conda_env. It can be either :
-            - a Dict : used as the environment without control
-            - None (default: {None}) : Only the python vresion will be stored.
-            - a path to a "requirements.txt": In this case
-            the packages are parsed and a conda env with
-            your current python_version and these dependencies is returned
-            - a path to an "environment.yml" : data is loaded and used as they are
-            """
-        )
-
-    return conda_env
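With `_format_conda_env` removed, callers who used to pass a path to a `requirements.txt` now have to build the `conda_env` value themselves before putting it in `log_model_kwargs`. A minimal sketch of that step is below, assuming the same dictionary shape the removed helper returned. `parse_requirements_text` and `conda_env_from_requirements` are hypothetical names, the parsing is deliberately simplistic, and whether mlflow accepts this shape as a valid conda environment is not verified here.

```python
import sys
from typing import Any, Dict, List


def parse_requirements_text(text: str) -> List[str]:
    """Hypothetical parser: keep lines that are not blank, comments or pip options."""
    requirements = []
    for raw_line in text.splitlines():
        # Simplification: everything after "#" is treated as a comment,
        # which would also cut URL fragments such as "#egg=name".
        line = raw_line.split("#", 1)[0].strip()
        if line and not line.startswith("-"):  # skip options such as "-r other.txt"
            requirements.append(line)
    return requirements


def conda_env_from_requirements(text: str) -> Dict[str, Any]:
    """Build the {"python": ..., "dependencies": [...]} shape of the removed helper."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    return {"python": python_version, "dependencies": parse_requirements_text(text)}


env = conda_env_from_requirements(
    "kedro==0.16.5\n# a comment\n\nmlflow>=1.0  # tracking\n-r other.txt\n"
)
print(env)
```

The result would then be passed as `log_model_kwargs={"conda_env": env, ...}` to `pipeline_ml_factory`.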