PARTIAL #6 - Fix typo and refactor documentation organisation
Galileo-Galilei committed Nov 26, 2020
1 parent 63dcd50 commit 9957d4d
Showing 36 changed files with 420 additions and 240 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,10 @@
### Fixed

- Fix `TypeError: unsupported operand type(s) for /: 'str' and 'str'` when using `MlflowArtifactDataSet` with `MlflowModelSaverDataSet` ([#116](https://github.com/Galileo-Galilei/kedro-mlflow/issues/116))
- Fix various docs typo

### Changed
- Refactor doc structure for readability

## [0.4.0] - 2020-11-03

9 changes: 6 additions & 3 deletions docs/index.rst
@@ -10,9 +10,12 @@ Welcome to kedro-mlflow's documentation!
:maxdepth: 4

Introduction <source/01_introduction/index.rst>
Hello world example <source/02_hello_world_example/index.rst>
Getting Started <source/03_tutorial/index.rst>
Python objects <source/05_python_objects/index.rst>
Installation <source/02_installation/index.rst>
Getting Started <source/03_getting_started/index.rst>
Advanced versioning of machine learning experimentations <source/04_experimentation_tracking/index.rst>
A comprehensive framework to deliver machine learning pipelines <source/05_framework_ml/index.rst>
Advanced capabilities <source/06_advanced_use/index.rst>
Python objects <source/07_python_objects/index.rst>


Indices and tables
48 changes: 27 additions & 21 deletions docs/source/01_introduction/01_introduction.md
@@ -1,17 +1,19 @@
# Introduction

## What is ``Kedro``?

``Kedro`` is a python package which facilitates the prototyping of data pipelines. It aims at implementing software engineering best practices (separation between I/O and compute, abstraction, templating...). It is specifically useful for machine learning projects since it provides within the same interface both interactive objects for the exploration phase and *Command Line Interface* (CLI) and configuration files for the production phase. This makes the transition from exploration to production as smooth as possible.
``Kedro`` is a Python package which facilitates the prototyping of data pipelines. It aims at enforcing software engineering best practices (separation between I/O and compute, abstraction, templating...). It is specifically useful for machine learning projects since it provides, within the same interface, interactive objects for the exploration phase, and a *Command Line Interface* (CLI) and configuration files for the production phase. This makes the transition from exploration to production as smooth as possible.

For more details, see [Kedro's official documentation](https://kedro.readthedocs.io/en/stable/01_introduction/01_introduction.html).

## What is ``Mlflow``?

``Mlflow`` is a library which helps managing the lifecycle of machine learning models. Mlflow provides 4 modules:
- ``Mlflow Tracking``: This modules focuses on experiment versioning. The goal is to store all the objects needed to reproduce any code execution. This includes code through version control, but also parameters and artifacts (i.e objects fitted on data like encoders, binarizers...). These elements vary wildly during machine learning experimentation phase. ``Mlflow`` also enable to track metrics to evaluate runs, and provides a *User Interface* (UI) to browse the different runs and compare them.
- ``Mlflow Projects``: This module provides a configuration files and CLI to enable reproducible execution of pipelines in production phase.
- ``Mlflow Models``: This module defines a standard way for packaging machine learning models, and provides built-in ways to serve registered models. Such standardization enable to serve these models across a wide range of tools.
- ``Mlflow Model Registry``: This modules aims at monitoring deployed models. The registry manages the transition between different versions of the same model (when the dataset is retrained on new data, or when parameters are updated) while it is in production.
``Mlflow`` is a library which manages the lifecycle of machine learning models. It provides four modules:

- [``Mlflow Tracking``](https://www.mlflow.org/docs/latest/tracking.html): This module focuses on experiment versioning. Its goal is to store all the objects needed to reproduce any code execution. This includes code through version control, but also parameters and artifacts (i.e. objects fitted on data like encoders, binarizers...). These elements vary wildly during the machine learning experimentation phase. ``Mlflow`` also enables tracking metrics to evaluate runs, and provides a *User Interface* (UI) to browse the different runs and compare them.
- [``Mlflow Projects``](https://www.mlflow.org/docs/latest/projects.html): This module provides a configuration file and a CLI to enable reproducible execution of pipelines in the production phase.
- [``Mlflow Models``](https://www.mlflow.org/docs/latest/models.html): This module defines a standard way of packaging machine learning models, and provides built-in ways to serve registered models. Such standardization enables serving these models across a wide range of tools.
- [``Mlflow Model Registry``](https://www.mlflow.org/docs/latest/model-registry.html): This module aims at monitoring deployed models. The registry manages the transition between different versions of the same model (when the dataset is retrained on new data, or when parameters are updated) while it is in production.

For more details, see [Mlflow's official documentation](https://www.mlflow.org/docs/latest/index.html).

@@ -24,55 +26,59 @@ While ``Kedro`` and ``Mlflow`` do not compete in the same field, they provide so
|I/O abstraction | various ``AbstractDataSet`` | N/A |
|I/O configuration files |- ``catalog.yml`` <br> - ``parameters.yml`` |``MLproject``|
|Compute abstraction|- ``Pipeline`` <br> - ``Node``| N/A |
|Compute configuration files|- ``pipeline.py`` <br> - ``run.py``| `MLproject` |
|Compute configuration files|- ``hooks.py`` <br> - ``run.py``| `MLproject` |
|Parameters and data versioning| - ``Journal`` <br> - ``AbstractVersionedDataSet`` |- ``log_metric``<br> - ``log_artifact``<br> - ``log_param``|
|Cli execution|command ``kedro run``|command ``mlflow run``|
|Code packaging|command ``kedro package``|N/A|
|Model packaging|N/A|- ``Mlflow Models`` (``mlflow.XXX.log_model`` functions) <br> - ``Mlflow Flavours``|
|Model service|N/A |commands ``mlflow models {serve/predict/deploy}``|

We can draw the following conclusions from the chart, discussed hereafter.
We discuss hereafter how the two libraries overlap on these functionalities and where they complement each other.

### Configuration and prototyping: Kedro 1 - 0 Mlflow

``Mlflow`` and ``Kedro`` essentially overlap in the way they offer dedicated configuration files for running the pipeline from the CLI. However:

- ``Mlflow`` provides a single configuration file (the ``MLProject``) where all elements are declared (data, parameters and pipelines). Its goal is mainly to enable CLI execution of the project, but it is not very flexible. In my opinion, this file is **production oriented** and is not really intended for use during exploration.
- ``Kedro`` offers several files (``catalog.yml``, ``parameters.yml``, ``pipeline.py``) and their associated abstractions (``AbstractDataSet``, ``DataCatalog``, ``Pipeline`` and ``node`` objects). ``Kedro`` is much more opinionated: each object has a dedicated place (and only one!) in the template. This makes the framework both **exploration and production oriented**. The downside is that it could make the learning curve a bit steeper, since a newcomer has to learn all the ``Kedro`` specifications. It also provides a ``kedro-viz`` plugin to visualize the DAG interactively, which is particularly handy in medium-to-big projects.


|**``Kedro`` is a clear winner here, since it provides more functionnalities than ``Mlflow``. It handles very well _by design_ the exploration phase of data science projects when Mlflow is less flexible.**|
|:-|
> **``Kedro`` is a clear winner here, since it provides more functionalities than ``Mlflow``. It handles the exploration phase of data science projects very well _by design_, whereas Mlflow is less flexible.**
### Versioning: Kedro 1 - 1 Mlflow
The ``Kedro`` [``Journal`` aims at reproducibility](https://kedro.readthedocs.io/en/stable/04_user_guide/13_journal.html), but is not focused on machine learning. The Journal keeps track of two elements:
- the CLI arguments , including *on the fly* parameters. This makes the command used to run the pipeline fully reproducible.

The ``Kedro`` [``Journal`` aims at reproducibility](https://kedro.readthedocs.io/en/stable/04_user_guide/13_journal.html), but is not focused on machine learning. The `Journal` keeps track of two elements:

- the CLI arguments, including *on the fly* parameters. This makes the command used to run the pipeline fully reproducible.
- the ``AbstractVersionedDataSet`` for which versioning is activated. It consists of copying the data whose ``versioned`` argument is ``True`` when the ``save`` method of the ``AbstractVersionedDataSet`` is called.
This approach suffers from two main drawbacks:
- the configuration is assumed immutable (including parameters), which is not realistic ni machine learning projects where they are very volatile. To fix this, the ``git sha`` has been recently added to the ``Journal``, but it has still some bugs in my experience (including the fact that the current ``git sha`` is logged even if the pipeline is ran with uncommitted change, which prevents reproducibility). This is still recent and will likely evolve in the future.
- there is no support for browsing old runs, which prevents [cleaning the database with old and unused datasets](https://github.com/quantumblacklabs/kedro/issues/406), compare runs between each other...
- the configuration is assumed immutable (including parameters), which is not realistic in machine learning projects where it is very volatile. To fix this, the ``git sha`` has recently been added to the ``Journal``, but it still has some bugs in my experience (including the fact that the current ``git sha`` is logged even if the pipeline is run with uncommitted changes, which prevents reproducibility). This is still recent and will likely evolve in the future.
- there is no support for browsing old runs, which prevents [cleaning the database of old and unused datasets](https://github.com/quantumblacklabs/kedro/issues/406), comparing runs with each other...

On the other hand, ``Mlflow``:
- distinguishes between artifacts (i.e. any data file), metrics (integers that may evolve over time) and parameters. The logging is very straightforward since there is a one-liner function for logging the desired type. This separation makes further manipulation easier.

- distinguishes between artifacts (i.e. any data file), metrics (numeric values that may evolve over time) and parameters. The logging is very straightforward since there is a one-liner function for each type. This separation makes further manipulation easier.
- offers a way to configure the logging in a database through the ``mlflow_tracking_uri`` parameter. This database-like logging comes with easy [querying of different runs through a client](https://www.mlflow.org/docs/latest/python_api/mlflow.tracking.html#mlflow.tracking.MlflowClient) (for instance "find the most recent run with a metric at least above a given threshold" is immediate with ``Mlflow`` but hacky in ``Kedro``).
- [comes with a *User Interface* (UI)](https://mlflow.org/docs/latest/tracking.html#id7) which enable to browse / filter / sort the runs, display graphs of the metrics, render plots... This make the run management much easier than in ``Kedro``.
- [comes with a *User Interface* (UI)](https://mlflow.org/docs/latest/tracking.html#id7) which enables browsing / filtering / sorting the runs, displaying graphs of the metrics, rendering plots... This makes run management much easier than in ``Kedro``.
- has a command to reproduce exactly the run from a given ``git sha``, [which is not possible in ``Kedro``](https://github.com/quantumblacklabs/kedro/issues/297).

|**``Mlflow`` is a clear winner here, because _UI_ and _run querying_ are must-have for machine learning projects. It is more mature than ``Kedro`` for versioning and more focused on machine learning.**|
|:-|
> **``Mlflow`` is a clear winner here, because the _UI_ and _run querying_ are must-haves for machine learning projects. It is more mature than ``Kedro`` for versioning and more focused on machine learning.**
### Model packaging and service: Kedro 1 - 2 Mlflow

``Kedro`` offers a way to package the code to make the pipelines callable, but does not specifically manage machine learning models.

``Mlflow`` offers a way to store machine learning models with a given "flavor", which is the minimal amount of information necessary to use the model for prediction:

- a configuration file
- all the artifacts, i.e. the necessary data for the model to run (including encoder, binarizer...)
- a loader
- a conda configuration through a ``conda.yaml`` file

When a stored model meets these requirements, ``Mlflow`` provides built-in tools to serve it (as an API or for batch prediction) on many machine learning platforms (Microsoft Azure ML, Amazon Sagemaker, Apache Spark UDF) and locally.

|**``Mlflow`` is currently the only tool which adresses model serving. This is currently not the top priority for ``Kedro``, but may come in the future ([through Kedro Server maybe?](https://github.com/quantumblacklabs/kedro/issues/143))**|
|:-|
> **``Mlflow`` is currently the only tool which addresses model serving. This is currently not the top priority for ``Kedro``, but may come in the future ([through Kedro Server maybe?](https://github.com/quantumblacklabs/kedro/issues/143))**
### Conclusion: Use Kedro and add Mlflow for machine learning projects

In my opinion, ``Kedro``'s will to enforce software engineering best practice makes it really useful for machine learning teams. It is extremely well documented and the support is excellent, which makes it very user friendly even for people with no CS background. However, it lacks some machine learning-specific functionalities (better versioning, model service), and it is where ``Mlflow`` fills the gap.
In my opinion, ``Kedro``'s drive to enforce software engineering best practices makes it really useful for machine learning teams. It is extremely well documented and the support is excellent, which makes it very user friendly even for people with no computer science background. However, it lacks some machine learning-specific functionalities (better versioning, model serving), and this is where ``Mlflow`` fills the gap.