Create a docs page about adding code beyond starter files #3852

Closed
Changes from 28 commits
31 commits
5f4330f
Add a tutorial on code beyond starter files
yury-fedotov May 6, 2024
036cbe0
Mention change in RELEASE.md
yury-fedotov May 6, 2024
02f0274
Address Vale comments on UK endings
yury-fedotov May 6, 2024
5875e2d
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov May 10, 2024
4d7a35a
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
ca00b72
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
ff01643
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
b858406
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
144783b
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
cf3a65f
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
729dc1a
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
2d22bbf
Merge branch 'refs/heads/main' into docs/code-beyond-starter-files
yury-fedotov May 17, 2024
a58ff32
Link deepdives to a list of examples
yury-fedotov May 17, 2024
2ed9639
Merge branch 'refs/heads/main' into docs/code-beyond-starter-files
yury-fedotov May 25, 2024
325489d
Revert weird MLflow release note edit
yury-fedotov May 25, 2024
1853e29
Remove note on changing registry location
yury-fedotov May 25, 2024
9fc168e
Simplify domain logic comment
yury-fedotov May 25, 2024
50a5bba
Replace tp.Dict by dict
yury-fedotov May 25, 2024
3024560
Remove pyproject.toml from monorepo tree example
yury-fedotov May 25, 2024
0df55ba
Replace historical and inference as pipeline split example
yury-fedotov May 25, 2024
a9f0e3a
Add a note about find_pipelines()
yury-fedotov May 25, 2024
c02309c
Remove article before find pipelines
yury-fedotov May 25, 2024
6ee9ad2
Apply suggestions from code review
merelcht Jul 9, 2024
8e811cf
Merge branch 'main' into docs/code-beyond-starter-files
merelcht Jul 9, 2024
fd96088
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Jul 11, 2024
4310cf6
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Jul 18, 2024
0e3dcf7
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Jul 22, 2024
fe19a45
Merge branch 'main' into docs/code-beyond-starter-files
noklam Aug 5, 2024
c20e741
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Aug 28, 2024
a508fe0
Merge branch 'kedro-org:main' into docs/code-beyond-starter-files
yury-fedotov Sep 7, 2024
52ed71e
Implement Nok's comments re: utility functions
yury-fedotov Sep 7, 2024
2 changes: 2 additions & 0 deletions RELEASE.md
@@ -36,6 +36,7 @@
## Documentation changes
* Improved documentation for configuring dataset parameters in the data catalog
* Extended documentation with an example of logging customisation at runtime
* Added a guide on extending a Kedro project beyond files generated by default

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
@@ -59,6 +60,7 @@ Many thanks to the following Kedroids for contributing PRs to this release:

## Documentation changes
* Improved documentation for custom starters
* Added a new section on deploying Kedro project on AWS Airflow MWAA
* Added a new docs section on deploying Kedro project on AWS Airflow MWAA
* Detailed instructions on using `globals` and `runtime_params` with the `OmegaConfigLoader`

136 changes: 136 additions & 0 deletions docs/source/kedro_project_setup/code_beyond_starter_files.md
@@ -0,0 +1,136 @@
# Adding code beyond starter files

After you [create a Kedro project](../get_started/new_project.md) and
[add a pipeline](../tutorial/create_a_pipeline.md), you will notice that Kedro generates
boilerplate files such as `nodes.py`, `pipeline.py`, and `pipeline_registry.py`.

While those may be enough for a small project, they become large and hard to
read and collaborate on as your codebase grows.
Those files also sometimes make new users think that Kedro requires code
to be located in those exact starter files, which is not true.

This section explains what Kedro requires when you organise your project code
into files and modules.
It also provides examples of common scenarios such as sharing utilities between
pipelines and using Kedro in a monorepo setup.

## Where Kedro expects code to be located

The only technical constraint for arranging code in the project is that the `pipeline_registry.py`
and `settings.py` files must be located in the `<your_project>/src/<your_project>` directory, which is where
they are created by default.

`pipeline_registry.py` must have a `register_pipelines()` function that returns a `dict[str, Pipeline]`
mapping from pipeline name to a corresponding `Pipeline` object.

Other than that, **Kedro does not impose any constraints on where you should keep files with
`Pipeline`s, `Node`s, or functions wrapped by `node`**.

This being the only constraint means that you can, for example:

* Add a `utils.py` file to a pipeline folder and import utilities used by multiple
functions in `nodes.py` from there.

Contributor

We usually avoid utils.py as much as possible, as such modules become the bin for everything. It's ironic because Kedro does have a utils module left over from years ago. It hasn't been growing though, as we believe it's better to have explicit modules.

Contributor Author

@noklam how would you then call a .py file that is not nodes.py or pipeline.py for the purpose of this example? I was thinking of dataframe_utils.py, but didn't like it because it adds to the impression that Kedro is only useful in data processing projects, which isn't true.

Contributor

There are many things that I don't like about utils (just me, not a team consensus). The purpose is ill-defined and it often becomes the place that people dump code into without thinking.

Even if we go with utils, I would strip the _utils suffix; it feels redundant to have pandas_utils.py under utils.py. Then in the code I would probably do from <package> import utils. When I need to use it, I will use utils.pandas.func to make it clear that this is a utils namespace and not pandas.

visualisation_utils.py could just be the visualisation module itself.
(all above are subjective)

I think it would be great to first introduce the principle of shared modules and the factors to consider. Then you can show this example. https://kedro-org.slack.com/archives/C03RKP2LW64/p1716912123397259
@datajoely do you have thoughts about this?

Contributor Author

  • Comments re: naming utils modules and importing them - addressed ✔️
  • Re: introducing the principles, the hyperlink doesn't work for me, leads just to the questions channel.
* [Share modules between pipelines](#sharing-modules-between-pipelines).
Those could be utility functions, or a standalone module responsible for
the domain-specific logic.

Contributor

This would be really beneficial if we have something to showcase, maybe there are some projects in awesome-kedro that we can link to?

Even just the tree structure of a project would be great.

Contributor Author

Isn't that exactly what I'm adding there?

[Screenshot 2024-05-25 at 12 47 31 PM]

Contributor

The example is good, but I think we can link https://github.com/kedro-org/awesome-kedro/blob/master/README.md#example-projects to direct people to more examples. It's also helpful to see real code.

Contributor Author

I manually went through all Kedro projects listed there, and I don't think any of them implements shared things between pipelines in a way that would be a good example to follow here. The closest one is this one: https://github.com/pablovdcf/TFM_HADO_Cares/tree/main/hado/src/hado

* [Use Kedro in a monorepo setup](#kedro-project-in-a-monorepo-setup) if there are
software components independent of Kedro that you want to keep together in the version control system.
* Delete or rename the default `nodes.py` file, or split it into multiple files or modules.
* If you have multiple large `Pipeline` objects defined in a single `pipeline.py`,
split them into separate `.py` files. For example, in the `data_processing` pipeline
you may want to have `cleaning_pipeline.py` and `merging_pipeline.py`.
* Instead of registering many pipelines in the `register_pipelines()` function one by one,
create a few `dict[str, Pipeline]` objects in different places of the project
and then make `register_pipelines()` return a union of those.
Contributor

what about find_pipelines()?

Contributor Author

Added:

While Kedro features a [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery),
for large projects you may want a finer control and register pipelines manually.

Contributor

My questions are:

  • is historical_pipeline.py and inference_pipelines.py better than pipelines/historical/pipeline.py (the modular pipeline structure) that Kedro usually promotes?

I think this is a viable alternative, but if this is in the docs instead of a blog, I would probably change the narrative to: there is an alternative of registering pipelines manually, explaining that pipeline.py is just a convention with which find_pipelines() works automatically. Users still have the option to register manually if desired.

@astrojuanlu thoughts?

```{note}
While Kedro features [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery),
for large projects you may want finer control and to register pipelines manually.
```
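
The registry union mentioned in the list above can be sketched as follows. This is only an illustration: `merge_registries` is a hypothetical helper, not part of Kedro, and the string values stand in for real `Pipeline` objects so that the snippet runs without a Kedro installation.

```python
from collections.abc import Mapping
from typing import TypeVar

P = TypeVar("P")  # stands in for kedro.pipeline.Pipeline


def merge_registries(*registries: Mapping[str, P]) -> dict[str, P]:
    """Combine several name -> pipeline mappings, rejecting duplicate names."""
    merged: dict[str, P] = {}
    for registry in registries:
        duplicates = merged.keys() & registry.keys()
        if duplicates:
            raise ValueError(f"Duplicate pipeline names: {sorted(duplicates)}")
        merged.update(registry)
    return merged


# Hypothetical registries that would live next to the pipelines they describe.
# In a real project the values would be `Pipeline` objects, not strings.
processing_pipelines = {"cleaning": "<cleaning pipeline>", "merging": "<merging pipeline>"}
modelling_pipelines = {"training": "<training pipeline>"}


def register_pipelines() -> dict[str, str]:
    """Return the union of the registries defined across the project."""
    return merge_registries(processing_pipelines, modelling_pipelines)
```

Raising on duplicate names is a design choice of this sketch; a plain `{**a, **b}` merge would silently let the later registry win.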

## Common codebase extension scenarios

This section provides examples of how you can handle some common cases of adding more
code to or around your Kedro project.
The examples are by no means the only ways to handle these scenarios;
they serve illustrative purposes only.

### Sharing modules between pipelines

You will often have functions that several pipelines need to import.
To keep them as part of a Kedro project, **create a module (for example, `utils`) at the same
level as the `pipelines` folder**, and organise the functionalities there:

```text
├── conf
├── data
├── notebooks
└── src
    ├── my_project
    │   ├── __init__.py
    │   ├── __main__.py
    │   ├── pipeline_registry.py
    │   ├── settings.py
    │   ├── pipelines
    │   └── utils                       <-- Create a module to store your utilities
    │       ├── __init__.py             <-- Required to import from it
    │       ├── pandas_utils.py         <-- Put a file with utility functions here
    │       ├── dictionary_utils.py     <-- Or a few files
    │       └── visualisation_utils     <-- Or sub-modules to organise even more utilities
    └── tests
```

For example, to import the function `find_common_keys` from `dictionary_utils.py`:

```python
from my_project.utils.dictionary_utils import find_common_keys
```
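
For illustration, `dictionary_utils.py` could contain something like the sketch below. The source names only the function, so the signature and body here are assumptions about what such a utility might do.

```python
from collections.abc import Mapping


def find_common_keys(*mappings: Mapping[str, object]) -> set[str]:
    """Return the keys present in every mapping passed in (assumed behaviour)."""
    if not mappings:
        return set()
    common = set(mappings[0])
    for mapping in mappings[1:]:
        # Iterating a mapping yields its keys, so this keeps shared keys only.
        common.intersection_update(mapping)
    return common
```

Any pipeline's `nodes.py` can then import the function from the shared module instead of keeping its own copy.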

```{note}
For an IDE to resolve imports like this, you need an editable installation
of the Kedro project in your virtual environment.
To do that, run `pip install -e <root-of-kedro-project>`. The easiest way is
to `cd` to the root of your Kedro project and run `pip install -e .`.
```
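
One way to check that the installation worked is to ask the interpreter whether it can find the package. The helper below is a hypothetical sketch using only the standard library; `my_project` is the example package name from the tree above.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Report whether a top-level package can be found by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Run this with the virtual environment's interpreter, from a directory
    # other than `src/`, so that the current directory does not mask the result.
    print(is_importable("my_project"))
```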

### Kedro project in a monorepo setup

The way a Kedro project is generated may give the impression that it must
sit at the root of a `git` repo. This is not true: just like you can combine
multiple Python packages in a single repo, you can combine multiple Kedro projects,
or a Kedro project with other parts of your project's software stack.

```{note}
The practice of combining multiple, often unrelated software components in a single version
control repository is not specific to Python and is called [_**monorepo design**_](https://monorepo.tools/).
```

A common situation is that a software product built by a team has components that
are clearly separable from the Kedro project.

Let's use **a recommendation tool for production equipment operators** as an example.
This example consists of three parts:

| **#** | **Part** | **Considerations** |
|-------|----------|--------------------|
| 1 | An ML model, or more precisely, a workflow to prepare the data, train an estimator, and ship it to some registry | <ul> <li>Kedro fits well here, as it lets you develop those pipelines in a modular and extensible way.</li> </ul> |
| 2 | An optimiser that leverages the ML model and implements domain business logic to derive recommendations | <ul> <li>A good design consideration might be to make it independent of the UI framework.</li> </ul> |
| 3 | User interface (UI) application | <ul> <li>This can be a [`plotly`](https://plotly.com/python/) or [`streamlit`](https://streamlit.io/) dashboard.</li> <li>Or even a full-fledged front-end app built with a JS framework like [`React`](https://react.dev/).</li> <li>Regardless, this component may know how to access the ML model, but it should probably not know anything about how the model was trained or whether Kedro was involved.</li> </ul> |

A suggested solution in this case would be a **monorepo** design. Below is an example of such a project structure:

```text
└── repo_root
    ├── packages
    │   ├── kedro_project           <-- A Kedro project for ML model training.
    │   │   ├── conf
    │   │   ├── data
    │   │   ├── notebooks
    │   │   ├── ...
    │   ├── optimizer               <-- Standalone package.
    │   └── dashboard               <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline.
    ├── ...
    ├── ...                         Examples of what you may want in the repo root:
    ├── requirements.txt            <-- Linters, code formatters... Not dependencies of packages.
    ├── ruff.toml                   <-- Or configs for other tools that you want to share between packages.
    └── ...
```
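
The separation between the components can be sketched in code. In the hypothetical example below, the `optimizer` package depends only on an object with a `predict` method, so it needs no knowledge of how the model was trained. The names `Predictor`, `recommend`, and `SumModel` are assumptions made for illustration and do not come from Kedro or from the project above.

```python
from collections.abc import Iterable, Sequence
from typing import Protocol


class Predictor(Protocol):
    """Anything with a `predict` method; how it was trained is irrelevant here."""

    def predict(self, features: Sequence[float]) -> float: ...


def recommend(model: Predictor, candidates: Iterable[Sequence[float]]) -> Sequence[float]:
    """Return the candidate settings with the highest predicted score."""
    return max(candidates, key=model.predict)


class SumModel:
    """Toy stand-in for a trained model loaded from a registry."""

    def predict(self, features: Sequence[float]) -> float:
        return sum(features)
```

A dashboard could call `recommend(model, candidates)` with whatever model object it loads, without importing anything from the Kedro project.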
1 change: 1 addition & 0 deletions docs/source/kedro_project_setup/index.md
@@ -6,4 +6,5 @@
dependencies
session
settings
code_beyond_starter_files
```