Create a docs page about adding code beyond starter files #3852

Closed
Changes from 4 commits
31 commits
5f4330f
Add a tutorial on code beyond starter files
yury-fedotov May 6, 2024
036cbe0
Mention change in RELEASE.md
yury-fedotov May 6, 2024
02f0274
Address Vale comments on UK endings
yury-fedotov May 6, 2024
5875e2d
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov May 10, 2024
4d7a35a
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
ca00b72
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
ff01643
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
b858406
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
144783b
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
cf3a65f
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
729dc1a
Update docs/source/kedro_project_setup/code_beyond_starter_files.md
yury-fedotov May 17, 2024
2d22bbf
Merge branch 'refs/heads/main' into docs/code-beyond-starter-files
yury-fedotov May 17, 2024
a58ff32
Link deepdives to a list of examples
yury-fedotov May 17, 2024
2ed9639
Merge branch 'refs/heads/main' into docs/code-beyond-starter-files
yury-fedotov May 25, 2024
325489d
Revert weird MLflow release note edit
yury-fedotov May 25, 2024
1853e29
Remove note on changing registry location
yury-fedotov May 25, 2024
9fc168e
Simplify domain logic comment
yury-fedotov May 25, 2024
50a5bba
Replace tp.Dict by dict
yury-fedotov May 25, 2024
3024560
Remove pyproject.toml from monorepo tree example
yury-fedotov May 25, 2024
0df55ba
Replace historical and inference as pipeline split example
yury-fedotov May 25, 2024
a9f0e3a
Add a note about find_pipelines()
yury-fedotov May 25, 2024
c02309c
Remove article before find pipelines
yury-fedotov May 25, 2024
6ee9ad2
Apply suggestions from code review
merelcht Jul 9, 2024
8e811cf
Merge branch 'main' into docs/code-beyond-starter-files
merelcht Jul 9, 2024
fd96088
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Jul 11, 2024
4310cf6
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Jul 18, 2024
0e3dcf7
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Jul 22, 2024
fe19a45
Merge branch 'main' into docs/code-beyond-starter-files
noklam Aug 5, 2024
c20e741
Merge branch 'main' into docs/code-beyond-starter-files
yury-fedotov Aug 28, 2024
a508fe0
Merge branch 'kedro-org:main' into docs/code-beyond-starter-files
yury-fedotov Sep 7, 2024
52ed71e
Implement Nok's comments re: utility functions
yury-fedotov Sep 7, 2024
1 change: 1 addition & 0 deletions RELEASE.md
@@ -7,6 +7,7 @@
## Breaking changes to the API

## Documentation changes
* Added a guide on extending a Kedro project beyond files generated by default.

## Community contributions

131 changes: 131 additions & 0 deletions docs/source/kedro_project_setup/code_beyond_starter_files.md
@@ -0,0 +1,131 @@
# Adding code beyond starter files

After you [create a Kedro project](../get_started/new_project.md) and
[add a pipeline](../tutorial/create_a_pipeline.md), you notice that Kedro generates a
few boilerplate files: `nodes.py`, `pipeline.py`, `pipeline_registry.py`...

While those may be enough for a small project, they become large and hard to
read and collaborate on as your codebase grows.
Those files also sometimes make new users think that Kedro requires code
to be located only in those starter files, which is not true.

Contributor

I like the clarification, but to me this sounds like it is coming from a user rather than the official docs.

In #2512 (comment) we have an answer for this, and the issue is still open. Would it be better to actually write the documentation and link to it here instead?

Contributor Author

Can you elaborate on what exact change you are proposing here?
To reference this GH issue right in code_beyond_starter_files.md?
Or to make the content of the comments on that issue part of this new section of the docs?

Contributor

hmm... I am not the best person to ask for English but I'll try my best🤓

While those may be sufficient for a small project, they quickly become large, hard to read and collaborate on as your codebase grows. Those files also sometimes make new users think that Kedro requires code to be located only in those starter files, which is not true.

I'd rephrase it to something like "When a project becomes large, it may be beneficial to adopt a different structure or give pipeline files more specific names." Naming files pipeline.py is a convention (and find_pipelines() relies on that convention), but it is not mandatory.

Contributor Author

@noklam Your example is specifically about pipeline files, while I wanted to convey 2 things here:

  • As a project evolves and nodes.py / pipeline.py files grow, they become challenging to manage.
  • The good news is that you can rename and restructure them.

Do you disagree with those? I think the first one is just a fact based on how git diff works - if you have one big file, you're more likely to have merge conflicts, etc. And the second one, I wanted to cover not only the pipeline, but also node files there.

merelcht marked this conversation as resolved.

This section explains Kedro's requirements for organising code
in files and modules.
It also provides examples of common scenarios such as sharing utilities between
pipelines and using Kedro in a monorepo setup.

## Where Kedro looks for code

The only technical constraint on arranging code in the project is that the `pipeline_registry.py`
file must be located in the `<your_project>/src/<your_project>` directory, which is where
it is created by default.

This file must have a `register_pipelines()` function that returns a `dict[str, Pipeline]`
mapping from each pipeline name to the corresponding `Pipeline` object.

Other than that, **Kedro does not impose any constraints on where you should keep files with
`Pipeline`s, `Node`s, or functions wrapped by `node`**.

```{note}
You can make Kedro look for the pipeline registry in a different place by modifying the
`__main__.py` file of your project, but such advanced customizations are outside the scope of this article.
```

Because this is the only constraint, you can, for example:
* Add a `utils.py` file to a pipeline folder and import utilities used by multiple
functions in `nodes.py`.
* Delete or rename the default `nodes.py` file, or split it into multiple files or modules.
* Instead of having a single `pipeline.py` in your pipeline folder, split it
into `historical_pipeline.py` and `inference_pipeline.py`.
* Instead of registering many pipelines one by one in the `register_pipelines()` function,
create a few `dict[str, Pipeline]` objects in different places of the project
and then make `register_pipelines()` return a union of those.
Contributor

what about find_pipelines()?

Contributor Author

Added:

While Kedro features [`find_pipelines()` for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery),
for large projects you may want finer control and prefer to register pipelines manually.

Contributor

My questions are:

  • Are historical_pipeline.py and inference_pipeline.py better than pipelines/historical/pipeline.py (the modular pipeline structure) that Kedro usually promotes?

I think this is a viable alternative, but since this is in the docs rather than a blog, I'd probably change the narrative to: there is an alternative, which is to register pipelines manually, explaining that pipeline.py is just a convention that lets find_pipelines() work automatically. Users still have the option to register manually if desired.

@astrojuanlu thoughts?

* Store code that has nothing to do with Kedro `Pipeline` and `Node` concepts, or should
be reused by multiple pipelines of your project, in a module at the same level as the
`pipelines` folder of your project. This scenario is covered in more detail below.
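
The bullet about returning a union of several registries could look roughly like the sketch below. `merge_registries` is a hypothetical helper, not part of Kedro, and the string values are placeholders standing in for real `Pipeline` objects so that the snippet runs without a Kedro installation.

```python
from typing import Any, Mapping


def merge_registries(*registries: Mapping[str, Any]) -> dict[str, Any]:
    """Merge several name-to-pipeline mappings, refusing duplicate names."""
    merged: dict[str, Any] = {}
    for registry in registries:
        # Names already present in `merged` that this registry would overwrite.
        duplicates = merged.keys() & registry.keys()
        if duplicates:
            raise ValueError(f"Duplicate pipeline names: {sorted(duplicates)}")
        merged.update(registry)
    return merged


# Placeholders: in a real project these values would be `Pipeline` objects
# defined in different modules of the project (assumption for illustration).
data_pipelines = {"ingest": "<Pipeline>", "clean": "<Pipeline>"}
model_pipelines = {"train": "<Pipeline>"}

registry = merge_registries(data_pipelines, model_pipelines)
```

In a real project, `register_pipelines()` could return the result of such a merge. Raising on duplicate names is a design choice of this sketch: a plain `{**a, **b}` would silently let the later registry overwrite the earlier one.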

## Common codebase extension scenarios

This section provides examples of how you can handle some common cases of adding more
code to or around your Kedro project.
**The implementations provided are by no means the only ways to achieve the target scenarios**,
and serve illustrative purposes only.

### Sharing utilities between pipelines

Oftentimes you have utilities that have to be imported by multiple `pipelines`.
To keep them as part of a Kedro project, **create a module (for example, `utils`) at the same
level as the `pipelines` folder**, and organise the functionalities there:

```text
├── conf
├── data
├── notebooks
└── src
    ├── my_project
    │   ├── __init__.py
    │   ├── __main__.py
    │   ├── pipeline_registry.py
    │   ├── settings.py
    │   ├── pipelines
    │   └── utils                       <-- Create a module to store your utilities
    │       ├── __init__.py             <-- Required to import from it
    │       ├── pandas_utils.py         <-- Put a file with utility functions here
    │       ├── dictionary_utils.py     <-- Or a few files
    │       └── visualization_utils     <-- Or sub-modules to organize even more utilities
    └── tests
```

For example, to import a function `find_common_keys` from `dictionary_utils.py`:

```python
from my_project.utils.dictionary_utils import find_common_keys
```
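
The source does not show what `find_common_keys` does, so the following is only a guess at what `dictionary_utils.py` might contain: a hypothetical implementation that returns the keys shared by all mappings passed to it.

```python
from typing import Any, Hashable, Mapping


def find_common_keys(*mappings: Mapping[Hashable, Any]) -> set[Hashable]:
    """Return the keys present in every given mapping (hypothetical helper)."""
    if not mappings:
        return set()
    # Start from the first mapping's keys and narrow down with each next one.
    common = set(mappings[0])
    for mapping in mappings[1:]:
        common.intersection_update(mapping.keys())
    return common


shared = find_common_keys({"a": 1, "b": 2}, {"b": 3, "c": 4})
```

Because the function knows nothing about `Pipeline` or `Node`, it can be unit-tested on its own and reused by any pipeline in the project.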

```{note}
For imports like this to be resolved properly by your IDE, you need to perform an editable
installation of the Kedro project into your virtual environment.
The easiest way is to `cd` to the root of the Kedro project and run `pip install -e .`;
more generally, run `pip install -e <root-of-kedro-project>`.
```
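
One way to check whether the project package resolves in the active environment is to ask Python's import machinery directly. This is a minimal sketch: `is_importable` is a hypothetical helper name, and the standard-library `json` module is used as a stand-in so the snippet runs anywhere. In your project you would pass your own package name, such as `my_project` from the tree above.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Report whether a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


# `json` ships with Python, so it is always found; replace it with the
# name of your Kedro project package to check the editable install.
json_found = is_importable("json")
```

If the check fails for your package name, the project is probably not installed in the environment your IDE or interpreter is using.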

### Kedro project in a monorepo setup

The way a Kedro project is generated may give the impression that it must
be the root of a `git` repo. This is not true: just as you can combine
multiple Python packages in a single repo, you can combine multiple Kedro projects,
or a Kedro project with other parts of your software stack.

```{note}
The practice of combining multiple, often unrelated software components in a single version
control repository is not specific to Python and is called [_**monorepo design**_](https://monorepo.tools/).
```

A common situation is that a software product built by a team has components that
are clearly separable from the Kedro project.

Let's use **a recommendation tool for production equipment operators** as an example.
It would consist of three parts:

| **#** | **Part** | **Considerations** |
|-------|----------|--------------------|
| 1 | An ML model, or more precisely, a workflow to prepare the data, train an estimator, and ship it to some registry | <ul> <li>Kedro fits well here, as it lets you develop those pipelines in a modular and extensible way.</li> </ul> |
| 2 | An optimiser that leverages the ML model and implements domain business logic to derive recommendations | <ul> <li>A good design consideration might be to make it independent of the UI framework.</li> </ul> |
| 3 | User interface (UI) application | <ul> <li>This can be a [`plotly`](https://plotly.com/python/) or [`streamlit`](https://streamlit.io/) dashboard.</li> <li>Or even a full-fledged front-end app leveraging a JS framework like [`React`](https://react.dev/).</li> <li>Regardless, this component may know how to access the ML model, but it should probably not know anything about how it was trained or whether Kedro was involved.</li> </ul> |

A suggested solution in this case would be a **monorepo** design. Below is an example:

```text
└── repo_root
    ├── packages
    │   ├── kedro_project          <-- A Kedro project for ML model training.
    │   │   ├── conf
    │   │   ├── data
    │   │   ├── notebooks
    │   │   ├── ...
    │   ├── optimizer              <-- Standalone package.
    │   └── dashboard              <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline.
    ├── requirements.txt           <-- Linters, code formatters... Not dependencies of packages.
    ├── pyproject.toml             <-- Settings for those, like `[tool.isort]`.
    └── ...
```
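
The separation between the packages in the tree above could be expressed in code roughly as follows. This is a sketch under stated assumptions, not a prescribed design: `Predictor`, `recommend_setpoint`, and `QuadraticStub` are hypothetical names, and the "model" is a stub. The point is that the `optimizer` package depends only on an object with a `predict` method, not on Kedro or on how the model was trained.

```python
from typing import Protocol, Sequence


class Predictor(Protocol):
    """Anything exposing a `predict` method over rows of features."""

    def predict(self, rows: Sequence[Sequence[float]]) -> Sequence[float]: ...


def recommend_setpoint(model: Predictor, candidates: Sequence[float]) -> float:
    """Return the candidate setpoint with the highest predicted output."""
    predictions = model.predict([[candidate] for candidate in candidates])
    best_index = max(range(len(candidates)), key=lambda i: predictions[i])
    return candidates[best_index]


class QuadraticStub:
    """Stand-in model for illustration: predicted output peaks at 3.0."""

    def predict(self, rows: Sequence[Sequence[float]]) -> Sequence[float]:
        return [-((row[0] - 3.0) ** 2) for row in rows]


# The dashboard would call the optimizer with whatever model it loaded.
best = recommend_setpoint(QuadraticStub(), [1.0, 2.0, 3.5, 5.0])
```

With this shape, the model produced by the Kedro project can be swapped for any other object with the same method, and the dashboard never imports anything from `kedro_project`.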
1 change: 1 addition & 0 deletions docs/source/kedro_project_setup/index.md
@@ -6,4 +6,5 @@
dependencies
session
settings
code_beyond_starter_files
```