Create a docs page about adding code beyond starter files #3852
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,131 @@ | ||
# Adding code beyond starter files | ||
|
||
After you [create a Kedro project](../get_started/new_project.md) and | ||
[add a pipeline](../tutorial/create_a_pipeline.md), you notice that Kedro generates a | ||
few boilerplate files: `nodes.py`, `pipeline.py`, `pipeline_registry.py`... | ||
|
||
While those may be sufficient for a small project, they quickly become large, hard to | ||
read and collaborate on as your codebase grows. | ||
Those files also sometimes make new users think that Kedro requires code | ||
to be located only in those starter files, which is not true. | ||
|
||
This section explains what Kedro requires in terms of organising code | ||
in files and modules. | ||
It also provides examples of common scenarios such as sharing utilities between | ||
pipelines and using Kedro in a monorepo setup. | ||
|
||
## Where Kedro looks for code | ||
|
||
The only technical constraint for arranging code in the project is that the `pipeline_registry.py` | ||
file must be located in the `<your_project>/src/<your_project>` directory, which is where | ||
it is created by default. | ||
|
||
This file must have a `register_pipelines()` function that returns a `dict[str, Pipeline]` | ||
mapping from each pipeline name to the corresponding `Pipeline` object. | ||
|
||
Other than that, **Kedro does not impose any constraints on where you should keep files with | ||
`Pipeline`s, `Node`s, or functions wrapped by `node`**. | ||
|
||
```{note} | ||
You can make Kedro look for the pipeline registry in a different place by modifying the | ||
`__main__.py` file of your project, but such advanced customizations are outside the scope of this article. | ||
``` | ||
|
||
This being the only constraint means that you can, for example: | ||
* Add a `utils.py` file to a pipeline folder and import utilities used by multiple | ||
functions in `nodes.py`. | ||
* Delete or rename the default `nodes.py` file, or split it into multiple files or modules. | ||
* Instead of having a single `pipeline.py` in your pipeline folder, split it | ||
into `historical_pipeline.py` and `inference_pipeline.py`. | ||
* Instead of registering many pipelines in the `register_pipelines()` function one by one, | ||
create a few `dict[str, Pipeline]` objects in different places of the project | ||
and then make `register_pipelines()` return a union of those. | ||
what about

Added:

My questions are:
I think this is a viable alternative, but if this is in docs instead of a blog. I'll probably change the narrative to: There is an alternative to register pipeline manually, explaining the @astrojuanlu thought? |
||
* Store code that has nothing to do with Kedro `Pipeline` and `Node` concepts, or should | ||
be reused by multiple pipelines of your project, in a module at the same level as the | ||
`pipelines` folder of your project. This scenario is covered in more detail below. | ||
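
The bullet about returning a union of dictionaries could be sketched roughly as follows. This is only an illustration, not part of Kedro's API: `merge_pipeline_registries` is a hypothetical helper, and rejecting duplicate names is an assumption about the behaviour you might want. `Pipeline` is imported for type hints only, so the helper itself runs without Kedro installed.

```python
from typing import TYPE_CHECKING, Dict

if TYPE_CHECKING:
    # Imported only for type hints; the helper works on any mapping values.
    from kedro.pipeline import Pipeline


def merge_pipeline_registries(
    *registries: Dict[str, "Pipeline"],
) -> Dict[str, "Pipeline"]:
    """Combine several name-to-pipeline dictionaries into one (hypothetical helper)."""
    merged: Dict[str, "Pipeline"] = {}
    for registry in registries:
        # Refuse to silently overwrite a pipeline registered under the same name.
        duplicates = merged.keys() & registry.keys()
        if duplicates:
            raise ValueError(f"Duplicate pipeline names: {sorted(duplicates)}")
        merged.update(registry)
    return merged
```

`register_pipelines()` could then return the result of calling this helper on dictionaries defined elsewhere in the project, for example one per group of related pipelines.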
|
||
## Common codebase extension scenarios | ||
|
||
This section provides examples of how you can handle some common cases of adding more | ||
code to or around your Kedro project. | ||
**The provided implementations are by no means the only ways to achieve the target scenarios**, | ||
and serve only illustrative purposes. | ||
|
||
### Sharing utilities between pipelines | ||
|
||
Often you have utilities that need to be imported by multiple pipelines. | ||
To keep them as part of a Kedro project, **create a module (for example, `utils`) at the same | ||
level as the `pipelines` folder**, and organise the functionality there: | ||
|
||
```text | ||
├── conf | ||
├── data | ||
├── notebooks | ||
└── src | ||
    ├── my_project | ||
    │   ├── __init__.py | ||
    │   ├── __main__.py | ||
    │   ├── pipeline_registry.py | ||
    │   ├── settings.py | ||
    │   ├── pipelines | ||
    │   └── utils                       <-- Create a module to store your utilities | ||
    │       ├── __init__.py             <-- Required to import from it | ||
    │       ├── pandas_utils.py         <-- Put a file with utility functions here | ||
    │       ├── dictionary_utils.py     <-- Or a few files | ||
    │       └── visualization_utils     <-- Or sub-modules to organize even more utilities | ||
    └── tests | ||
``` | ||
|
||
For example, to import a function `find_common_keys` from `dictionary_utils.py`: | ||
|
||
```python | ||
from my_project.utils.dictionary_utils import find_common_keys | ||
``` | ||
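
The docs page does not show what `find_common_keys` does, so the following is only a guess at what such a helper in `dictionary_utils.py` might look like: a hypothetical implementation that returns the keys present in every dictionary passed to it.

```python
from typing import Any, Mapping, Set


def find_common_keys(*dictionaries: Mapping[Any, Any]) -> Set[Any]:
    """Return the keys present in every given dictionary (hypothetical helper)."""
    if not dictionaries:
        return set()
    # Start from the first dictionary's keys and keep only those found in the rest.
    common = set(dictionaries[0])
    for dictionary in dictionaries[1:]:
        common.intersection_update(dictionary)
    return common
```

Because the function has no dependency on Kedro, any pipeline's `nodes.py` can import it, and it can be unit-tested on its own.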
|
||
```{note} | ||
For imports like this to be resolved properly by your IDE, you need to perform an editable | ||
installation of the Kedro project into your virtual environment. | ||
To do this, run `pip install -e <root-of-kedro-project>`. The easiest way is to `cd` to | ||
the root of the Kedro project and run `pip install -e .`. | ||
``` | ||
|
||
### Kedro project in a monorepo setup | ||
|
||
The way a Kedro project is generated may give the impression that it should | ||
only act as the root of a `git` repo. This is not true: just as you can combine | ||
multiple Python packages in a single repo, you can combine multiple Kedro projects, | ||
or a Kedro project with other parts of your project's software stack. | ||
|
||
```{note} | ||
The practice of combining multiple, often unrelated software components in a single version | ||
control repository is not specific to Python and is called [_**monorepo design**_](https://monorepo.tools/). | ||
``` | ||
|
||
A common use case of Kedro is that a software product built by a team has components that | ||
are well separable from the Kedro project. | ||
|
||
Let's use **a recommendation tool for production equipment operators** as an example. | ||
It would consist of three parts: | ||
|
||
| **#** | **Part** | **Considerations** | | ||
|-------|--------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| 1 | An ML model, or more precisely, a workflow to prepare the data, train an estimator, ship it to some registry | <ul> <li>Here Kedro fits well, as it lets you develop those pipelines in a modular and extensible way.</li> </ul> | | ||
| 2 | An optimiser that leverages the ML model and implements domain business logic to derive recommendations | <ul> <li>A good design consideration might be to make it independent of the UI framework.</li> </ul> | | ||
| 3 | User interface (UI) application | <ul> <li>This can be a [`plotly`](https://plotly.com/python/) or [`streamlit`](https://streamlit.io/) dashboard.</li> <li>Or even a full-fledged front-end app leveraging a JS framework like [`React`](https://react.dev/).</li> <li>Regardless, this component may know how to access the ML model, but it should probably not know anything about how it was trained or whether Kedro was involved.</li> </ul> | | ||
|
||
A suggested solution in this case would be a **monorepo** design. Below is an example: | ||
|
||
```text | ||
└── repo_root | ||
    ├── packages | ||
    │   ├── kedro_project   <-- A Kedro project for ML model training. | ||
    │   │   ├── conf | ||
    │   │   ├── data | ||
    │   │   ├── notebooks | ||
    │   │   ├── ... | ||
    │   ├── optimizer       <-- Standalone package. | ||
    │   └── dashboard       <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline. | ||
    ├── requirements.txt    <-- Linters, code formatters... Not dependencies of packages. | ||
    ├── pyproject.toml      <-- Settings for those, like `[tool.isort]`. | ||
    └── ... | ||
``` |
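
One way to keep the `optimizer` package unaware of how the model was trained is to have it depend only on a small prediction interface. The sketch below is purely illustrative: `Predictor`, `recommend_setting`, and the idea of picking the best of a list of candidate settings are all hypothetical, and none of them come from Kedro or from the project layout above.

```python
from typing import Protocol, Sequence


class Predictor(Protocol):
    """Anything that can score a candidate setting (hypothetical interface)."""

    def predict(self, setting: float) -> float: ...


def recommend_setting(model: Predictor, candidates: Sequence[float]) -> float:
    """Return the candidate setting with the highest predicted outcome."""
    if not candidates:
        raise ValueError("At least one candidate setting is required")
    # The optimizer only calls `predict`; it does not care how the model was built.
    return max(candidates, key=model.predict)
```

With a boundary like this, `dashboard` can import `optimizer` and pass in whatever model object it loaded, without either package importing anything from `kedro_project`.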
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,4 +6,5 @@ | |
dependencies | ||
session | ||
settings | ||
code_beyond_starter_files | ||
``` |
I like the clarification, but I found this sounds like it is coming from a user rather than the official docs.
#2512 (comment), we have an answer for this and the issue is still open. Would it be better to actually write the documentation and link here instead?

Can you elaborate more on what exact change you are proposing here?
To reference this GH issue right in the `code_beyond_starter_files.md`? Or to make the content of comments on that issue part of this new section of docs?

hmm... I am not the best person to ask for English but I'll try my best🤓
I'd rephrase it to something like "When a project becomes large, it may be beneficial to adopt a different structure or give pipeline files more specific names." It is by convention (and `find_pipeline` uses that convention), but not mandatory to name files `pipeline.py`

@noklam Your example is specifically around pipeline files, while I wanted to convey 2 things here:
`nodes.py`/`pipeline.py` files grow, they become challenging to manage.
Do you disagree with those? I think the first one is just a fact based on how git diff works - if you have one big file, you're more likely to have merge conflicts, etc. And the second one, I wanted to cover not only the pipeline, but also node files there.