Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancing Kedro Support for Nested Folder Structures #3105

Closed
Gabriel2409 opened this issue Oct 2, 2023 · 13 comments
Closed

Enhancing Kedro Support for Nested Folder Structures #3105

Gabriel2409 opened this issue Oct 2, 2023 · 13 comments
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@Gabriel2409
Copy link

Gabriel2409 commented Oct 2, 2023

Description

This issue stems from #3096 and proposes an enhancement to Kedro to better support working with nested folder structures. While addressing this, we need to ensure that pipeline names remain unique independently of the folder they are in, especially given the new flat parameter structure introduced in 0.18.13.

Context

In big projects, pipelines tend to be grouped into their own folder. However, kedro cli and the new find_pipelines function currently don't support it, making going from flat to nested structure a bit painful sometimes.

Possible Implementation

I've outlined the proposed implementation for this feature in the bullet points below, and I'm eager to hear your thoughts and suggestions. PR is available here: #3106 . Thank you very much.

Create Pipeline

kedro pipeline create myfolder.mypipeline should:

  • Create a pipeline in src/<project_name>/pipelines/myfolder/mypipeline.
  • Generate test files in src/tests/pipelines/myfolder/mypipeline.
  • Create a parameter file mypipeline.yml in conf/base/parameters_mypipeline.yml to adhere to the new flat structure. Note that myfolder does not appear in the name of the file to avoid having very long file names.
  • Check for conflicting pipelines and return an error on conflict. I am in favor of globally unique pipeline names, which means no pipelines with the same name in two different subfolders, so for example kedro pipeline create myfolder.mypipeline followed by kedro pipeline create myotherfolder.mypipeline should result in an error
  • There should be no constraint on the structure, allowing commands like kedro pipeline create parent_pipeline followed by kedro pipeline create parent_pipeline.child_pipeline to work. Running the commands in the opposite order should work as well. Note that on this specific point, we could also force all the pipelines to be at the same level in the hierarchy.

Delete Pipeline

kedro pipeline delete myfolder.mypipeline should delete the pipeline, parameter, and test files as before. However it should fail if a child pipeline is detected.
For example, running kedro pipeline create myfolder.mypipeline, then kedro pipeline create myfolder.mypipeline.musubpipeline and then kedro pipeline delete myfolder.mypipeline should fail and ask the user to first delete myfolder.mypipeline.musubpipeline

Integration with find_pipelines

  • find_pipelines should support nested structures, with pipeline names excluding parent folders to avoid excessively long commands. If a pipeline name appears twice, it should raise an error.
  • For example, running kedro pipeline create myfolder.child_pipeline and then manually creating a pipeline in another_folder.child_pipeline should trigger a failure in find_pipelines. Note that this is an extra security as it should already be impossible to create said pipeline with kedro pipeline create another_folder.child_pipeline
@Gabriel2409 Gabriel2409 added the Issue: Feature Request New feature or improvement to existing feature label Oct 2, 2023
@astrojuanlu astrojuanlu added the Community Issue/PR opened by the open-source community label Oct 2, 2023
@astrojuanlu
Copy link
Member

Thanks @Gabriel2409 for the detailed writeup! Does #2701 help with your feature request?

@astrojuanlu
Copy link
Member

See also #2543

@Gabriel2409
Copy link
Author

Hi @astrojuanlu
Thank you for your reply.

Maybe I did not understand them correctly but it seems these issues are related to overwriting the default template.
My intention is not to modify or overwrite the default template for creating pipelines. Instead, I am simply looking for a way to create pipelines within subfolders without any alterations to the template itself.

I hope this clarifies my request.

@merelcht
Copy link
Member

merelcht commented Oct 10, 2023

Thank you for the clear proposal @Gabriel2409. I think it all makes a lot of sense. The only part I'm not 100% sure about is:

Check for conflicting pipelines and return an error on conflict. I am in favor of globally unique pipeline names, which means no pipelines with the same name in two different subfolders, so for example kedro pipeline create myfolder.mypipeline followed by kedro pipeline create myotherfolder.mypipeline should result in an error.

To me personally, this makes total sense and is probably a necessary restriction to avoid confusing for e.g. pipeline parameter files. But my uncertainty comes from not knowing enough about how most users structure their pipelines. Would this break existing pipeline set up, or is this nested structure something users haven't generally implemented because Kedro doesn't support it so well?

I'd love to get some more thoughts on this. Tagging: @marrrcin, @Galileo-Galilei, @deepyaman, @yetudada

@MatthiasRoels
Copy link

This would be a super nice feature to have! I was experimenting with the same idea (I use nested folders regularly to bring additional structure in the project), but I haven’t gotten nearly as far as you did.

@merelcht as long as this is just an extension of the current way of creating pipelines, i.e. not using subfolder behaves exactly the same, this is not a breaking change. User can stil create their own custom file structure without using find_pipelines and any kedro pipeline cmd. As long as the registry gets populated properly that is…

@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jan 3, 2024
@deepyaman
Copy link
Member

Sorry that I didn't see this previously! I completely agree that this is a very useful feature to have. If I recall correctly (it's been a few years!), when I used to lead development of CustomerOne (a McKinsey-internal asset), we had extended Kedro to support nested hierarchy, because of some of the reasons you mention. On the whole, I'd personally be very supportive of this.

Check for conflicting pipelines and return an error on conflict. I am in favor of globally unique pipeline names, which means no pipelines with the same name in two different subfolders, so for example kedro pipeline create myfolder.mypipeline followed by kedro pipeline create myotherfolder.mypipeline should result in an error

I personally would rather not restrict this. I think you may artificially need to come up with different names for pipelines that are similar in function, just in different top-level packages. I think the parameters/catalog hierarchies could be created to deal with this pretty easily.

@Gabriel2409
Copy link
Author

Hello everyone,
Thank you for your feedbacks.

@deepyaman Here is my reasonning on the proposed limitations:

Parameter files flat structure

kedro now has a flat structure for the parameter files: when creating a pipeline, a file parameters_mypipeline.yml is created.
As such, if we create two pipelines with the same name, there is a potential conflict. Of course in the context of a nested file structure, we could include the full path in the file name but it may lead to very long file names.

Registered name in find_pipelines

If we want to support different names, we must change find_pipelines to register the full path. Instead of using kedro run mypipeline, we would need to use kedro run path.to.my.pipeline, which might be a bit overkill. Moreover, it is inconsistent with a parameter file that only includes the pipeline name.


I see 3 possibilities for the final design:

  1. Prevent duplicate pipeline names: Keep parameter files flat structure. Pipeline is fully described with its name
  2. Authorize duplicate pipeline names: Here we must decide if we modify the parameter file name for ex by adding a suffix (parameters_mypipeline.yml and parameters_mypipeline-2.yml) or something else. Moreover, we must decide how the pipeline is registered in find_pipelines (mypipeline and mypipeline-2?)
  3. Warn on conflict and prompt user to use a different name for find_pipelines (but in this case I fail to see the interest vs just renaming the pipeline)

What do you all think?

@merelcht
Copy link
Member

Hi @Gabriel2409, thanks so much for your further comments. We have discussed this issue as a team and these were some of the points that came up:

  • We don't have a good view on how projects are structures right now. Are users already using nested structures and if so how are they structuring their projects? And would this implementation break their flows?
  • How many levels of nesting would we allow? Could be unlimited, or we add a restriction sticking to what we as Kedro think is "best practice".
  • The unique name limitation seems to be too restrictive.
  • We'll need to check how this impacts the micropackaging workflows.

We do like this proposal as a team, but would only invest in having it fully implemented if there's enough interest from users. With regards to that, would you be able to help with that and gauge interest through our user-research channel on Slack? We can connect there and I can explain more 🙂

@Gabriel2409
Copy link
Author

@merelcht Thank you for your reply.

I am ok with continuing the discussion on slack. Could you please share the link?

@merelcht
Copy link
Member

@merelcht Thank you for your reply.

I am ok with continuing the discussion on slack. Could you please share the link?

Yes, it's https://slack.kedro.org/. The channel for user research is called #user-research

@merelcht
Copy link
Member

Hi @Gabriel2409 it's been a while. Are you still interested in this topic?

@Gabriel2409
Copy link
Author

Hi @merelcht I am still interested in the feature but I don't really have time to work on it for the moment.
In light of what you said in January, I think we can close the issue and the associated PR.

@merelcht
Copy link
Member

Hi @merelcht I am still interested in the feature but I don't really have time to work on it for the moment. In light of what you said in January, I think we can close the issue and the associated PR.

No worries! In that case, I suggest moving this to github discussions so interested people can continue the discussion there 🙂

@kedro-org kedro-org locked and limited conversation to collaborators May 24, 2024
@merelcht merelcht converted this issue into discussion #3891 May 24, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation
Projects
None yet
Development

No branches or pull requests

5 participants