ENH: Changing the recommendation on how to set up a project. #570

Open
tobiasraabe opened this issue Mar 10, 2024 · 10 comments
Labels
blocked This issue is blocked enhancement New feature or request feedback-wanted Feedback from everyone is requested.

Comments

@tobiasraabe
Member

Is your feature request related to a problem?

The documentation contains a couple of sections where the project structure is explained.

All of them propose to structure the project using an src layout (good) where the tasks are within the project folder (bad).

Why is this bad?

  • You cannot use pip install . to install the project but must use the editable mode.

    Why? If you use the normal installation, the paths SRC and BLD defined in config.py will be relative to the installed package path (like /mambaforge/envs/my_project/lib/python3.11/site-packages/my_project/). It means the data is assumed to lie somewhere there.

    Of course, you could add the data to your Python project via MANIFEST.in, but then the data would be copied over to the environment directory on every install, which can be very expensive.

  • The data should not be part of the application.
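The path problem described above can be sketched as follows. This is a minimal illustration, assuming config.py derives SRC from its own location and places BLD two levels above it; the helper `project_paths` and both example paths are hypothetical and not taken from the pytask documentation.

```python
from pathlib import PurePosixPath


def project_paths(config_file: str) -> tuple[PurePosixPath, PurePosixPath]:
    """Mimic a config.py that sets SRC to its own folder and BLD two levels up."""
    src = PurePosixPath(config_file).parent
    bld = src.parent.parent / "bld"
    return src, bld


# Editable install: config.py is imported from the repository checkout.
editable_src, editable_bld = project_paths(
    "/home/user/my_project/src/my_project/config.py"
)

# Regular install: config.py is imported from a copy inside site-packages.
installed_src, installed_bld = project_paths(
    "/envs/my_project/lib/python3.11/site-packages/my_project/config.py"
)

print(editable_bld)
print(installed_bld)
```

Under these assumptions, the editable install keeps BLD inside the repository, while the regular install moves it into the environment directory, far from the data.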

Describe the solution you'd like

The new structure I propose is this one.

my_project
│
├───.pytask
│
├───bld
│   └────...
│
├───data
│   └────...
│
├───src
│   └───my_project
│       ├────__init__.py
│       └────data_preparation.py
│
├───tasks
│   ├────config.py
│   └───data_preparation
│       └────task_data_preparation.py
│
└───pyproject.toml

  1. Tasks are moved to a separate folder, tasks, just like tests.
  2. Data is moved to data, out of src.
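A rough sketch of how the proposed layout could look in code, assuming tasks/config.py derives all directories from the repository root rather than from the installed package. The names `project_dirs`, `clean`, and the file names are hypothetical. The task function is written as plain Python so the sketch runs without pytask; in a real task file, the arguments would carry default paths built from DATA and BLD.

```python
from pathlib import Path


def project_dirs(config_file: Path) -> dict[str, Path]:
    """Derive directories from the location of tasks/config.py (one level below the root)."""
    root = Path(config_file).resolve().parent.parent
    return {"ROOT": root, "DATA": root / "data", "BLD": root / "bld"}


def clean(text: str) -> str:
    """Stand-in for a function that would live in src/my_project/data_preparation.py."""
    return text.strip().lower()


def task_data_preparation(raw: Path, produces: Path) -> None:
    """Shape of a task in tasks/data_preparation/task_data_preparation.py."""
    produces.parent.mkdir(parents=True, exist_ok=True)
    produces.write_text(clean(raw.read_text()))
```

Because the directories depend only on where the tasks folder lives, they stay the same whether the package in src is installed in editable mode or not.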

API breaking implications

None.

Describe alternatives you've considered

None.

Additional Context

Popular templates for data science projects also keep

@tobiasraabe tobiasraabe added enhancement New feature or request feedback-wanted Feedback from everyone is requested. labels Mar 10, 2024
@hmgaudecker
Contributor

This sounds very helpful in general. A couple of questions, remarks:

  1. To what extent is:

    You cannot use pip install . to install the project but must use the editable mode.

    an actual problem in practice? That is, when would I want to use elements of a package without tasks and data? In general, I think these workflows are typically very different from (developer) tests.

  2. A possible answer might be: "Because the package will generically prepare a dataset, specific elements will then be re-used in downstream packages." We have that structure very frequently (e.g., there is a repo cleaning the SOEP data, say clean_soep, which will then be used in specific research projects). We actually had a discussion on that last week, which was initiated by @timmens and @ChristianZimpelmann.

    For that use case, I actually think the tasks should lie next to the modules, you would not want to re-write them for specific projects. Instead, they should become part of the grand DAG.

    So there would need to be an API for getting at the config data (path to the original data, build directory, data catalogs). Maybe the downstream package would need to guarantee that from config import BLD, path_to_soep, soep_catalog works. And one would need a way to tell pytask where to look for task files in the clean_soep package.

  3. Thinking out loud a bit more about the tests analogy, often this might be a case that is more closely aligned with numpy.test() than with developer tests that are not shipped with the package. So maybe an interface would be possible that allows supplying BLD, path_to_soep, soep_catalog, or whatever else is needed to some function of the clean_soep package, whose tasks can then be injected into the DAG of the downstream package?
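One way to picture the interface from point 3 is a factory that receives the downstream project's paths and returns task-like callables. This is purely a sketch of the idea: `SoepConfig` and `build_clean_soep_tasks` are hypothetical names, no such API exists in pytask or in any clean_soep package, the data catalog is left out, and registering the returned callables with pytask is not shown.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class SoepConfig:
    """Paths the downstream project hands to the upstream package (hypothetical)."""

    bld: Path
    path_to_soep: Path


def build_clean_soep_tasks(config: SoepConfig) -> dict[str, Callable[[], Path]]:
    """Return task-like callables with the downstream project's paths bound in."""

    def task_clean_soep() -> Path:
        target = config.bld / "soep" / "cleaned.txt"
        target.parent.mkdir(parents=True, exist_ok=True)
        raw = config.path_to_soep.read_text()
        target.write_text(raw.upper())  # placeholder for the real cleaning step
        return target

    return {"task_clean_soep": task_clean_soep}
```

The design choice here is that the upstream package never reads its own config module; every location comes from the caller, so the same tasks can write into any downstream build directory.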

@ChristianZimpelmann
Contributor

ChristianZimpelmann commented Mar 20, 2024

That is, when would I want to use elements of a package without tasks and data?

I think that is an important question.

In my projects so far, the non-task code in the src folder tends to be closely linked to the tasks and wouldn't be used without them. All code that is independent of the specific tasks in this project is (or is planned to be) a separate package (without tasks attached), which is imported in my_project.

@tobiasraabe
Member Author

tobiasraabe commented May 19, 2024

Hi! Thank you both for your input and sorry for being quiet about this issue.

Since I started the discussion here, some developments have made me wonder whether the changes would substantially improve the situation.

  • rye and pixi have been released and use editable installs. Because these tools hide the concept, users no longer need to understand it, and accidentally installing a project without the editable mode should happen less frequently.
  • I switched the default build system to hatchling, which does not require maintaining a MANIFEST.in since it comes with more sensible defaults.

A more serious point is that I added support for Dask and coiled in pytask-parallel for the upcoming v0.5 release. The way pytask imports task modules causes a problem with Dask unless the task modules live in a package (https://github.com/dask/distributed/issues/8607). So, moving tasks outside of a package is no longer an option for me. Supporting Dask and coiled has higher priority.

@hmgaudecker
Contributor

Great, thanks!

Can we brainstorm at some point about how to accommodate multiple pytask-powered packages in one DAG? (my point 2. above) This really is a very common situation here and I am still looking for the "right" way to do it.

@tobiasraabe
Member Author

tobiasraabe commented May 23, 2024

I have not needed this feature so far, but it is an interesting problem. What have you tried already?

I would probably approach it with

  • Clone your current project to a folder alongside the upstream projects (like soep).
  • In your current project, set task_files = ["src", "../soep/src"] in pyproject.toml
  • Set BLD_SOEP = SRC / ".." / ".." / "soep" / "bld"
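How many levels the BLD_SOEP expression has to climb depends on what SRC points at, which the comment does not spell out. The sketch below only normalizes path strings under two assumed definitions of SRC; the helper `bld_of_sibling` and the /work prefix are hypothetical.

```python
import posixpath


def bld_of_sibling(src: str, levels_up: int, sibling: str = "soep") -> str:
    """Normalize SRC / ".." / ... / sibling / "bld" without touching the file system."""
    parts = [src] + [".."] * levels_up + [sibling, "bld"]
    return posixpath.normpath(posixpath.join(*parts))


# Assumption A: SRC is the src folder of the current project.
print(bld_of_sibling("/work/my_project/src", 2))

# Assumption B: SRC is the package folder inside src.
print(bld_of_sibling("/work/my_project/src/my_project", 2))
print(bld_of_sibling("/work/my_project/src/my_project", 3))
```

With assumption A, two levels reach the sibling checkout. With assumption B, two levels stay inside the current project and a third level is needed.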

Software environment-wise, the whole story would be much easier if pixi had a workspace feature like rye.

@tobiasraabe tobiasraabe added the blocked This issue is blocked label Jun 7, 2024
@felixschmitz

felixschmitz commented Jun 18, 2024

Thanks for the suggestion!

  • I set up two fresh projects using the template (dependent_project and independent_project) and placed them in the same directory.
  • In the dependent_project, I updated the paths variable to ["./src/dependent_project", "../independent_project/src/independent_project"] (did try changing task_files as well, but this made little sense to me per documentation)

Before interacting in any other way with independent_project from dependent_project and running pytask, I received the following error:

[Screenshot 2024-06-18 at 14 42 22: error output, image not preserved]

It seems like pytask is struggling with loading modules from the independent_project.

@ChristianZimpelmann
Contributor

Have you installed both projects or added the path to independent_project to your Python path in some other way?

@felixschmitz

Yes, both projects are installed in editable mode.

@tobiasraabe
Member Author

Still sounds like the most likely cause. Have you activated the environment? Have you checked whether the packages show up in conda list or whatever tool you are using? Can you import the project just from the Python console? Have you set up the package metadata right in pyproject.toml?
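The import check can be scripted with the standard library. A minimal sketch: `describe_import` is a hypothetical helper, and the package names are the ones used in this thread.

```python
import importlib.util
import sys


def describe_import(package: str) -> str:
    """Report whether a top-level package is importable and from which environment."""
    spec = importlib.util.find_spec(package)
    if spec is None:
        return f"{package}: not importable from {sys.prefix}"
    return f"{package}: found at {spec.origin} (environment: {sys.prefix})"


if __name__ == "__main__":
    for name in ("dependent_project", "independent_project"):
        print(describe_import(name))
```

If the reported environment is not the one the projects were installed into, the wrong environment is active.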

@felixschmitz

Thanks for the further hints. I had used pip install -e ., but the wrong env was activated at the time of execution. Everything is working now. So, the suggested approach above works. The "grand DAG" in the dependent_project now also displays all tasks from independent_project.

4 participants