Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2876] [Feature] add possibility to copy dbt local packages instead of make it symlink #8223

Open
3 tasks done
iamtodor opened this issue Jul 27, 2023 · 10 comments · Fixed by #7447
Open
3 tasks done
Labels
deps dbt's package manager enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Refinement Maintainer input needed

Comments

@iamtodor
Copy link

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Hello team,

This is a feature request that comes from the following issues/discussions:

In this piece of code https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/deps/local.py#L62

 if can_create_symlink: 
     fire_event(DepsCreatingLocalSymlink())
     system.make_symlink(src_path, dest_path)
 else: 
     fire_event(DepsSymlinkNotAvailable())
     shutil.copytree(src_path, dest_path)

we are evaluating whether symlink can be setuped. In case the OS does not support symlink (i.e. Windows) we do a deep copy.

We would like to be able to do a dbt local packages deep copy on Linux machines, for instance.

The reasons for this aim are:

Describe alternatives you've considered

As a workaround, I had to write custom Python code, that iterates thru all the packages we have, find the symlink, and make a deep copy of all the content.

Who will this benefit?

In my humble opinion, at least, GCP users would benefit a lot

Are you interested in contributing this feature?

There is a chance, with all the context and pitfalls to be provided

Anything else?

Probably, no. However, if I miss something please feel free to ask all the questions and raise all the concerns you might be having

@iamtodor iamtodor added enhancement New feature or request triage labels Jul 27, 2023
@github-actions github-actions bot changed the title [Feature] add possibility to copy dbt local packages instead of make it symlink [CT-2876] [Feature] add possibility to copy dbt local packages instead of make it symlink Jul 27, 2023
@dbeatty10 dbeatty10 self-assigned this Jul 27, 2023
@dbeatty10
Copy link
Contributor

dbeatty10 commented Jul 27, 2023

Thanks for suggesting this idea @iamtodor !

We always want to be careful with Chesterton's fence, and I don't know the original rationale that went into the decision to utilize symlinks for local packages in abdcdef.

But I'm guessing that it was to avoid duplicating data. If that were the case, then it seems totally fine to me to replace it with a deep copy instead since it wouldn't take up any more additional space than if it had been installed from a remote source. In fact, doing a deep copy seems like it would just bring local packages in line with the way other install methods are written to the file system so it seems safe rather than risky.

I'm wondering if using a deep copy might even be more safe when using dbt clean. I might be misremembering, but I seem to recall accidentally wiping out a local package that way a couple years ago.

I'm going to label this as help_wanted if you would be interested in contributing a PR.

@dbeatty10 dbeatty10 added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors and removed triage labels Jul 27, 2023
@dbeatty10 dbeatty10 removed their assignment Jul 27, 2023
@iamtodor
Copy link
Author

iamtodor commented Jul 28, 2023

Thanks for suggesting this idea @iamtodor !

We always want to be careful with Chesterton's fence, and I don't know the original rationale that went into the decision to utilize symlinks for local packages in abdcdef and

But I'm guessing that it was to avoid duplicating data. If that were the case, then it seems totally fine to me to replace it with a deep copy instead since it wouldn't take up any more additional space than if it had been installed from a remote source. In fact, doing a deep copy seems like it would just bring local packages in line with the way other install methods are written to the file system so it seems safe rather than risky.

I'm wondering if using a deep copy might even be more safe when using dbt clean. I might be misremembering, but I seem to recall accidentally wiping out a local package that way a couple years ago.

I'm going to label this as help_wanted if you would be interested in contributing a PR.

Hello @dbeatty10

Thank you for the reply! I would like to clarify that I do not propose to remove this logic. What I propose is to add the opportunity to control the logic: let's say by default DBT creates symlinks. However, I am able to change this default behavior to install packages with shutil.copytree()

Does it seem reasonable to you?

@jtcohen6 jtcohen6 added the deps dbt's package manager label Jul 31, 2023
@dbeatty10
Copy link
Contributor

I would like to clarify that I do not propose to remove this logic. What I propose is to add the opportunity to control the logic: let's say by default DBT creates symlinks. However, I am able to change this default behavior to install packages with shutil.copytree()

Does it seem reasonable to you?

#7447 proposes to just try creating a symlink and fall back to the deep copy in the case that it doesn't work (i.e., raises an error because it isn't supported on the underlying system).

This seems like it would solve the problem without the user needing to add any configuration. Would that approach work for your use case, @iamtodor?

@iamtodor
Copy link
Author

iamtodor commented Aug 8, 2023

I would like to clarify that I do not propose to remove this logic. What I propose is to add the opportunity to control the logic: let's say by default DBT creates symlinks. However, I am able to change this default behavior to install packages with shutil.copytree()
Does it seem reasonable to you?

#7447 proposes to just try creating a symlink and fall back to the deep copy in the case that it doesn't work (i.e., raises an error because it isn't supported on the underlying system).

This seems like it would solve the problem without the user needing to add any configuration. Would that approach work for your use case, @iamtodor?

Thank you @dbeatty10 for keeping your attention to this issue. The proposed way wouldn't work for us due to the fact we run dbt deps while running Jenkins CI/CD that deploys code to GCP Composer.

So for us would be really convenient to have the configuration option that we could tweak

@iamtodor
Copy link
Author

iamtodor commented Aug 9, 2023

@gshank may I ask you to re-open the issue? As I have explicitly confirmed that solution coming from #7447 would not fit our needs.
CCing @dbeatty10

@iamtodor
Copy link
Author

iamtodor commented Aug 9, 2023

@gshank @dbeatty10 Our Jenkins CI/CD OS allows us to create the symlink, hence #7447 is not applicable.

The issue is while doing copy from the Jenkins machine to GCS. I pointed out these points in my very first message:

The reasons for this aim are:

Please, re-open the issue as it still exists. These are 2 very different use cases. Shall you have more concerns or questions please let me know.

@dbeatty10 dbeatty10 reopened this Aug 9, 2023
@dbeatty10
Copy link
Contributor

Re-opening per request.

Ideally, we'd be able to implement a solution that is both simple for a user and also simple from a code maintenance perspective. To achieve both of those simultaneously, I'm attracted to just removing the symlinks altogether. Doing so would duplicate the code for installed packages, but dbt packages don't tend to take up that much space.

@iamtodor
Copy link
Author

iamtodor commented Aug 9, 2023

I'm attracted to just removing the symlinks altogether. Doing so would duplicate the code for installed packages

This is exactly what I had on my mind, but I thought it might be not the very best possible solution, that's why I didn't mention and presented it to you.

Usually, we write packages quite slim, hence I do not see any potential pitfalls in this particular implementation.

@dbeatty10 dbeatty10 added the Refinement Maintainer input needed label Aug 9, 2023
@dbeatty10
Copy link
Contributor

@graciegoheen labeling this as Refinement since I want to get your take on the following proposal:

  • just remove symlinks altogether when doing dbt deps

See discussion above for more details and justification. Assuming you will tap in others to weigh in as-needed 🙏 If desired, I can do a more detailed write-up of pros/cons.

@jaklan
Copy link

jaklan commented Mar 27, 2024

@dbeatty10 I have found this issue a bit by coincidence, but I was terrified by the proposal of removing symlinks for local packages... It's absolutely critical functionality when working with packages in monorepo setup, when you always want to refer to the state of packages at the current commit and not run dbt deps each time you rebase your branch. Not to mention it's just a de facto standard that local packages can be installed as symlinks* by package installers (pip, npm etc.).

To resolve the issue and not break any workflows - just introduce the similar option like in pip - regular installs vs editable installs:
https://pip.pypa.io/en/stable/topics/local-project-installs/#local-project-installs

*To be more precise - in the Python case it's a bit more sophisticated than relying literally on symlinks:
https://stackoverflow.com/a/77455221/6534668
and that's also the reason why editable installs work smoothly on Windows. It would be great to see a similar mechanism in dbt as we also encounter issues with our Windows developers working with local packages - if symlinks creation fails, shutil.copytree is used and changes in packages are not reflected without re-running dbt deps.

That's also why I think the current fallback approach should be completely removed - there's a significant difference between symlinking and copying files, these two mechanisms simply shouldn't be mixed when running exactly the same command. Here I would again refer to Python editable installs, when you explicitly decide which mechanism you want to use.

So to sum up:

  • symlinking mechanism definitely shouldn't be removed...
  • ...unless replaced with a solution which would be much more reliable on Windows
  • dbt users should be able to specify if they want to install given package in a regular or editable way
  • fallback mechanism should be then removed completely

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deps dbt's package manager enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Refinement Maintainer input needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants