Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-748] [Enhancement] dbt deps should resolve multiple packages with same git: repo and unique subdirectory: #5374

Closed
1 task done
GClunies opened this issue Jun 15, 2022 · 19 comments · Fixed by #8322
Closed
1 task done
Labels
deps dbt's package manager enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors paper_cut A small change that impacts lots of users in their day-to-day

Comments

@GClunies
Copy link

GClunies commented Jun 15, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I have a single GitHub repo where I have defined multiple private dbt packages (i.e. a monorepo with >1 dbt packages) in a dbt_packages/ directory. Each dbt package is in it's own subdirectory in the repo like so:

image

Each dbt package contains models for Segment event data, modeled in a structure like so:

reflekt_surfline_web
├── analyses
├── macros
├── models
│   │   docs
│   │   ├── _src_surfline_web.yml
│   │   ├── _stg_surfline_web__accepted_subscription_offer.yml
│   │   ├── _stg_surfline_web__account_created.yml
│   │   ├── _stg_surfline_web__added_card.yml
│   │   ├── _stg_surfline_web__added_product.yml
│   │   ├── _stg_surfline_web__added_session.yml
│   │   └── ... many more docs
│   ├── stg_surfline_web__accepted_subscription_offer.sql
│   ├── stg_surfline_web__account_created.sql
│   ├── stg_surfline_web__added_card.sql
│   ├── stg_surfline_web__added_product.sql
│   ├── stg_surfline_web__added_session.sql
│   └── ... many more models
├── seeds
├── snapshots
├── tests
├── .gitignore
├── dbt_project.yml
└── README.md

In my dbt project, I have uniquely referenced each package in my packages.yml like so:

packages:
  - package: calogica/dbt_date  # This includes dbt-labs/dbt-utils
    version: [">=0.5.0", "<0.6.0"]

  - package: dbt-labs/dbt_utils  # This is included due to a bug in union_relations for v0.8.1
    version: 0.8.0

  - git: "https://github.com/Surfline/wavetrak-segment"      # Private GitHub repo 
    subdirectory: "dbt_packages/reflekt_surfline_android"  # Unique subdirectory holding private dbt package
    revision: v0.1.0__reflekt_surfline_android                         # Git tag pointing to commit where package was built

  - git: "https://github.com/Surfline/wavetrak-segment"
    subdirectory: "dbt_packages/reflekt_surfline_ios"
    revision: v0.1.0__reflekt_surfline_ios

  - git: "https://github.com/Surfline/wavetrak-segment"
    subdirectory: "dbt_packages/reflekt_surfline_web"
    revision: v0.1.0__reflekt_surfline_web

I would have thought that with this information, dbt would be able to pull each private dbt package from my repo, its respective subdirectory:, and revision: (Git tag), but instead I get the error.

❯ dbt deps
03:10:52  Running with dbt=1.0.1
03:10:54  Encountered an error:
git dependencies should contain exactly one version. https://github.com/Surfline/wavetrak-segment contains: {'v0.1.0__reflekt_surfline_android', 'v0.1.0__reflekt_surfline_ios', 'v0.1.0__reflekt_surfline_web'}

I am actually pointing dbt to 3 separate dbt packages in the same repo, which are unique identified by the config in my packages.yml, but dbt seems to confused by this. Perhaps I am pushing the limits of how the git: config in packages.yml was intended to be used?

I am not super familiar with the dbt-core codebase, but the error appears to be originating from the logic here.

    def resolved(self) -> GitPinnedPackage:
        requested = set(self.revisions)
        if len(requested) == 0:
            requested = {"HEAD"}
        elif len(requested) > 1:
            raise_dependency_error(
                "git dependencies should contain exactly one version. "
                "{} contains: {}".format(self.git, requested)
            )

Expected Behavior

With the packages.yml provide in the previous section of this issue, I would expect dbt to be able to resolve each package like so:

🕙 20:30:55 ✖  dbt deps
04:22:27  Running with dbt=1.0.1
04:22:36  Installing calogica/dbt_date
04:22:37    Installed from version 0.5.7
04:22:37    Up to date!
04:22:37  Installing dbt-labs/dbt_utils
04:22:38    Installed from version 0.8.0
04:22:38    Updated version available: 0.8.5
04:22:38  Installing dbt-labs/audit_helper
04:22:38    Installed from version 0.5.0
04:22:38    Up to date!
04:22:38  Installing dbt-labs/codegen
04:22:39    Installed from version 0.5.0
04:22:39    Updated version available: 0.6.0
04:22:39  Installing https://github.com/Surfline/wavetrak-segment
04:22:40    Installed from revision v0.1.0__reflekt_surfline_android
04:22:40     and subdirectory dbt_packages/reflekt_surfline_android
04:22:41  Installing https://github.com/Surfline/wavetrak-segment
04:22:42    Installed from revision v0.1.0__reflekt_surfline_ios
04:22:42     and subdirectory dbt_packages/reflekt_surfline_ios
04:22:43  Installing https://github.com/Surfline/wavetrak-segment
04:22:44    Installed from revision v0.1.0__reflekt_surfline_web
04:22:44     and subdirectory dbt_packages/reflekt_surfline_web

It seems strange to me that this pattern does not work for private packages specified using the git:. Having to maintain each private package in a separate repo does not seem practical.

Steps To Reproduce

  1. Put multiple private dbt packages in a single repo, each in their own subdirectory.
  2. Reference >1 private dbt package from the same repo in a downstream dbt project, using the dbt docs on Git Packages and project subdirectories as a guide.
  3. Run dbt deps
  4. Error!

Relevant log output

The logs were pretty light when I looked

============================== 2022-06-15 04:34:19.887142 | 39c78805-7315-4790-8776-691648e8a383 ==============================
04:34:19.887142 [info ] [MainThread]: Running with dbt=1.0.1
04:34:19.887621 [debug] [MainThread]: running dbt with arguments Namespace(record_timing_info=None, debug=None, log_format=None, write_json=None, use_colors=None, printer_width=None, warn_error=None, version_check=None, partial_parse=None, single_threaded=False, use_experimental_parser=None, static_parser=None, profiles_dir='/Users/gclunies2-mba/.dbt', send_anonymous_usage_stats=None, fail_fast=None, event_buffer_size=None, project_dir=None, profile=None, target=None, vars='{}', log_cache_events=False, defer=None, state=None, cls=<class 'dbt.task.deps.DepsTask'>, which='deps', rpc_method='deps')
04:34:19.887820 [debug] [MainThread]: Tracking: tracking
04:34:19.910745 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1125e94c0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x11260f6d0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x11260fbe0>]}
04:34:19.913571 [debug] [MainThread]: Set downloads directory='/var/folders/k7/1zsppsn511bdd73clcsj_f_w0000gn/T/dbt-downloads-z8ep59ut'
04:34:19.915299 [debug] [MainThread]: Making package registry request: GET https://hub.getdbt.com/api/v1/index.json
04:34:20.199197 [debug] [MainThread]: Response from registry: GET https://hub.getdbt.com/api/v1/index.json 200
04:34:20.199932 [debug] [MainThread]: Making package registry request: GET https://hub.getdbt.com/api/v1/calogica/dbt_date.json
04:34:20.444646 [debug] [MainThread]: Response from registry: GET https://hub.getdbt.com/api/v1/calogica/dbt_date.json 200
04:34:20.454957 [debug] [MainThread]: Making package registry request: GET https://hub.getdbt.com/api/v1/calogica/dbt_date/0.5.7.json
04:34:20.695370 [debug] [MainThread]: Response from registry: GET https://hub.getdbt.com/api/v1/calogica/dbt_date/0.5.7.json 200
04:34:20.696113 [debug] [MainThread]: Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json
04:34:20.938174 [debug] [MainThread]: Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json 200
04:34:20.961995 [debug] [MainThread]: Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils/0.8.0.json
04:34:21.159002 [debug] [MainThread]: Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils/0.8.0.json 200
04:34:21.159579 [debug] [MainThread]: Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/audit_helper.json
04:34:21.348866 [debug] [MainThread]: Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/audit_helper.json 200
04:34:21.352812 [debug] [MainThread]: Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/audit_helper/0.5.0.json
04:34:21.549011 [debug] [MainThread]: Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/audit_helper/0.5.0.json 200
04:34:21.549574 [debug] [MainThread]: Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/codegen.json
04:34:21.755913 [debug] [MainThread]: Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/codegen.json 200
04:34:21.759732 [debug] [MainThread]: Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/codegen/0.5.0.json
04:34:21.960422 [debug] [MainThread]: Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/codegen/0.5.0.json 200
04:34:21.961332 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1125e94c0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1125eafa0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1125eaaf0>]}

Environment

- OS: Mac
- Python: 3.9.12
- dbt: 1.0.1

What database are you using dbt with?

redshift

Additional Context

dbt deps works when I specifying a single dbt packages from the Git repo.

packages:
  - package: calogica/dbt_date  # This includes dbt-labs/dbt-utils
    version: [">=0.5.0", "<0.6.0"]

  - package: dbt-labs/dbt_utils  # This is included due to a bug in union_relations for v0.8.1
    version: 0.8.0

  - package: dbt-labs/audit_helper
    version: 0.5.0

  - package: dbt-labs/codegen
    version: 0.5.0

  # Reflekt dbt packages
  - git: "https://github.com/Surfline/wavetrak-segment"
    subdirectory: "dbt_packages/reflekt_surfline_android"
    revision: v0.1.0__reflekt_surfline_android
❯ dbt deps     
04:39:39  Running with dbt=1.0.1
04:39:49  Installing calogica/dbt_date
04:39:49    Installed from version 0.5.7
04:39:49    Up to date!
04:39:49  Installing dbt-labs/dbt_utils
04:39:50    Installed from version 0.8.0
04:39:50    Updated version available: 0.8.5
04:39:50  Installing dbt-labs/audit_helper
04:39:51    Installed from version 0.5.0
04:39:51    Up to date!
04:39:51  Installing dbt-labs/codegen
04:39:51    Installed from version 0.5.0
04:39:51    Updated version available: 0.6.0
04:39:51  Installing https://github.com/Surfline/wavetrak-segment
04:39:53    Installed from revision v0.1.0__reflekt_surfline_android
04:39:53     and subdirectory dbt_packages/reflekt_surfline_android
04:39:53  
04:39:53  Updates available for packages: ['dbt-labs/dbt_utils', 'dbt-labs/codegen']                 
Update your versions in packages.yml, then run dbt deps
@GClunies GClunies added bug Something isn't working triage labels Jun 15, 2022
@github-actions github-actions bot changed the title [Bug] dbt deps cannot resolve multiple private packages in same git: repo with unique subdirectory: and revision: configurations [CT-748] [Bug] dbt deps cannot resolve multiple private packages in same git: repo with unique subdirectory: and revision: configurations Jun 15, 2022
@GClunies GClunies changed the title [CT-748] [Bug] dbt deps cannot resolve multiple private packages in same git: repo with unique subdirectory: and revision: configurations [CT-748] [Bug] dbt deps cannot resolve multiple private packages in same git: repo with unique subdirectory: and revision: config Jun 15, 2022
@GClunies GClunies changed the title [CT-748] [Bug] dbt deps cannot resolve multiple private packages in same git: repo with unique subdirectory: and revision: config [CT-748] [Bug] dbt deps cannot resolve multiple packages in packages.yml when in same git: repo with unique subdirectory: and revision: configs Jun 15, 2022
@GClunies GClunies changed the title [CT-748] [Bug] dbt deps cannot resolve multiple packages in packages.yml when in same git: repo with unique subdirectory: and revision: configs [CT-748] [Bug] dbt deps cannot resolve multiple packages in packages.yml when in same git: repo but with unique subdirectory: and revision: configs Jun 15, 2022
@jtcohen6 jtcohen6 added Team:Language deps dbt's package manager labels Jun 15, 2022
@jtcohen6
Copy link
Contributor

@GClunies Thanks for the clear write-up!

I think it's fair to say that this isn't part of the intended or expected support that we have for git packages today. I'm not strictly opposed to ever supporting it, but I'm going to reclassify this as an enhancement rather than a bug.

To make a long story short, the git dependency came first, and the subdirectory specification came (much) later. We had some sensible logic in place to detect duplicate git dependencies (the code you spotted). The addition of the subdirectory option added flexibility to git packages, but it did not update that logic. I suppose it could, and the change wouldn't be too involved—but I am wary of the increasing complexity here. What if one of the subdirectories depends on another subdirectory, and they're both specified in your packages.yml? (#4975)

In the meantime, I think you might be able to get this working by adding a new "project" within https://github.com/Surfline/wavetrak-segment, that installs the others as local dependencies—assuming you're able to use a single revision (a.k.a. "release") for all of them—and then installing that single "bundled" package via git in your root project's packages.yml. The relative paths may get a bit tricky, so I'm not sure if that's more or less expedient than splitting apart the monorepo, for packages that really do need to be installed (and versioned!) independently.

@jtcohen6 jtcohen6 added enhancement New feature or request and removed bug Something isn't working triage labels Jun 15, 2022
@jtcohen6 jtcohen6 changed the title [CT-748] [Bug] dbt deps cannot resolve multiple packages in packages.yml when in same git: repo but with unique subdirectory: and revision: configs [CT-748] [Enhancement] dbt deps should resolve multiple packages with same git: repo and unique subdirectory: Jun 15, 2022
@GClunies
Copy link
Author

GClunies commented Jun 15, 2022

Thanks for getting back to me so quickly @jtcohen6, I really appreciate it.

I had previously thought that the git and subdirectory specifications were "connected" in a way to uniquely identify a package (a big assumption that I did not test), but your explanation makes sense as to how the current functionality came to be. Would it be possible to make it clearer in the docs that a git repo can only have a single package for the time being?

To provide some context as to how I got here, the dbt packages described in my original issue are being created by Reflekt, an open-source tool I am developing to template dbt packages that model and document the events in a tracking plan from tools like Avo and Segment Protocols. My original framework was that each tracking plan (if there were multiple) would have its own corresponding dbt package, hence multiple packages in a single repo (that holds a Reflekt project). Seems that assumption I made has come back to bite me!

As for solutions to my issue, I see 3 options:

  1. As you suggested, add a new "project" within https://github.com/Surfline/wavetrak-segment that installs the Reflekt dbt packages using local. Conceptually I get it, but feel like this makes it hard to visualize the "chain" of packages in my downstream dbt project.
  2. I open a PR to modify the logic in the code I spotted. Goal would be to uniquely identify packages in private repos using the combination of git, subdirectory, and revision. This would be my first adventure into the dbt-core codebase, so if you don't think this is appropriate for a good_first_issue, I will avoid this. As you called out, it seems a dbt user could (unknowingly?) get into some gnarly dependencies if [CT-438] dbt deps should detect cyclical dependencies #4975 was not addressed before such a change.
  3. I change the logic in Reflekt so that its output is a single dbt package, modeling and documenting the events from multiple tracking plans, likely in their own subdirectories (really coming full circle here 🔁) and still leveraging git tags when any update is made to the (now) single package.

I think my safest bet is option 3. Option 2 is tempting, but also feels like a rabbit hole.

@leahwicz
Copy link
Contributor

@GClunies I'm going to close this out since it looks like you implemented Option 3 above. If you disagree or still have an ask, feel free to re-open

@GClunies
Copy link
Author

GClunies commented Jun 23, 2022

Hi @leahwicz, I did implement option 3 above. But this was only in response to the unexpected behavior I outlined in the original issue description. I think in an ideal world, dbt-core would support the expected behavior I described (that's my ask). I personally feel it's a valid enhancement request.

But I can empathize that solving edge cases like this is likely not a priority. Maybe an update to the docs clarifying the limitations of the subdirectory config would be helpful? That would have steered me to Option 3 in the first place.

@leahwicz
Copy link
Contributor

Sorry about that @GClunies! I am opening this back up to keep track of the feature request. We most likely won't be able to prioritize this in the near future so I'm going to put the help_wanted label on it so if somebody feels inclined, they can pick it up and start working on it. I'll also go back and look at making the documentation clearer so thanks for pointing that out!

@leahwicz leahwicz reopened this Jun 23, 2022
@leahwicz leahwicz added the help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors label Jun 28, 2022
@amirbtb
Copy link

amirbtb commented Aug 24, 2022

Hey, I also run into the same issue. My team designed a GitHub repository with a subfolder per dbt project (with a dbt project per data source) + 1 release per dbt_project. We just discovered that referencing multiple packages that are subfolders of the same repo doesn't work out.
This feature could be very useful for teams with a high number of repositories who want to centralize all dbt projects in one git repo.

@dlawrences
Copy link

Is this planned for any near future delivery? In the docs, it feels like dbt is hinting this should work with a monorepo - doesn't specifically talk about reaching out for multiple dbt projects within the monorepo from a single dependent dbt project, but it is unexpected and should ideally be made part of the docs.

@philippeboyd
Copy link
Contributor

Hey, I also run into the same issue. My team designed a GitHub repository with a subfolder per dbt project (with a dbt project per data source) + 1 release per dbt_project. We just discovered that referencing multiple packages that are subfolders of the same repo doesn't work out. This feature could be very useful for teams with a high number of repositories who want to centralize all dbt projects in one git repo.

+1 I'm in the exact same boat as @amirbtb , any news on this @jtcohen6 ?

@OatmealLick
Copy link

+1 from me as well, ran into this issue at my work

@saraleon1
Copy link

+1 dbt Cloud customer. This customer has multiple projects in a single repository to align with their internal git best practices. They have several private packages, and want to be able to include them in a main project while keeping all packages in a single directory.

@jtcohen6 jtcohen6 added the paper_cut A small change that impacts lots of users in their day-to-day label Apr 5, 2023
@scott-anderson-ef
Copy link

+1 would be great to have this

@gem7318
Copy link
Contributor

gem7318 commented Jul 7, 2023

image

@jtcohen6 Andddd I'm here now, lmao--just painfully discovered that I can't use these like I intended to once I started layering on multiple private packages from the same repo.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jul 7, 2023

@gem7318 Nice to see you again, and so soon :)

Coming back to this proposal from above:

I open a PR to modify the logic in the code I spotted. Goal would be to uniquely identify packages in private repos using the combination of git, subdirectory, and revision. This would be my first adventure into the dbt-core codebase, so if you don't think this is appropriate for a good_first_issue, I will avoid this. As you called out, it seems a dbt user could (unknowingly?) get into some gnarly dependencies if #4975 was not addressed before such a change.

While I think it might be tricky — if you (or someone in this thread) wanted to take a crack at this, I wouldn't be opposed. The starting point would be incorporating the subdirectory into the uniqueness check.

It is worth thinking through the risk of cyclical dependencies, which are still an issue (I believe). As described there, it might be possible to do something nice around cycle detection / gracefully handle a recursion error.

@gem7318
Copy link
Contributor

gem7318 commented Jul 9, 2023

@gem7318 Nice to see you again, and so soon :)

Coming back to this proposal from above:

I open a PR to modify the logic in the code I spotted. Goal would be to uniquely identify packages in private repos using the combination of git, subdirectory, and revision. This would be my first adventure into the dbt-core codebase, so if you don't think this is appropriate for a good_first_issue, I will avoid this. As you called out, it seems a dbt user could (unknowingly?) get into some gnarly dependencies if #4975 was not addressed before such a change.

While I think it might be tricky — if you (or someone in this thread) wanted to take a crack at this, I wouldn't be opposed. The starting point would be incorporating the subdirectory into the uniqueness check.

It is worth thinking through the risk of cyclical dependencies, which are still an issue (I believe). As described there, it might be possible to do something nice around cycle detection / gracefully handle a recursion error.

Likewise :) And yes agreed - I'm going on vacation at the end of the week and will spend some flight-time looking at the source code trying getting my head around it.

Off the bat though I agree that cyclical dependencies is the 'gotcha' that make this of non-zero complexity.

@iamtodor
Copy link

+1 from me as well, we ran into this issue

@philippeboyd
Copy link
Contributor

Hi all, I'm simply posting an update here because I've been working on a solution which is working perfectly for our use case which is the same as @gem7318 and @amirbtb.

We also have a mono repo with multiple DBT packages inside them and we needed to have a way to be able to import multiple git packages within the same "main package"

packages:
  - package: "calogica/dbt_date"
    version: [ ">=0.5.4", "<0.8.0" ]
  - package: "dbt-labs/dbt_utils"
    version: [ ">=1.0.0", "<2.0.0" ]
  - git: "git@github.com:acme/git-repo.git"
    subdirectory: "dbt-source-pkg-1"
    revision: "dbt-source-pkg-1-v0.1.12"
  - git: "git@github.com:acme/git-repo.git"
    subdirectory: "dbt-source-pkg-2"
    revision: "dbt-source-pkg-2-v0.1.2"

The following will import correctly

❯ dbt deps
14:49:31  Running with dbt=1.5.1
14:49:41  Installing calogica/dbt_date
14:49:41  Installed from version 0.7.2
14:49:41  Up to date!
14:49:41  Installing dbt-labs/dbt_utils
14:49:42  Installed from version 1.1.1
14:49:42  Up to date!
14:49:42  Installing git@github.com:acme/git-repo.git/dbt-source-pkg-1
14:49:43  Installed from revision dbt-source-pkg-1-v0.1.12
14:49:43  and subdirectory dbt-source-pkg-1
14:49:43  Installing git@github.com:acme/git-repo.git/dbt-source-pkg-2
14:49:44  Installed from revision dbt-source-pkg-2-v0.1.2
14:49:44  and subdirectory dbt-source-pkg-2

I'll have time to create the PR tonight with updated unit tests test/unit/test_deps.py

@misteliy
Copy link

misteliy commented Sep 14, 2023

any update on this? This is an important feature for bigger projects like ours.

@dbeatty10
Copy link
Contributor

@misteliy this PR is being reviewed: #8322

@peterallenwebb
Copy link
Contributor

@philippeboyd Thanks for your implementation of this long-desired feature!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deps dbt's package manager enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors paper_cut A small change that impacts lots of users in their day-to-day
Projects
None yet
Development

Successfully merging a pull request may close this issue.