-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spike] Model versions parsing #7204
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
def _repack_args(self, name: str, package: Optional[str], version: Optional[str]) -> List[str]: | ||
repacked_args = [name] if package is None else [package, name] | ||
if version: | ||
repacked_args.append(f"version:{version}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opted to repack the version
arg as a formatted string to keep the structure of model.refs
as List[List[str]]
(result of this method is appended to model.refs here.
Leads to some not-so-fun string parsing when refs are processed.
Could also add it as a dictionary, which would hold all kwargs (if we ever add more) going forward. This would be a breaking change in the manifest (model.refs becomes List[List[Union[str, Dict[str,str]]]]
, but would be more straightforward (somewhat - will need to do some type checking) for consumers to parse when processing refs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first reaction is to go with the dictionary. I think consumers are going to have to either do the same not-so-fun string parsing or checking for a dictionary anyway. I'm not sure how often external consumers would be processing the refs anyway; I think mostly they rely on us providing the references in the depends_on. @jtcohen6 might have a better idea about that than me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with the dictionary. IMO:
- consumers of the manifest should use
depends_on.nodes
, rather thanrefs
, in almost all cases - the list-of-lists always felt a bit janky
This may require some changes in some of the weirder packages out there, but in a way that should make their lives easier, ultimately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good to go with a dictionary as well!
What I'd originally thought up in terms of representation in the manifest.json for model.refs
was a List[List[Union[str, Dict[str,str]]]]
. For example,
"refs": ["package_name", "model_name", {"version":"3", ...}]
But what might be best is actually just a List[Dict[str, str]]
. For example,
"refs": {
"package": "package_name", # what should we actually call this key..?
"model": "model_name",
"version": "version_name"
}
That would enable us to simplify how we parse ref
here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what should we actually call this key..?
We're pretty fast & loose with the package
/project
distinction. I think this is called package_name
in the manifest currently:
"model.test.dim_customers": {
"resource_type": "model",
"name": "dim_customers",
"package_name": "test",
...
}
And... it will continue to mean either/both things, since we'll support resolving ref
erences to a model in another package
(status quo) and another project
(via cross-project ref
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for List[Dict[str, str]] too. It's always a bit messy to have two different types. We're not allowing versioned sources, right? Sources currently List[List[str]] but always have two strings in the list, whereas refs only have one. Eventually it might be nice to switch all of those things to dictionaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not allowing versioned sources, right?
right - agree it'd be nice to switch them all to dictionaries. I can make an issue for sources, perhaps we can tackle it sa follow-on tech debt work.
|
||
# refables are actually unique, so the Dict[PackageName, UniqueID] will | ||
# only ever have exactly one value, but doing 3 dict lookups instead of 1 | ||
# is not a big deal at all and retains consistency | ||
def __init__(self, manifest: "Manifest"): | ||
self.storage: Dict[str, Dict[PackageName, UniqueID]] = {} | ||
self.storage: Dict[str, Dict[PackageName, List[UniqueID]]] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref
lookups now need to handle 3 access patterns (package
handling remains unchanged):
- Lookup versioned model by name and version => lookup specified version
- Lookup versioned model by name without version => get the latest version
- Lookup unversioned model by name only => get only 'version'
For example, a manifest with the following nodes:
"model.jaffle_shop.dim_unversioned": {
"name": "dim_unversioned",
"version": null,
"is_latest_version": null
},
"model.jaffle_shop.dim_customers.v1": {
"name": "dim_customers",
"version": "1",
"is_latest_version": false
},
"model.jaffle_shop.dim_customers.v2": {
"name": "dim_customers",
"version": "2",
"is_latest_version": true
}
should find unique_ids for the following refs via RefableLookup.get_unique_id
:
ref("dim_unversioned") => "model.jaffle_shop.dim_unversioned"
ref("dim_customers", version="1") => "model.jaffle_shop.dim_customers.v1"
ref("dim_customers") => "model.jaffle_shop.dim_customers.v2"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered a few options for the storage
structure here:
- Dict[str, Dict[PackageName, List[UniqueID]]]
- "latest" unique_id stored at the front of list
- first element of
UniqueID
list always returned, unless aversion
is specified - in which case iterating through the list is necessary. - implemented in this draft
{
"dim_unversioned": {
"jaffle_shop": ["model.jaffle_shop.dim_unversioned"]
}, "dim_customers": {
"jaffle_shop": ["model.jaffle_shop.dim_customers.v2", "model.jaffle_shop.dim_customers.v1"]
}
}
- Dict[str, Dict[PackageName, Dict[VersionName, UniqueID]]]
- doesn't handle unversioned or "latest" versioned lookups models nicely
{
"dim_unversioned": {
"jaffle_shop": {
"??": "model.jaffle_shop.dim_unversioned"
},
},
"dim_customers": {
"jaffle_shop": {
"2": "model.jaffle_shop.dim_customers.v2",
"1": "model.jaffle_shop.dim_customers.v1"
}
}
}
- Dict[str, Dict[PackageName, UniqueID]]
- doesn't handle "latest" versioned lookups models nicely
{
"dim_unversioned": {
"jaffle_shop": model.jaffle_shop.dim_unversioned"
},
"dim_customers.v1": {
"jaffle_shop": "model.jaffle_shop.dim_customers.v1"
},
"dim_customers.v2": {
"jaffle_shop": "model.jaffle_shop.dim_customers.v2"
}
}
Open to feedback and suggestions here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the third option, we could store an additional key:value pair for each versioned model that maps the model name to its latest unique id. For example,
{
"dim_unversioned": {
"jaffle_shop": model.jaffle_shop.dim_unversioned"
},
"dim_customers.v1": {
"jaffle_shop": "model.jaffle_shop.dim_customers.v1"
},
"dim_customers.v2": {
"jaffle_shop": "model.jaffle_shop.dim_customers.v2"
},
"dim_customers": {
"jaffle_shop": "model.jaffle_shop.dim_customers.v2"
}
}
This has the benefits of:
- keeping the storage data structure + lookup logic consistent across different lookup implementations
- mapping closely with the access patterns => easier to reason about + constant-time lookup for all access patterns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that the ref_lookup code is going to have to handle external references eventually too, right? Not yet sure how that would impact these design choices.
I am a bit wary of having "latest" only implied by the order of versions in the first option. That feels a bit fragile to me. It doesn't feel like checking the versions would be that much overhead. And it's always going to be numerical order, so the highest integer is always latest, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit wary of having "latest" only implied by the order of versions in the first option. That feels a bit fragile to me.
Same here. I'm leaning to the structure proposed above for those reasons.
And it's always going to be numerical order, so the highest integer is always latest, right?
This is something I haven't put in much polish on for the spike yet but - but we'll be accepting strings for the version identifier, as well as a model-level latest_version
. If the latest_version is provided - then the latest is that value, regardless of what the highest version is (to support the concept of prereleases).
If latest_version is not provided, dbt will compute a default latest version by attempting to order the version identifiers numerically if possible (casting to int/float), and fallback to alphanumeric ordering. This is to support a variety of versioning conventions out of the box, even if our recommendation is to use major versioning only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think the latest proposed structure is appealing.
access=versioned_model_node.access, | ||
version=unparsed_version.name, | ||
is_latest_version=latest_version == unparsed_version.name, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some of these fields need a closer inspection - but the gist of it is that the versioned model node should inherit properties and configs from the base node patch (where the versions
block is configured), extending or overriding them as appropriate.
is_latest_version=latest_version == unparsed_version.name, | ||
) | ||
versioned_model_node.patch(versioned_model_patch) | ||
self.patch_node_config(versioned_model_node, versioned_model_patch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alias gets updated here
versioned_model_node.patch(versioned_model_patch) | ||
self.patch_node_config(versioned_model_node, versioned_model_patch) | ||
self.manifest.rebuild_ref_lookup() # TODO: is this necessary? | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't handle parsing any tests
defined on versions within the versions
block yet
config=unparsed_version.config, | ||
access=versioned_model_node.access, | ||
version=unparsed_version.name, | ||
is_latest_version=latest_version == unparsed_version.name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opted to have an is_latest_version: bool
attribute on the ModelNode
instead of the raw latest_version: str
attribute from the 'canonical' node patch as ref lookups need to know whether a given node is the latest version (logic here)
Could alternatively keep latest_version: str
on the ModelNode
and add an is_latest_version
property if we'd like to keep the ModelNode
mapping more closely to the patch.
|
||
# patch versioned nodes | ||
versions = block.target.versions | ||
if versions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the meat and potatoes of the change. Currently it's in NonSourceParser
, which is subclassed by TestablePatchParser
for parsing models, seeds, and snapshots. Given the increasing number of model-specific patching that's emerged as part of the 'Models as APIs' work (access, constraints, now versions) - I'm thinking this could be split out into a new subclass for parsing models patches independently of seeds and snapshots. Perhaps simply ModelPatchParser
. What do you think @gshank, cc @peterallenwebb?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never been enthusiastic about having all that code in common. I understand why it was that way to start with, since originally there were no differences, but it provides friction when we need and want to handle differences. I'd be in favor of having them all subclasses, even if some of them were almost duplicates to start with.
) | ||
versioned_model_node.patch(versioned_model_patch) | ||
self.patch_node_config(versioned_model_node, versioned_model_patch) | ||
self.manifest.rebuild_ref_lookup() # TODO: is this necessary? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't think you would need to rebuild the whole thing. There is a self.manifest.ref_lookup.add_node(...) method if you need to add the versioned nodes. It's not clear to me whether that would be necessary or not...
Got answers to how we'll implement |
resolves #6866
Description
Implements most of the spec for versions described from the discussion: #6736 (comment)
Example spec: