Feature/cache relations (#911) #1025
Conversation
@beckjake i'm still working on reviewing this but i'm just going to post my comments to start the conversation. i want to take some extra time to understand what's happening in the cache class, and also i want to think a little more deeply about how the relations cache / get relations relates to the catalog, especially in how we implement case-insensitive schema/table comparison logic in multiple places now. but i don't want to block for another day or two on me thinking about that.
dbt/adapters/bigquery/impl.py
Outdated
schema, identifier, relations_list,
model_name)
table = self.get_bq_table(schema, identifier)
return self.bq_table_to_relation(table)
I like this approach. Can you make `get_bq_table` and `bq_table_to_relation` clearly part of the private API of this adapter?
node.schema.lower()
for node in manifest.nodes.values()
})
schemas = frozenset(s.lower() for s in manifest.get_used_schemas())
nice
dbt/adapters/bigquery/impl.py
Outdated
def drop_relation(self, relation, model_name=None):
    self.cache.drop(schema=relation.schema, identifier=relation.identifier)
you want `if dbt.flags.USE_CACHE` around this, no?
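The suggested guard might look like this minimal, self-contained sketch (`flags`, `FakeCache`, and `Adapter` are stand-ins for dbt's real objects, not its actual implementation):

```python
# Sketch: only write to the cache when caching is enabled.
class flags:
    USE_CACHE = True

class FakeCache:
    def __init__(self):
        self.relations = {('analytics', 'my_table'): 'relation-object'}

    def drop(self, schema, identifier):
        self.relations.pop((schema, identifier), None)

class Adapter:
    def __init__(self):
        self.cache = FakeCache()

    def drop_relation(self, schema, identifier):
        # guard the cache write behind the flag
        if flags.USE_CACHE:
            self.cache.drop(schema=schema, identifier=identifier)
        # ...then actually drop the relation in the warehouse...

adapter = Adapter()
adapter.drop_relation('analytics', 'my_table')
print(len(adapter.cache.relations))  # → 0
```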
def get_relation(self, schema, identifier, model_name=None):
    relations_list = self.list_relations(schema, model_name)

    matches = self._make_match(relations_list, schema, identifier)
I had imagined this would go directly to the cache, skipping `list_relations` since we are planning to deprecate that. I guess this is functionally the same, but is also a little confusing. I think it'd be cleaner to hit the cache here directly.
In the case where USE_CACHE is false, or the schema is not in the cache, that won't work. We can look it up in the cache first and fall back to list_relations, if you prefer that, but going through list_relations is unavoidable to some degree.
Ah, yeah, you are right. Thanks.
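The trade-off in this exchange can be sketched as a cache-first lookup with a `list_relations`-style fallback. Everything below is illustrative (hypothetical names and a stand-in database call), not dbt's actual implementation:

```python
# Try the cache first; fall back to listing relations from the database
# when caching is off or the schema hasn't been cached.
USE_CACHE = True

class RelationCache:
    def __init__(self):
        self.schemas = set()    # schemas the cache has been populated for
        self.relations = {}     # (schema, identifier) -> relation

    def get(self, schema, identifier):
        return self.relations.get((schema, identifier))

def list_relations_from_db(schema):
    # stand-in for a real warehouse query
    return [('analytics', 'events')]

def get_relation(cache, schema, identifier):
    if USE_CACHE and schema in cache.schemas:
        return cache.get(schema, identifier)
    # fall back: going through the database is unavoidable here
    for found in list_relations_from_db(schema):
        if found == (schema, identifier):
            return found
    return None

cache = RelationCache()
print(get_relation(cache, 'analytics', 'events'))  # → ('analytics', 'events')
```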
def _link_cached_relations(self, manifest, schemas):
    # now set up any links
    try:
        table = self.run_operation(manifest, GET_RELATIONS_OPERATION_NAME)
this is surprisingly easy. love it.
dbt/adapters/postgres/impl.py
Outdated
try:
    table = self.run_operation(manifest, GET_RELATIONS_OPERATION_NAME)
    # avoid a rollback when releasing the connection
    self.commit_if_has_connection(GET_RELATIONS_OPERATION_NAME)
a rollback should work here, and i think it's less error-prone? just in case someone customizes the get_relations_data operation to actually modify the warehouse
good catch, this is from when I was trying to have the cache handle rollbacks (a bad idea!) and I wanted to minimize the number of them.
dbt/adapters/postgres/impl.py
Outdated
table = self._relations_filter_table(table, schemas)

for (refed_schema, refed_name, dep_schema, dep_name) in table:
    self.cache.add_link(dep_schema, dep_name, refed_schema, refed_name)
note to myself to come back here and look at this again
'--log-cache-events',
action='store_true',
help=argparse.SUPPRESS,
)
i wonder if this would be better implemented as a `--trace` flag or something. i'm sure there are other places where we'd like to log a lot more info, e.g. connection pool management
Or maybe some sort of `--log=dbt.cache`, where `log` is a repeatable argument that enables log propagation for the given package? Usually when I want more granular logging, one thing I don't want is more granular logging everywhere.
Hmmm, maybe @drewbanin would have something more useful to say here. My specific concern is having a proliferating set of flags related to logging specific event types. Do you have a use case in mind for `--log=dbt.cache`? Is that for ease of development?
Yeah. This logging is really only useful for debugging narrow cache-related issues. I think `--log=dbt.cache` and stuff like it would probably have to be a whole new PR; the way we currently set up logging doesn't really play nice with that structure.
Yeah, the cache logging here is way too verbose to be useful in the default case. I like the idea of turning on logs per-module, but that seems more useful for developers of dbt than users of dbt itself. We could also make it a `config` in `profiles.yml` I suppose? Like:

```yaml
config:
  logging:
    modules: ['dbt.cache', 'dbt.whatever']
    ...
```
I imagine there's other logging things to configure too. Regardless, I don't know that we need to implement it in this PR.

I agree that `--log-cache-events` feels weird, but since it's set to "suppressed", I feel great about removing it in the future if we do something more comprehensive around logging.
Yeah. I don't think anyone should ever be passing `--log-cache-events` in production, unless maybe we ask someone to do so as part of tracking down a cache consistency issue. It's nice to have it for integration tests though; I've already tracked down an intermittent cache bug thanks to the extra output.
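For context, the per-module idea being discussed maps naturally onto Python's hierarchical stdlib loggers, which is roughly what a `--log=dbt.cache` flag would toggle. This is a sketch of the mechanism, not dbt's actual logging setup:

```python
import logging

# keep the root logger at INFO so only opted-in modules get DEBUG
logging.getLogger().setLevel(logging.INFO)

def enable_module_logging(module_names):
    # dotted logger names ('dbt.cache') inherit from 'dbt' and the root,
    # so setting DEBUG on one name scopes the extra verbosity to it
    for name in module_names:
        logging.getLogger(name).setLevel(logging.DEBUG)

enable_module_logging(['dbt.cache'])

print(logging.getLogger('dbt.cache').isEnabledFor(logging.DEBUG))        # → True
print(logging.getLogger('dbt.connections').isEnabledFor(logging.DEBUG))  # → False
```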
self.schema = schema
self.identifier = identifier
self.referenced_by = {}
self.inner = inner
is there a situation where `self.schema` would not be equivalent to `self.inner.schema`, and `self.identifier` would not be equivalent to `self.inner.identifier`? seems like they are redundant
dbt/adapters/cache.py
Outdated
schema=schema,
identifier=identifier,
inner=inner
)
related to the previous comment -- couldn't the api here be simply `def add(self, relation)`?
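The proposed API could look roughly like this: the cached wrapper derives `schema` and `identifier` from its `inner` relation, so `add` needs only the relation. These are hypothetical names mirroring the diff, not the final implementation:

```python
# Sketch: derive schema/identifier from the wrapped relation instead of
# storing redundant copies, so the cache's add() takes just a relation.
class Relation:
    def __init__(self, schema, identifier):
        self.schema = schema
        self.identifier = identifier

class CachedRelation:
    def __init__(self, inner):
        self.inner = inner
        self.referenced_by = {}

    @property
    def schema(self):
        return self.inner.schema

    @property
    def identifier(self):
        return self.inner.identifier

class RelationsCache:
    def __init__(self):
        self.relations = {}

    def add(self, relation):
        cached = CachedRelation(inner=relation)
        self.relations[(cached.schema, cached.identifier)] = cached

cache = RelationsCache()
cache.add(Relation('analytics', 'events'))
print(list(cache.relations))  # → [('analytics', 'events')]
```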
relation = self.relations.pop(old_key)

# change the old_relation's name and schema to the new relation's
relation.rename(new_key)
i think you want to grab the re-entrant lock around lines 377-380 to make the rename appear atomic across threads
WAIT i am confusing `cache.rename()` with `CachedRelation.rename()`. this is perfect actually, you're already locking around this whole fn.
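The locking concern can be illustrated with a small sketch: the pop and re-insert must happen under one lock so no other thread observes the relation as missing mid-rename. Names here are illustrative, not the real cache class:

```python
import threading

class Cache:
    def __init__(self):
        # re-entrant lock: locked cache methods may safely call each other
        self.lock = threading.RLock()
        self.relations = {('analytics', 'old_name'): 'relation-object'}

    def rename(self, old_key, new_key):
        # holding the lock across both steps makes the rename atomic
        with self.lock:
            relation = self.relations.pop(old_key)
            self.relations[new_key] = relation

cache = Cache()
cache.rename(('analytics', 'old_name'), ('analytics', 'new_name'))
print(('analytics', 'new_name') in cache.relations)  # → True
```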
"""Clear the cache""" | ||
with self.lock: | ||
self.relations.clear() | ||
self.schemas.clear() |
in general, I think you did a nice job with this module. asks:
- a lot of the comments repeat what is in the docstrings. can you give this a once over and clean some of those up?
- these classes contain public, private, and unit-test-only APIs. as a general rule, I don't like providing functions for unit tests only as I think it complicates the APIs. but the functions here used for unit tests only are marked as such so idc so much. can you just give this a once over and make sure that the public and private APIs are clearly designated as such?
- it seems like there is a LOT of redundant logging that was probably useful during development, but should perhaps be removed now. can you give these cache logger debug calls a once over?
- finally, these APIs work with both (schema, identifier) pairs and relations. i'd prefer to use relations where possible. the best example here that i can see is `rename()` -- we specifically switched `adapter.rename` to use relations instead of (schema, identifier) pairs, so it seems undesirable to me to have the cache class use `old_schema, old_identifier, new_schema, new_identifier` as arguments
# put ourselves in the cache using the 'lazycache' method
linecache.cache[filename] = (lambda: source,)

return super(MacroFuzzEnvironment, self)._compile(source, filename)
i'd like to talk to you about this for 5min. i understand the idea on a basic level but am not familiar with the python internals here
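For reference, the Python internals in question are the stdlib `linecache` module: registering generated source under a synthetic filename lets debuggers and tracebacks display that source. The diff uses the lazy one-tuple form `(lambda: source,)`; this sketch achieves the same effect with an eager four-tuple entry:

```python
import linecache

source = "def generated():\n    return 42\n"
filename = '<generated template>'

# an eager linecache entry: (size, mtime, lines, fullname)
linecache.cache[filename] = (
    len(source), None, source.splitlines(True), filename,
)

# compile/exec against the same synthetic filename, so pdb and
# tracebacks can resolve source lines through linecache
namespace = {}
exec(compile(source, filename, 'exec'), namespace)

print(linecache.getline(filename, 2).strip())  # → return 42
print(namespace['generated']())                # → 42
```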
dbt/adapters/cache.py
Outdated
if self.inner:
    # Relations store this stuff inside their `path` dict. But they
    # also store a table_name, and usually use it in their .render(),
    # so we need to update that as well. It doesn't appear that
`table_name` used to conditionally be `name + '__dbt_tmp'`, but we've since removed that, so now `table_name == identifier` I believe. One thing to watch out for here might be ephemeral models... I need to do more digging to see if they're a relevant concern here, but wanted to surface it.
dbt/adapters/default/impl.py
Outdated
def cache_new_relation(self, relation):
    """Cache a new relation in dbt. It will show up in `list relations`."""
    if relation is None:
        dbt.exceptions.raise_compiler_error()
can you add a small message here?
dbt/adapters/default/impl.py
Outdated
dbt.exceptions.raise_compiler_error()
if dbt.flags.USE_CACHE:
    self.cache.add(
        schema=relation.schema,
I think Connor indicated this above, but i think it would be wise to operate at a higher level of abstraction here. In the future, we're probably going to make it possible to configure the `database` (or `project`) that models get rendered into, and I imagine we won't want to go back and refactor caching when that change occurs.

Would it make sense to just pass in a `relation` here, and somehow make the Relation responsible for reporting its own db/schema/identifier to the cache?
dbt/adapters/default/impl.py
Outdated
@@ -216,6 +246,13 @@ def rename(self, schema, from_name, to_name, model_name=None):

def rename_relation(self, from_relation, to_relation,
                    model_name=None):
    if dbt.flags.USE_CACHE:
        self.cache.rename(
same as above, I think this should just operate on relations, not their schemas/identifiers
)
return False
else:
    return True
Is this method only intended to report whether a `schema` is cached? From the signature and how it's used, I imagined that the `else` branch would check if `model_name` is present in the cache.
The cache only tracks 'in' status at the schema level; it's impossible to know whether an entry is unknown to the cache or actually does not exist. The `model_name` bit is just for logging.

I'll rename the method to try to communicate that better.
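In other words, membership is tracked per schema, along the lines of this sketch (the class and method names here are illustrative, not the renamed dbt method):

```python
# Sketch: the cache knows which schemas it has populated, so a miss can't
# distinguish "not cached" from "does not exist" at the relation level.
class RelationsCache:
    def __init__(self):
        self.schemas = set()

    def update_schemas(self, schemas):
        self.schemas.update(schemas)

    def schema_is_cached(self, schema, model_name=None):
        # model_name would only be used for logging
        return schema in self.schemas

cache = RelationsCache()
cache.update_schemas(['analytics'])
print(cache.schema_is_cached('analytics'))          # → True
print(cache.schema_is_cached('some_other_schema'))  # → False
```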
inner=relation
)
self._link_cached_relations(manifest, schemas)
# it's possible that there were no relations in some schemas. We want
this is a good catch
referenced_class.schema != dependent_class.schema)
)

select
i'm not sure if this is significant, but is there any chance that these relationships can be duplicated? Like if you join a table to itself in a view:

```sql
-- models/downstream_view.sql
select t1.name, t2.name
from some_table t1
join some_table t2 on t1.parent_id = t2.child_id
```

Does that create two different entries in the internal relationships table between `downstream_view` and `some_table`? Regardless, might be worth distinct-ing the results here?
nvm, you're grouping by all four so that will distinct the records
this is approved contingent on:
I would really like to spend more time testing this before we merge! Are other things blocking on this PR? I've been meaning to stress test it for a while now, and will hopefully have an opportunity to dig into it in the next day or two.

It looks to me like caching fails hard when views select from tables in schemas that aren't operated on by dbt. A view model like:

will succeed the first time around, and then the second run of dbt will fail with:

When views reference relations defined outside of any dbt schemas, I think caching should just ignore them, right? In any case, I'd like to run our Internal Analytics project against this branch too, but there's an unrelated bug that's preventing me from doing that. I'll open a separate issue.
Ooh, nice bug. I think that means we just need to change how `add_link` works so that if the referenced relation's schema is not in the cache, we just skip it. We really only care if we're linking to a table we control.
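The fix described might be sketched as follows, with `add_link` ignoring references whose schema the cache doesn't track (illustrative names, not the actual dbt code):

```python
# Sketch: skip links whose referenced schema is outside any dbt schema,
# so views selecting from external tables don't break caching.
class RelationsCache:
    def __init__(self, schemas):
        self.schemas = set(schemas)
        self.links = []

    def add_link(self, dep_schema, dep_name, ref_schema, ref_name):
        if ref_schema not in self.schemas:
            # the view references something dbt doesn't control: ignore it
            return
        self.links.append((dep_schema, dep_name, ref_schema, ref_name))

cache = RelationsCache({'analytics'})
cache.add_link('analytics', 'my_view', 'external', 'raw_table')
cache.add_link('analytics', 'my_view', 'analytics', 'base_table')
print(len(cache.links))  # → 1
```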
Ok, fixed, and I added a test that exposes the problem (and any similar issues around external references).
Cool! Let's get #1048 merged, then I'll be able to smoke test this with a couple of redshift projects.
Ok. Let's merge this. Once it's in
Fixes #911
Relation Caching!
Perform caching on relations within a single dbt run.
Currently only the `list_relations` and `get_relation` adapter methods read from the cache, and `set_relations_cache`, `drop_relation`, `rename_relation`, and `cache_new_relation` write to the cache.

This PR also removes `list_relations` from the list of wrapped (and therefore supported) adapter methods - it's no longer available in jinja code. This PR makes `list_relations` a bit misleading, as discussed above - it's really only valid for checking the existence of downstream models. See caveats for details.

On a somewhat disappointing note, the tests did not get faster, but on a positive note drew tested it with a large real project and saw improvements (~25%, I think).
new flags/configuration
You can disable using the cache with `--bypass-cache`, a new flag.

You can enable extremely verbose cache logging with `--log-cache-events`, a new hidden flag. Integration tests turn this flag on.

Caveats:
The cache is not a real cache and isn't reliable for everything; in particular, `create` time will still be in the relations list.

Given dbt's guarantees, the cache is valid for all relations downstream from the currently-executing model; however, models that are currently in an error state and upstream models may be incorrect (either in the cache incorrectly or removed from the cache incorrectly).
Bonus
You can now examine the generated template code as you step through templates in `pdb`/`ipdb`. It's not the most intuitive code to step through, but it can help with tracking down issues triggered inside templates.