Feature/cache relations (#911) #1025
Conversation
@beckjake i'm still working on reviewing this but i'm just going to post my comments to start the conversation. i want to take some extra time to understand what's happening in the cache class, and also i want to think a little more deeply about how the relations cache / get relations relates to the catalog, especially in how we implement case-insensitive schema/table comparison logic in multiple places now. but i don't want to block for another day or two on me thinking about that.
dbt/adapters/bigquery/impl.py
Outdated
schema, identifier, relations_list,
model_name)
table = self.get_bq_table(schema, identifier)
return self.bq_table_to_relation(table)
I like this approach. Can you make `get_bq_table` and `bq_table_to_relation` clearly part of the private API of this adapter?
node.schema.lower()
for node in manifest.nodes.values()
})
schemas = frozenset(s.lower() for s in manifest.get_used_schemas())
nice
dbt/adapters/bigquery/impl.py
Outdated
def drop_relation(self, relation, model_name=None):
    self.cache.drop(schema=relation.schema, identifier=relation.identifier)
you want `if dbt.flags.USE_CACHE` around this, no?
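The suggested guard might look like this minimal, self-contained sketch (`flags`, `FakeCache`, and `Adapter` are stand-ins for dbt's real objects, not its actual implementation):

```python
# Sketch: only write to the cache when caching is enabled.
class flags:
    USE_CACHE = True

class FakeCache:
    def __init__(self):
        self.relations = {('analytics', 'my_table'): 'relation-object'}

    def drop(self, schema, identifier):
        self.relations.pop((schema, identifier), None)

class Adapter:
    def __init__(self):
        self.cache = FakeCache()

    def drop_relation(self, schema, identifier):
        # guard the cache write behind the flag
        if flags.USE_CACHE:
            self.cache.drop(schema=schema, identifier=identifier)
        # ...then actually drop the relation in the warehouse...

adapter = Adapter()
adapter.drop_relation('analytics', 'my_table')
print(len(adapter.cache.relations))  # → 0
```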
def get_relation(self, schema, identifier, model_name=None):
    relations_list = self.list_relations(schema, model_name)

    matches = self._make_match(relations_list, schema, identifier)
I had imagined this would go directly to the cache, skipping `list_relations` since we are planning to deprecate that. I guess this is functionally the same, but is also a little confusing. I think it'd be cleaner to hit the cache here directly.
In the case where USE_CACHE is false, or the schema is not in the cache, that won't work. We can look it up in the cache first and fall back to list_relations, if you prefer that, but going through list_relations is unavoidable to some degree.
Ah, yeah, you are right. Thanks.
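The trade-off in this exchange can be sketched as a cache-first lookup with a `list_relations`-style fallback. Everything below is illustrative (hypothetical names and a stand-in database call), not dbt's actual implementation:

```python
# Try the cache first; fall back to listing relations from the database
# when caching is off or the schema hasn't been cached.
USE_CACHE = True

class RelationCache:
    def __init__(self):
        self.schemas = set()    # schemas the cache has been populated for
        self.relations = {}     # (schema, identifier) -> relation

    def get(self, schema, identifier):
        return self.relations.get((schema, identifier))

def list_relations_from_db(schema):
    # stand-in for a real warehouse query
    return [('analytics', 'events')]

def get_relation(cache, schema, identifier):
    if USE_CACHE and schema in cache.schemas:
        return cache.get(schema, identifier)
    # fall back: going through the database is unavoidable here
    for found in list_relations_from_db(schema):
        if found == (schema, identifier):
            return found
    return None

cache = RelationCache()
print(get_relation(cache, 'analytics', 'events'))  # → ('analytics', 'events')
```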
def _link_cached_relations(self, manifest, schemas):
    # now set up any links
    try:
        table = self.run_operation(manifest, GET_RELATIONS_OPERATION_NAME)
this is surprisingly easy. love it.
dbt/adapters/postgres/impl.py
Outdated
try:
    table = self.run_operation(manifest, GET_RELATIONS_OPERATION_NAME)
    # avoid a rollback when releasing the connection
    self.commit_if_has_connection(GET_RELATIONS_OPERATION_NAME)
a rollback should work here, and i think it's less error-prone? just in case someone customizes the get_relations_data operation to actually modify the warehouse
good catch, this is from when I was trying to have the cache handle rollbacks (a bad idea!) and I wanted to minimize the number of them.
dbt/adapters/postgres/impl.py
Outdated
table = self._relations_filter_table(table, schemas)

for (refed_schema, refed_name, dep_schema, dep_name) in table:
    self.cache.add_link(dep_schema, dep_name, refed_schema, refed_name)
note to myself to come back here and look at this again
'--log-cache-events',
action='store_true',
help=argparse.SUPPRESS,
)
i wonder if this would be better implemented as a `--trace` flag or something. i'm sure there are other places where we'd like to log a lot more info, e.g. connection pool management
Or maybe some sort of `--log=dbt.cache`, where `log` is a repeatable argument that enables log propagation for the given package? Usually when I want more granular logging, one thing I don't want is more granular logging everywhere.
Hmmm, maybe @drewbanin would have something more useful to say here. My specific concern is having a proliferating set of flags related to logging specific event types. Do you have a use case in mind for `--log=dbt.cache`? Is that for ease of development?
Yeah. This logging is really only useful for debugging narrow cache-related issues. I think `--log=dbt.cache` and stuff like it would probably have to be a whole new PR; the way we currently set up logging doesn't really play nice with that structure.
Yeah, the cache logging here is way too verbose to be useful in the default case. I like the idea of turning on logs per-module, but that seems more useful for developers of dbt than users of dbt itself. We could also make it a `config` in `profiles.yml` I suppose? Like:

```yaml
config:
  logging:
    modules: ['dbt.cache', 'dbt.whatever']
    ...
```
I imagine there's other logging things to configure too. Regardless, I don't know that we need to implement it in this PR.

I agree that `--log-cache-events` feels weird, but since it's set to "suppressed", I feel great about removing it in the future if we do something more comprehensive around logging.
Yeah. I don't think anyone should ever be passing `--log-cache-events` in production, unless maybe we ask someone to do so as part of tracking down a cache consistency issue. It's nice to have it for integration tests though; I've already tracked down an intermittent cache bug thanks to the extra output.
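For context, the per-module idea being discussed maps naturally onto Python's hierarchical stdlib loggers, which is roughly what a `--log=dbt.cache` flag would toggle. This is a sketch of the mechanism, not dbt's actual logging setup:

```python
import logging

# keep the root logger at INFO so only opted-in modules get DEBUG
logging.getLogger().setLevel(logging.INFO)

def enable_module_logging(module_names):
    # dotted logger names ('dbt.cache') inherit from 'dbt' and the root,
    # so setting DEBUG on one name scopes the extra verbosity to it
    for name in module_names:
        logging.getLogger(name).setLevel(logging.DEBUG)

enable_module_logging(['dbt.cache'])

print(logging.getLogger('dbt.cache').isEnabledFor(logging.DEBUG))        # → True
print(logging.getLogger('dbt.connections').isEnabledFor(logging.DEBUG))  # → False
```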
self.schema = schema
self.identifier = identifier
self.referenced_by = {}
self.inner = inner
is there a situation where `self.schema` would not be equivalent to `self.inner.schema`, and `self.identifier` would not be equivalent to `self.inner.identifier`? seems like they are redundant
dbt/adapters/cache.py
Outdated
schema=schema,
identifier=identifier,
inner=inner
)
related to the previous comment -- couldn't the api here be simply `def add(self, relation)`?
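The proposed API could look roughly like this: the cached wrapper derives `schema` and `identifier` from its `inner` relation, so `add` needs only the relation. These are hypothetical names mirroring the diff, not the final implementation:

```python
# Sketch: derive schema/identifier from the wrapped relation instead of
# storing redundant copies, so the cache's add() takes just a relation.
class Relation:
    def __init__(self, schema, identifier):
        self.schema = schema
        self.identifier = identifier

class CachedRelation:
    def __init__(self, inner):
        self.inner = inner
        self.referenced_by = {}

    @property
    def schema(self):
        return self.inner.schema

    @property
    def identifier(self):
        return self.inner.identifier

class RelationsCache:
    def __init__(self):
        self.relations = {}

    def add(self, relation):
        cached = CachedRelation(inner=relation)
        self.relations[(cached.schema, cached.identifier)] = cached

cache = RelationsCache()
cache.add(Relation('analytics', 'events'))
print(list(cache.relations))  # → [('analytics', 'events')]
```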
relation = self.relations.pop(old_key)

# change the old_relation's name and schema to the new relation's
relation.rename(new_key)
i think you want to grab the re-entrant lock around lines 377-380 to make the rename appear atomic across threads
WAIT i am confusing `cache.rename()` with `CachedRelation.rename()`. this is perfect actually, you're already locking around this whole fn.
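The locking concern can be illustrated with a small sketch: the pop and re-insert must happen under one lock so no other thread observes the relation as missing mid-rename. Names here are illustrative, not the real cache class:

```python
import threading

class Cache:
    def __init__(self):
        # re-entrant lock: locked cache methods may safely call each other
        self.lock = threading.RLock()
        self.relations = {('analytics', 'old_name'): 'relation-object'}

    def rename(self, old_key, new_key):
        # holding the lock across both steps makes the rename atomic
        with self.lock:
            relation = self.relations.pop(old_key)
            self.relations[new_key] = relation

cache = Cache()
cache.rename(('analytics', 'old_name'), ('analytics', 'new_name'))
print(('analytics', 'new_name') in cache.relations)  # → True
```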
"""Clear the cache""" | ||
with self.lock: | ||
self.relations.clear() | ||
self.schemas.clear() |
in general, I think you did a nice job with this module. asks:
- a lot of the comments repeat what is in the docstrings. can you give this a once over and clean some of those up?
- these classes contain public, private, and unit-test-only APIs. as a general rule, I don't like providing functions for unit tests only as I think it complicates the APIs. but the functions here used for unit tests only are marked as such so idc so much. can you just give this a once over and make sure that the public and private APIs are clearly designated as such?
- it seems like there is a LOT of redundant logging that was probably useful during development, but should perhaps be removed now. can you give these cache logger debug calls a once over?
- finally, these APIs work with both (schema, identifier) pairs and relations. i'd prefer to use relations where possible. the best example here that i can see is `rename()` -- we specifically switched `adapter.rename` to use relations instead of (schema, identifier) pairs, so it seems undesirable to me to have the cache class use `old_schema, old_identifier, new_schema, new_identifier` as arguments
# put ourselves in the cache using the 'lazycache' method
linecache.cache[filename] = (lambda: source,)

return super(MacroFuzzEnvironment, self)._compile(source, filename)
i'd like to talk to you about this for 5min. i understand the idea on a basic level but am not familiar with the python internals here
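For reference, the Python internals in question are the stdlib `linecache` module: registering generated source under a synthetic filename lets debuggers and tracebacks display that source. The diff uses the lazy one-tuple form `(lambda: source,)`; this sketch achieves the same effect with an eager four-tuple entry:

```python
import linecache

source = "def generated():\n    return 42\n"
filename = '<generated template>'

# an eager linecache entry: (size, mtime, lines, fullname)
linecache.cache[filename] = (
    len(source), None, source.splitlines(True), filename,
)

# compile/exec against the same synthetic filename, so pdb and
# tracebacks can resolve source lines through linecache
namespace = {}
exec(compile(source, filename, 'exec'), namespace)

print(linecache.getline(filename, 2).strip())  # → return 42
print(namespace['generated']())                # → 42
```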
dbt/adapters/cache.py
Outdated
if self.inner:
    # Relations store this stuff inside their `path` dict. But they
    # also store a table_name, and usually use it in their .render(),
    # so we need to update that as well. It doesn't appear that
`table_name` used to conditionally be `name + '__dbt_tmp'`, but we've since removed that, so now `table_name == identifier` I believe. One thing to watch out for here might be ephemeral models... I need to do more digging to see if they're a relevant concern here, but wanted to surface it.
dbt/adapters/default/impl.py
Outdated
def cache_new_relation(self, relation):
    """Cache a new relation in dbt. It will show up in `list relations`."""
    if relation is None:
        dbt.exceptions.raise_compiler_error()
can you add a small message here?
dbt/adapters/default/impl.py
Outdated
dbt.exceptions.raise_compiler_error()
if dbt.flags.USE_CACHE:
    self.cache.add(
        schema=relation.schema,
I think Connor indicated this above, but i think it would be wise to operate at a higher level of abstraction here. In the future, we're probably going to make it possible to configure the `database` (or `project`) that models get rendered into, and I imagine we won't want to go back and refactor caching when that change occurs.

Would it make sense to just pass in a `relation` here, and somehow make the Relation responsible for reporting its own db/schema/identifier to the cache?
dbt/adapters/default/impl.py
Outdated
@@ -216,6 +246,13 @@ def rename(self, schema, from_name, to_name, model_name=None):

def rename_relation(self, from_relation, to_relation,
                    model_name=None):
    if dbt.flags.USE_CACHE:
        self.cache.rename(
same as above, I think this should just operate on relations, not their schemas/identifiers
)
return False
else:
    return True
Is this method only intended to report whether a `schema` is cached? From the signature and how it's used, I imagined that the `else` branch would check if `model_name` is present in the cache.
The cache only tracks 'in' status at the schema level; it's impossible to know whether an entry is unknown to the cache or actually does not exist. The `model_name` bit is just for logging.

I'll rename the method to try to communicate that better.
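In other words, membership is tracked per schema, along the lines of this sketch (the class and method names here are illustrative, not the renamed dbt method):

```python
# Sketch: the cache knows which schemas it has populated, so a miss can't
# distinguish "not cached" from "does not exist" at the relation level.
class RelationsCache:
    def __init__(self):
        self.schemas = set()

    def update_schemas(self, schemas):
        self.schemas.update(schemas)

    def schema_is_cached(self, schema, model_name=None):
        # model_name would only be used for logging
        return schema in self.schemas

cache = RelationsCache()
cache.update_schemas(['analytics'])
print(cache.schema_is_cached('analytics'))          # → True
print(cache.schema_is_cached('some_other_schema'))  # → False
```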
inner=relation
)
self._link_cached_relations(manifest, schemas)
# it's possible that there were no relations in some schemas. We want
this is a good catch
referenced_class.schema != dependent_class.schema)
)

select
i'm not sure if this is significant, but is there any chance that these relationships can be duplicated? Like if you join a table to itself in a view:

```sql
-- models/downstream_view.sql
select t1.name, t2.name
from some_table t1
join some_table t2 on t1.parent_id = t2.child_id
```

Does that create two different entries in the internal relationships table between `downstream_view` and `some_table`? Regardless, might be worth distinct-ing the results here?
nvm, you're grouping by all four so that will distinct the records
this is approved contingent on:
I would really like to spend more time testing this before we merge! Are other things blocking on this PR? I've been meaning to stress test it for a while now, and will hopefully have an opportunity to dig into it in the next day or two.

It looks to me like caching fails hard when views select from tables in schemas that aren't operated on by dbt. A view model like:

will succeed the first time around, and then the second run of dbt will fail with:

When views reference relations defined outside of any dbt schemas, I think caching should just ignore them, right? In any case, I'd like to run our Internal Analytics project against this branch too, but there's an unrelated bug that's preventing me from doing that. I'll open a separate issue.
Ooh, nice bug. I think that means we just need to change how `add_link` works so that if the referenced relation's schema is not in the cache, we just skip it. We really only care if we're linking to a table we control.
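The fix described might be sketched as follows, with `add_link` ignoring references whose schema the cache doesn't track (illustrative names, not the actual dbt code):

```python
# Sketch: skip links whose referenced schema is outside any dbt schema,
# so views selecting from external tables don't break caching.
class RelationsCache:
    def __init__(self, schemas):
        self.schemas = set(schemas)
        self.links = []

    def add_link(self, dep_schema, dep_name, ref_schema, ref_name):
        if ref_schema not in self.schemas:
            # the view references something dbt doesn't control: ignore it
            return
        self.links.append((dep_schema, dep_name, ref_schema, ref_name))

cache = RelationsCache({'analytics'})
cache.add_link('analytics', 'my_view', 'external', 'raw_table')
cache.add_link('analytics', 'my_view', 'analytics', 'base_table')
print(len(cache.links))  # → 1
```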
Ok, fixed, and I added a test that exposes the problem (and any similar issues around external references).
Cool! Let's get #1048 merged, then I'll be able to smoke test this with a couple of redshift projects.
Ok. Let's merge this. Once it's in
Fixes #911
Relation Caching!
Perform caching on relations within a single dbt run.
Currently only the `list_relations` and `get_relation` adapter methods read from the cache, and `set_relations_cache`, `drop_relation`, `rename_relation`, and `cache_new_relation` write to the cache.

This PR also removes `list_relations` from the list of wrapped (and therefore supported) adapter methods - it's no longer available in jinja code. This PR makes `list_relations` a bit misleading, as discussed above - it's really only valid for checking the existence of downstream models. See caveats for details.

On a somewhat disappointing note, the tests did not get faster, but on a positive note drew tested it with a large real project and saw improvements (~25%, I think).
new flags/configuration
You can disable using the cache with `--bypass-cache`, a new flag.

You can enable extremely verbose cache logging with `--log-cache-events`, a new hidden flag. Integration tests turn this flag on.

Caveats:
The cache is not a real cache and isn't reliable for everything; in particular, `create` time will still be in the relations list.

Given dbt's guarantees, the cache is valid for all relations downstream from the currently-executing model; however, models that are currently in an error state and upstream models may be incorrect (either in the cache incorrectly or removed from the cache incorrectly).
Bonus
You can now examine the generated template code as you step through templates in `pdb`/`ipdb`. It's not the most intuitive code to step through, but it can help with tracking down issues triggered inside templates.