
concurrency: make it work #345

Merged
merged 17 commits from fix/concurrency into development on Mar 24, 2017
Conversation

@cmcarthur (Member) commented Mar 21, 2017:

Changes:

  • Use imap_unordered to return results as they are ready, and remove the manual batching done before the run starts. (dbt test shows great results with this approach because there's only one run level; see the sketch after this list.)
  • Use a unique transaction per model. The new integration test proves that this works.
  • Implement connection pooling. This works effectively the same way as psycopg2.pool.ThreadedConnectionPool.
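For illustration, a minimal sketch of the imap_unordered pattern described above (not dbt's actual runner code; run_model, run_level, and the choice of a thread-backed pool are illustrative assumptions):

    import time
    from multiprocessing.dummy import Pool  # thread-backed worker pool

    def run_model(model_name):
        # stand-in for compiling and executing one model
        time.sleep(0.1)
        return model_name, 'PASS'

    def run_level(models, num_threads):
        pool = Pool(num_threads)
        try:
            # imap_unordered yields each result as soon as its worker
            # finishes, so one slow model no longer holds up a whole batch
            for name, status in pool.imap_unordered(run_model, models):
                print(name, status)
        finally:
            pool.close()
            pool.join()

    run_level(['model_a', 'model_b', 'model_c'], num_threads=2)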

@drewbanin (Contributor) left a comment:

this is really incredible! Some questions:

  • have you done any work to quantify the dbt run speedup this effects?
  • did you validate that this properly skips dependent models on failure?
  • which data warehouses have you tested this with?

A couple of comments, mostly for my edification. Excited to get this merged in!

    if connections_in_use.get(name):
        return connections_in_use.get(name)

    if recache_if_missing is False:
@drewbanin (Contributor) commented:

i think it's more pythonic to do if not recache_if_missing here
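For context, a reconstructed sketch of the cache-lookup pattern with the suggested guard (the surrounding function and open_new_connection are hypothetical, not dbt's exact code):

    connections_in_use = {}

    def open_new_connection(name):
        # hypothetical stand-in for opening a real warehouse connection
        return {'name': name}

    def get_connection(name, recache_if_missing=True):
        if connections_in_use.get(name):
            return connections_in_use.get(name)

        # the suggested, more pythonic guard
        if not recache_if_missing:
            raise RuntimeError('no connection named "{}" in the cache and '
                               'recache_if_missing is off'.format(name))

        connections_in_use[name] = open_new_connection(name)
        return connections_in_use[name]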

import multiprocessing

from dbt.schema import Column


lock = multiprocessing.Lock()
@drewbanin (Contributor) commented:

these will be singletons -- is that ok?

@cmcarthur (Member, Author) replied:

yeah, I think that's what we want right now. does that seem right to you?

one other thing here. I thought a little bit about how we could support connections of multiple types in a single run -- if we were to do that, we'd need to change this and the cache to be unique-per-connection-type.

@drewbanin (Contributor) replied:

hah i wouldn't worry about multiple connection types for now. That sounds right to me, just wanted to confirm because we had some problems with single-imports back in the day
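A hedged sketch of the two shapes discussed here: the module-level singletons the PR uses, and what unique-per-connection-type state could look like (the per-type variant is hypothetical and was not implemented in this PR):

    import multiprocessing
    from collections import defaultdict

    # as merged: one lock and one connection cache shared process-wide
    lock = multiprocessing.Lock()
    connections_in_use = {}

    # hypothetical future shape: one lock and one cache per connection
    # type, so e.g. Snowflake and Postgres connections could coexist
    locks_by_type = defaultdict(multiprocessing.Lock)
    connections_by_type = defaultdict(dict)

    def get_lock(connection_type):
        return locks_by_type[connection_type]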

# we add a magic number, 2, because there are two overhead connections:
# one for pre- and post-run hooks and other misc operations that occur
# before the run starts, and one for integration tests.
max_connections = profile.get('threads', 1) + 2
@drewbanin (Contributor) commented Mar 24, 2017:

this is kind of funky, what's going on here?

@cmcarthur (Member, Author) replied:

this connection pool doesn't do any sophisticated retry logic to wait for a connection to become available. it just has a fixed number of connections: if you try to acquire a connection while they are all already in use, you get an exception.

in addition to one thread per model, we need some overhead connections. one is 'master', which is used for pre- and post-run hooks, getting the list of existing tables before model runs, creating the schema, etc.

the other is for testing, which is kind of dumb.

this code doesn't exactly cap the number of connections to the number of threads, but it does make sure that connections don't grow in an unbounded fashion.
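A minimal sketch of the pooling behavior described above: a fixed ceiling of threads + 2 connections, and an immediate exception rather than a wait or retry when the pool is exhausted. The class and names are illustrative; psycopg2.pool.ThreadedConnectionPool provides the same semantics:

    import threading

    class SimpleConnectionPool:
        def __init__(self, max_connections, connect):
            self._lock = threading.Lock()
            self._max = max_connections
            self._connect = connect   # callable that opens one connection
            self._free = []
            self._in_use = 0

        def getconn(self):
            with self._lock:
                if self._free:
                    conn = self._free.pop()
                elif self._in_use < self._max:
                    conn = self._connect()
                else:
                    # no retry logic: everything is already checked out
                    raise RuntimeError('connection pool exhausted')
                self._in_use += 1
                return conn

        def putconn(self, conn):
            with self._lock:
                self._in_use -= 1
                self._free.append(conn)

    # threads from the profile, plus the two overhead connections
    # ('master' for hooks and setup, one for integration tests)
    profile = {'threads': 4}
    max_connections = profile.get('threads', 1) + 2
    pool = SimpleConnectionPool(max_connections, connect=object)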

@drewbanin (Contributor) replied:

got it, thanks

error = ("Internal error executing {filepath}\n\n{error}"
"\n\nThis is an error in dbt. Please try again. If "
"the error persists, open an issue at "
"https://github.com/fishtown-analytics/dbt").format(
@drewbanin (Contributor) commented:

👏

dbt/runner.py (Outdated)

            error=str(e).strip())
        status = "ERROR"
        if type(e) == psycopg2.InternalError and \
                ABORTED_TRANSACTION_STRING == e.diag.message_primary:
@drewbanin (Contributor) commented Mar 24, 2017:

i think we'll see this a lot less often now -- is it still worth including?

@cmcarthur (Member, Author) replied:

great point... let me think on this a little bit

@drewbanin (Contributor) replied:

i think we can leave it in for now. but if overall error handling has improved, i don't think this particular error message is incredibly helpful (since each model gets its own transaction)
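For reference, a sketch of the check in the diff above: psycopg2 raises InternalError once a transaction has been aborted, and the handler matches on diag.message_primary. The string constant's value below is an assumption based on the standard Postgres error message:

    import psycopg2

    # assumed value of the constant referenced in the diff
    ABORTED_TRANSACTION_STRING = ('current transaction is aborted, commands '
                                  'ignored until end of transaction block')

    def is_aborted_transaction_error(e):
        return (type(e) == psycopg2.InternalError and
                e.diag.message_primary == ABORTED_TRANSACTION_STRING)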

@cmcarthur (Member, Author) commented:

> have you done any work to quantify the dbt run speedup this effects?

I haven't. It's difficult to do with my project because we have an XS Snowflake warehouse, and performance has been noticeably different with multiple simultaneous transactions. I asked @jthandy to do some benchmarking on it with a Redshift project once it's in dev.

> did you validate that this properly skips dependent models on failure?

Yeah, but I guess we should have an integration test for this. I added one for the inverse, i.e. if a model fails, other models in the same run level don't fail.

> which data warehouses have you tested this with?

Snowflake & Postgres

@cmcarthur merged commit d35249c into development on Mar 24, 2017
@cmcarthur deleted the fix/concurrency branch on Mar 24, 2017 at 14:40