
concurrency: make it work #345

Merged
merged 17 commits from fix/concurrency into development on Mar 24, 2017
Conversation

@cmcarthur (Member) commented Mar 21, 2017:

Changes:

  • Use imap_unordered to return results as they are ready, and remove the manual batching done before the run starts. (dbt test shows great results with this approach because there's only one run level; see the sketch after this list.)
  • Use a unique transaction per model. The new integration test proves that this works.
  • Implement connection pooling. This works effectively the same way as psycopg2.pool.ThreadedConnectionPool.
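For illustration, a minimal sketch of the imap_unordered pattern described above (not dbt's actual runner code; run_model, run_level, and the choice of a thread-backed pool are illustrative assumptions):

    import time
    from multiprocessing.dummy import Pool  # thread-backed worker pool

    def run_model(model_name):
        # stand-in for compiling and executing one model
        time.sleep(0.1)
        return model_name, 'PASS'

    def run_level(models, num_threads):
        pool = Pool(num_threads)
        try:
            # imap_unordered yields each result as soon as its worker
            # finishes, so one slow model no longer holds up a whole batch
            for name, status in pool.imap_unordered(run_model, models):
                print(name, status)
        finally:
            pool.close()
            pool.join()

    run_level(['model_a', 'model_b', 'model_c'], num_threads=2)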

@drewbanin (Contributor) left a comment:

this is really incredible! Some questions:

  • have you done any work to quantify the dbt run speedup this effects?
  • did you validate that this properly skips dependent models on failure?
  • which data warehouses have you tested this with?

A couple of comments, mostly for my edification. Excited to get this merged in!

    if connections_in_use.get(name):
        return connections_in_use.get(name)

    if recache_if_missing is False:
@drewbanin (Contributor) commented:

i think it's more pythonic to do if not recache_if_missing here
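For context, a reconstructed sketch of the cache-lookup pattern with the suggested guard (the surrounding function and open_new_connection are hypothetical, not dbt's exact code):

    connections_in_use = {}

    def open_new_connection(name):
        # hypothetical stand-in for opening a real warehouse connection
        return {'name': name}

    def get_connection(name, recache_if_missing=True):
        if connections_in_use.get(name):
            return connections_in_use.get(name)

        # the suggested, more pythonic guard
        if not recache_if_missing:
            raise RuntimeError('no connection named "{}" in the cache and '
                               'recache_if_missing is off'.format(name))

        connections_in_use[name] = open_new_connection(name)
        return connections_in_use[name]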

import multiprocessing

from dbt.schema import Column


lock = multiprocessing.Lock()
@drewbanin (Contributor) commented:

these will be singletons -- is that ok?

@cmcarthur (Member, Author) replied:

yeah, I think that's what we want right now. does that seem right to you?

one other thing here. I thought a little bit about how we could support connections of multiple types in a single run -- if we were to do that, we'd need to change this and the cache to be unique-per-connection-type.

@drewbanin (Contributor) replied:

hah i wouldn't worry about multiple connection types for now. That sounds right to me, just wanted to confirm because we had some problems with single-imports back in the day
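A hedged sketch of the two shapes discussed here: the module-level singletons the PR uses, and what unique-per-connection-type state could look like (the per-type variant is hypothetical and was not implemented in this PR):

    import multiprocessing
    from collections import defaultdict

    # as merged: one lock and one connection cache shared process-wide
    lock = multiprocessing.Lock()
    connections_in_use = {}

    # hypothetical future shape: one lock and one cache per connection
    # type, so e.g. Snowflake and Postgres connections could coexist
    locks_by_type = defaultdict(multiprocessing.Lock)
    connections_by_type = defaultdict(dict)

    def get_lock(connection_type):
        return locks_by_type[connection_type]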

# we add a magic number, 2, because there are two overhead connections:
# one for pre- and post-run hooks and other misc operations that occur
# before the run starts, and one for integration tests.
max_connections = profile.get('threads', 1) + 2
@drewbanin (Contributor) commented Mar 24, 2017:

this is kind of funky, what's going on here?

@cmcarthur (Member, Author) replied:

this connection pool doesn't do any sophisticated retry logic to wait for a connection to become available. it just has a fixed number of connections: if you try to acquire a connection while they are all already in use, you get an exception.

in addition to one thread per model, we need some overhead connections. one is 'master', which is used for pre- and post-run hooks, getting the list of existing tables before model runs, creating the schema, etc.

the other is for testing, which is kind of dumb.

this code doesn't exactly cap the number of connections to the number of threads, but it does make sure that connections don't grow in an unbounded fashion.
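A minimal sketch of the pooling behavior described above: a fixed ceiling of threads + 2 connections, and an immediate exception rather than a wait or retry when the pool is exhausted. The class and names are illustrative; psycopg2.pool.ThreadedConnectionPool provides the same semantics:

    import threading

    class SimpleConnectionPool:
        def __init__(self, max_connections, connect):
            self._lock = threading.Lock()
            self._max = max_connections
            self._connect = connect   # callable that opens one connection
            self._free = []
            self._in_use = 0

        def getconn(self):
            with self._lock:
                if self._free:
                    conn = self._free.pop()
                elif self._in_use < self._max:
                    conn = self._connect()
                else:
                    # no retry logic: everything is already checked out
                    raise RuntimeError('connection pool exhausted')
                self._in_use += 1
                return conn

        def putconn(self, conn):
            with self._lock:
                self._in_use -= 1
                self._free.append(conn)

    # threads from the profile, plus the two overhead connections
    # ('master' for hooks and setup, one for integration tests)
    profile = {'threads': 4}
    max_connections = profile.get('threads', 1) + 2
    pool = SimpleConnectionPool(max_connections, connect=object)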

@drewbanin (Contributor) replied:

got it, thanks

error = ("Internal error executing {filepath}\n\n{error}"
"\n\nThis is an error in dbt. Please try again. If "
"the error persists, open an issue at "
"https://github.com/fishtown-analytics/dbt").format(
@drewbanin (Contributor) commented:

👏

dbt/runner.py (Outdated)

            error=str(e).strip())
        status = "ERROR"
        if type(e) == psycopg2.InternalError and \
                ABORTED_TRANSACTION_STRING == e.diag.message_primary:
@drewbanin (Contributor) commented Mar 24, 2017:

i think we'll see this a lot less often now -- is it still worth including?

@cmcarthur (Member, Author) replied:

great point... let me think on this a little bit

@drewbanin (Contributor) replied:

i think we can leave it in for now. but if overall error handling has improved, i don't think this particular error message is incredibly helpful (since each model gets its own transaction)
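For reference, a sketch of the check in the diff above: psycopg2 raises InternalError once a transaction has been aborted, and the handler matches on diag.message_primary. The string constant's value below is an assumption based on the standard Postgres error message:

    import psycopg2

    # assumed value of the constant referenced in the diff
    ABORTED_TRANSACTION_STRING = ('current transaction is aborted, commands '
                                  'ignored until end of transaction block')

    def is_aborted_transaction_error(e):
        return (type(e) == psycopg2.InternalError and
                e.diag.message_primary == ABORTED_TRANSACTION_STRING)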

@cmcarthur (Member, Author) commented:

> have you done any work to quantify the dbt run speedup this effects?

I haven't. It's difficult to do with my project because we have an XS Snowflake warehouse, and performance has been noticeably different with multiple simultaneous transactions. I asked @jthandy to do some benchmarking on it with a Redshift project once it's in dev.

> did you validate that this properly skips dependent models on failure?

Yeah, but I guess we should have an integration test for this. I added one for the inverse, i.e. if a model fails, other models in the same run level don't fail.

> which data warehouses have you tested this with?

Snowflake & Postgres

@cmcarthur merged commit d35249c into development on Mar 24, 2017
@cmcarthur deleted the fix/concurrency branch on Mar 24, 2017 at 14:40