sanity tests for continuous integration #564
(Note, not schema/data tests.)

We've got dbt set up with CircleCI, and right now pushes to our master branch create a new Docker image and push to production with no tests. This feels... risky. One thing we can do is to run `dbt compile` and check for errors, but it'd be great to get some simple testing of the SQL too. One option would be to compile and run the tables with `EXPLAIN PLAN` instead of `CREATE TABLE` (and not run post-hooks). That should be pretty fast, and SQL syntax (and some semantic) errors would be caught too.

Would this make sense? Or something else?
hey @HarlanH - in the early days of dbt, we had [...]. I like your idea of using `explain plan`. So, this is definitely on our radar!
We're in pretty much the same boat right now - running `dbt compile` in CircleCI doesn't quite check enough, and running the schema/data tests doesn't seem appropriate, as an issue with the data shouldn't block a build. We've had multiple requests from our data science team to be able to do a 'dry run' of dbt while developing, and I'd like to add more stringent checks after the build.
Has there been any movement on this ticket?
hey @chrisdhanson - no movement on this in particular, but we are in the process of reshaping dbt's materializations to support custom workflows like this one. At this point, it's not super clear to me what the desired behavior is here! Can you say a little more about how you'd like this feature to work? At Fishtown Analytics, we're in the habit of using GitHub hooks to do a full run/test of the project on every Pull Request. That's a little different than the workflow described in the comments above, but it works for us. Keen to hear your thoughts!
@drewbanin What kinds of tests are you running? We're pretty early in our implementation, but we'll be building models for an analytics database, so being able to run test queries is going to be important. I can see in the built-in data tests that there are options for testing queries, but they seem to be mainly for internal consistency checks rather than verifying that the queries in the model SQL files are themselves valid. Additionally, to run tests on generated tables/views, you have to generate them first, and if you push changes to production that fail to build, your tests won't be much good. For example, the default behavior for the data tests is to fail if rows are returned, but that means you can't run a test query with `limit 0` if the error condition isn't a SQL exception. There's probably a way to jimmy these tests into working that way, but truthfully I haven't explored it enough yet.
@chrisdhanson We do an entire run of the project into a sandbox schema (like [...]). We use Sinter, and in practice it looks something like this: [screenshot not preserved]. And in GitHub: [screenshot not preserved]. I think the comments above are in regards to how to do this faster, as a full rebuild can take a long time for a mature dbt project. As a disclaimer: Fishtown Analytics makes Sinter. I think it's a great tool that solves a lot of these problems, but there's no reason you couldn't replicate this workflow yourself using whatever tools you prefer. We're just listening for GitHub webhooks and running some dbt commands, for the most part. Hope this helps!
One feature we'd love is the ability to use Snowflake's [...].
We'd love to see the ability to support BigQuery's dry run capability. This way of running a query does the following:

- validates the query (syntax errors and references to missing tables or columns are caught), and
- returns an estimate of the bytes the query would scan, without actually executing it.
At Monzo, we use them extensively for validating that there are no syntax errors in queries. They're free, which is nice! 💰 This would help us with fast CI runs. I've got a small patch locally that seems to work. Obviously, this is entirely BigQuery specific, and horribly breaks other things like incremental tables. The patch touches adapters/bigquery/connections.py:
Since @sjwhitworth's presentation at the dbt meetup, and a few recent "oops, I forgot to test" PRs, I've decided I want to get pretty strong CI. I'd be more than happy to help with implementing Snowflake's [...]. The only caveat I foresee is that when running tests from dbt, doing this on a subsample may upset some of the tests. I haven't looked much into the materialisations of the tests. In Snowflake it seems that [...]. @drewbanin do you have any suggestions so I can get started in the right direction maybe?
hey @bastienboutonnet - as I understand it, @sjwhitworth's approach does two things:

- it validates that a query is syntactically correct, and
- it does so without actually executing the query (BigQuery dry runs are fast and free).
[The comment goes on to quote a post about Snowflake query validation; the link and excerpt were not preserved.]
The whole post is worth reading, but the takeaway for me is that Snowflake doesn't currently provide any mechanisms for validating queries without actually running them. The [...]. One currently viable approach could be to sample in the models that select from source data, e.g. as sketched below.
If you wanted to get really clever, you could make a macro called [...].
@liveandletbri and @danieldicker implemented a [...]. It worked in that it validated the schema, but the execution time was much longer than we anticipated, even when appending that to each table, so we haven't been using it yet. Currently evaluating using [...].
@drewbanin I really, really like your proposal. Indeed, it's great to clarify that a solution comparable to BQ's dry run isn't really available on Snowflake. @kforeman I was thinking of indeed doing something around that. But your point about "the execution time was much longer than we anticipated" maybe suggests the benefits for this are low. Can you elaborate a bit? What were you expecting? What would be the advantage of [...]?
@bastienboutonnet I can speak to the performance issues. We created a macro that could optionally insert a `sample` clause wherever a table is referenced.
The thought was that putting the limitation at the mention of each table, rather than at the end of the query in the form of a `limit`, would cut down the amount of data scanned. The problem is Snowflake still performs a table scan on the table before applying the `sample`. That all being said, the large majority of queries are greatly improved by adding the `sample` clause.
Sorry, should have been clearer. We're considering using [...].
So I found a really interesting thing: when you bring a table into Tableau, it runs a super fast and performant query to learn the columns present in the table. It achieves this with the following: [query not preserved].
This translates into a very [...]. Sticking this onto the source as @drewbanin suggested could be a really powerful and easy way to bring [...].
@bastienboutonnet yeah! This one is interesting -- it's something you can already do in user-space, but it's certainly a workflow that dbt could make easier to implement. I have a couple of ideas here - happy to write them down in a new issue for sure! Maybe something like "More flexible control over selecting from sources", or similar?
@drewbanin Yeah, in user space it's doable, but it would require every query to have some checks, a bit like the incremental macro, right? I was thinking more of a solution around your SQL-wrapping issue (#1096). But happy to make it a different one. Whatever you think is best.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.