[CT-2144] [Feature] Allow source freshness to be evaluated from table metadata when adapter supports it #7012
Comments
Slack thread with additional context / motivation
Thanks for opening the issue @adamcunnington-mlg! As a conversation starter, what do you think of something like this:
We need to go through the different platforms and see if we really need a … @jtcohen6, do you know of other areas in dbt where a similar choice is offered? I'm wondering about the ergonomics of the …
@Fleid thanks for this - the suggestion looks great to me. The only thing I was thinking about this morning: when doing the per-adapter implementation, we should be careful to digest the docs and ensure that there are no caveats from the DB providers about the metadata being stale / laggy. That would significantly undermine the checks!
That's a good point. I'd rather we don't support …
@Fleid no specific concern. I've come across this for BQ before, but for attributes like storage. Having said that, a bigger concern I have right now is that last modified doesn't seem to be available in the INFORMATION_SCHEMA.tables table. I'm hoping they've moved it somewhere else (it definitely used to be in the previous version of TABLES), but I can't find it... hopefully your adapter team can. I can raise a support ticket with Google if it helps.
I think the more compelling aspect of this approach is the ability to grab the freshness of multiple sources from a single query. If there's an information_schema view containing a last_modified field for all tables in a database/schema, dbt could determine the freshness of all of them at once.
I like that, Jeremy. A lot. But that will be a slow one. In the meantime, should we let users be clever for us at the table level?

```yaml
version: 2
sources:
  - name: <source_name>
    freshness:
      loaded_at_field: <column_name_or_expression_or_statement>
      loaded_at_location: expression | query
    tables:
      - name: <table_name>
        freshness:
          loaded_at_field: <column_name_or_expression_or_statement>
          loaded_at_location: expression | query
  ...
```

Knowing that … To be honest, I'm not too happy about the …
This feels like a good short-term compromise to me. It could always be labelled as a beta feature and subject to change. Anyone that is using it could wholesale replace it with an updated version supported on a later dbt Core release. What would the next step be?

P.s. for BigQuery, I've raised a ticket with Google Cloud to ask where on earth the metadata about table last updated is, because for the life of me, I cannot find it in any information schema table, which seems nuts. I assume I'm missing something.
In terms of next steps, we're projecting multiple changes to sources for 1.6. I think that would be a great time to include this. The adapters team is still catching up on all the PRs across the repos, plus regressions and bugs. So I don't see us being able to touch this before then, even if we wanted to.
I'm cool with that. Thank you. I'll post an update when Google Cloud support confirms where the table-updated-at metadata is.
The docs are currently outdated, as TABLE_STORAGE is a preview feature. It's pretty irritating that BigQuery has removed a common feature, "table last updated", from the previous information schema view and added it to a new preview table instead! Grr, hopefully this will be GA soon.
@Fleid hi there - please can I check this is still penned in for July? It hurts us more and more every day. We're running 550 queries an hour rather than 1!
@Fleid grateful if you could comment on the above now that 1.5 is shipped? Just looking to manage internal expectations and plan around this. Our source freshness job sometimes takes 10+ minutes now - it should take 10 seconds. Thanks.
Hi @adamcunnington-mlg, we've been slipping on MVs, and I'm not 100% sure we're going to be able to fit the managed sources work into 1.6. I'm exploring alternative options to get this issue done independently. |
@Fleid hi - just wondering if you reached any conclusions? Since first raising this issue, we now average 10 minutes for our source freshness step - and it continues to increase, which scares us! This feels like such a small (relatively) feature with such a big impact.
Hey @adamcunnington-mlg - have you tried adding a freshness filter config to limit the amount of data being processed by the source freshness query?
@graciegoheen thanks for the thought - unfortunately, that won't help in our case, as we wouldn't know what to filter for. The whole point of the source freshness stage is to understand what has changed. Our data integration pipeline is triggered almost arbitrarily - a huge volume of arbitrary schedules as well as ad-hoc action which could affect historic date ranges. It is the very outcome of this step that allows our build stage to be an effective delta.
Hey @adamcunnington-mlg - sadly we didn't have the chance to take a stab at this, and I don't see us being able to in the near future. @dataders if you do end up re-opening the box of managed sources, this should be on your radar ;)
@Fleid thanks for the update - although it's super disappointing! What would it take to get commitment against an upcoming minor version? We'd offer to PR this, but my feeling is that it will take a lot to get accepted: it touches a fairly fundamental part that affects a core concept in the docs, and I suspect there could be varying opinions on how it should work. Perhaps you'd want multi-adapter support from the get-go - is there a precedent for extending some generic functionality but filling in the implementation for just one backend? Perhaps we will fork so we can add it for BQ.
Removing this from the v1.6 milestone for now. However: I do think there are some compelling use cases that we could enable with faster/cheaper source freshness checks. The trickiest thing about the approach here will be our need to run a single query, and then map that query's result back to multiple nodes, rather than running individual queries (in parallel) for each source node. As we chatted about further up in the thread, there's some precedent for this: that's the pattern dbt follows right now for …
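The one-query-to-many-nodes fan-out described above might be sketched like this (a hedged illustration: the function, key format, and fallback behaviour are all assumptions, not dbt internals):

```python
def resolve_freshness(sources, metadata, per_table_query):
    """Resolve freshness for many sources from one shared metadata
    result, falling back to an individual query for any table that
    the metadata view does not cover."""
    results = {}
    for source in sources:
        if source in metadata:
            # Already answered by the single metadata query: no extra cost.
            results[source] = metadata[source]
        else:
            # Fall back to today's per-table freshness query.
            results[source] = per_table_query(source)
    return results


metadata = {"raw.orders": "2023-04-01T12:00:00"}
out = resolve_freshness(
    ["raw.orders", "raw.events"],
    metadata,
    per_table_query=lambda s: "queried:" + s,
)
print(out["raw.events"])  # queried:raw.events
```

Keeping a per-table fallback matters because metadata coverage differs by platform; sources that can't be answered from the shared result still get a correct (if slower) answer.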
It's not unreasonable to scope that down to BQ only at the moment. We need to be pragmatic. @adamcunnington-mlg I understand you not wanting to waste cycles. I do think if you can get a lightweight implementation out, for BQ, following on @jtcohen6's advice above, there is a path to getting it merged. |
@Fleid thanks, we'll take a look at it - although even then, it does seem like a fairly significant change. Am I right in understanding that the work will be split in the following ways:
@adamcunnington-mlg I'm closing this as solved in 1.7.3. |
Is this your first time submitting a feature request?
Describe the feature
Currently, source freshness can only be determined by querying a column in a table. When there are a lot of tables, even with a high number of threads, the amount of time it takes to compute source freshness might be unacceptably long - it's effectively just metadata collation after all.
However, lots of databases track table modification times (although they don't always isolate DDL changes from DML changes) and expose this via a metadata route. For example, BigQuery has the well-known INFORMATION_SCHEMA views, which expose various metadata attributes.
This isn't consistently available, but there's already a precedent for adapter-specific functionality/availability in dbt.
I propose a config parameter, perhaps `location`, is exposed to allow the user to control where the freshness information comes from (query/metadata). Metadata would only be supported for specific adapters, and where it is, the backend implementation is probably a lookup against the platform's metadata views. It would massively speed up the compilation of source freshness and reduce query load/cost (although the latter is negligible and not the main motivation for the FR).
Lastly, a significant benefit is that it doesn't rely on a specific column being consistently available - it provides a more generic implementation for the masses.
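However the timestamp is obtained (column query or metadata lookup), the classification step is the same; a minimal sketch, assuming timedelta-style thresholds in the spirit of dbt's warn_after / error_after freshness config:

```python
from datetime import timedelta


def freshness_status(age, warn_after, error_after):
    """Classify a source's age against warn/error thresholds,
    modelled loosely on dbt's warn_after / error_after config."""
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"


status = freshness_status(
    age=timedelta(hours=5),
    warn_after=timedelta(hours=1),
    error_after=timedelta(hours=12),
)
print(status)  # warn
```

Because this logic only needs an age, swapping the source of the timestamp from a per-table query to a shared metadata view leaves the pass/warn/error semantics untouched.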
Describe alternatives you've considered
An alternative is to somehow allow a per-statement increase in threads, so we could just spam the database with loads of queries to evaluate source freshness faster.
Who will this benefit?
Everyone! But the benefit is especially pronounced for those who:
Are you interested in contributing this feature?
Yes - I can get one of my devs to help, although we're new to contributing, so it may be a slower route if the core team can afford it the priority.
Acceptance Criteria
dbt Labs Supported Adapter Implementation Links