Improve normalization incremental runtime (incremental DBT) #4286
Thanks for tracking this issue! This is a blocker for our company (Bolt) to adopt Airbyte. We want to use Airbyte to keep our data warehouse (BigQuery) in sync with our operational DB (Postgres, 500GB now but growing fast), and the current way normalization works would make Airbyte cost us a fortune, because BigQuery charges for both storage and queries.
This is exactly the same issue we had. After the bill shock from last month, we've had to turn Airbyte off completely in the interim.
Actually, dbt supports merge materialization. I made changes along these lines (a sample dbt model is sketched below) and it started loading incrementally: in my case the compiled SQL changed from a full `create or replace table ... as select` into a `merge` that only processes newly emitted rows.
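Here's a minimal sketch of that change as a dbt incremental model for BigQuery. The stream name, JSON columns, and unique key are assumptions for illustration; the models Airbyte actually generates will differ:

```sql
-- Incremental model: dbt-bigquery compiles this into a MERGE statement
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = '_airbyte_ab_id',
    cluster_by = ['_airbyte_emitted_at']
  )
}}

select
    json_extract_scalar(_airbyte_data, '$.id')   as id,
    json_extract_scalar(_airbyte_data, '$.name') as name,
    _airbyte_ab_id,
    _airbyte_emitted_at
from {{ source('airbyte_raw', '_airbyte_raw_users') }}

{% if is_incremental() %}
  -- on incremental runs, only read rows emitted after the newest row
  -- already present in the target table
  where _airbyte_emitted_at > (select max(_airbyte_emitted_at) from {{ this }})
{% endif %}
```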
Note that byte usage is still the same, because with BigQuery's on-demand pricing a WHERE condition alone doesn't reduce the bytes scanned, so this change wouldn't save you money by itself. To mitigate this you can either switch to BigQuery reserved (flat-rate) pricing (needs a minimum of ~$96/day) or we can probably cluster the resulting table by _airbyte_emitted_at.
That makes sense - to make incremental normalization as efficient as possible, the raw table should be clustered/sorted by whichever field is used for the incremental logic. I think this needs to be done at the destination level though, since the concepts vary between destinations (e.g. clustering in BigQuery vs sorting in Redshift). For instance, for Redshift it could be done here. I'm not sure whether this is in scope for this issue.
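For illustration, a hedged sketch of what that could look like per destination; the table and column names follow Airbyte's raw-table conventions, but the DDL each destination actually emits may differ:

```sql
-- BigQuery: cluster the raw table on the incremental cursor so a
-- WHERE _airbyte_emitted_at > ... filter prunes bytes scanned
create table if not exists airbyte_raw._airbyte_raw_users (
    _airbyte_ab_id string,
    _airbyte_data string,
    _airbyte_emitted_at timestamp
)
cluster by _airbyte_emitted_at;

-- Redshift: sort the raw table on the same column so range filters
-- can skip blocks via zone maps
create table if not exists airbyte_raw._airbyte_raw_users (
    _airbyte_ab_id varchar(36),
    _airbyte_data varchar(max),
    _airbyte_emitted_at timestamptz
)
sortkey (_airbyte_emitted_at);
```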
Update: as mentioned in https://airbytehq.slack.com/archives/C019WEENQRM/p1633606112398200, we achieved significant cost savings by clustering on _airbyte_emitted_at alone and writing custom dbt models that use dbt's incremental tables and only load new rows after each run.
Hi everyone, we have implemented this in custom dbt transformations and I'd like to share some issues we faced:
Filtering raw rows with a fixed lookback window (only as far back as the last run) will bring issues when the last job failed or was canceled for some reason after the destination succeeded (a transformation failure due to a BQ job limit, for example). We finally went with a multiplier of the connection's run cycle: if the connection runs every 1h, we look 6h back.
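For reference, a hedged sketch of that lookback filter inside a dbt incremental model (the 6-hour window is the comment's 6x multiplier on an hourly connection; re-read rows are assumed to be deduplicated by a merge on a unique key):

```sql
{% if is_incremental() %}
  -- look back 6x the connection's 1h run cycle instead of only one cycle,
  -- so rows from a run that failed after the destination succeeded are
  -- still picked up; duplicates are handled by the merge on _airbyte_ab_id
  where _airbyte_emitted_at >
      timestamp_sub(current_timestamp(), interval 6 hour)
{% endif %}
```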
Tell us about the problem you're trying to solve
We've received many reports in the past few weeks of normalization taking an incredibly long time to run.
One example:
logs-42-0.txt
A user synced a few hundred records from Zendesk. The sync took about 25s; normalization took about 50 minutes. For the record, the destination contains >200k records, but the individual sync read only a few hundred new ones. Slack thread
One very likely root cause is that normalization currently reads the entire _raw table and recreates the normalized tables completely from scratch every time a sync happens. This means work scales with the size of the total dataset. We want work to scale with the size of the new data written to the destination.

In the interest of complete intellectual honesty, I'm actually not 100% sure the issue is the size of the target dataset. The only reason I hesitate is that other users have reported syncing >100GB databases into warehouses. If this problem were this bad on 200k records from an API, then surely it would have ground the large DB syncs to a halt. So there might be more nuance to this issue than I'm realizing.
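For context, a minimal sketch of the shape the generated models effectively take today (a plain table materialization, with assumed names), which is why each run re-reads the whole raw table:

```sql
-- Rebuilt from the entire _raw table on every sync, so runtime scales
-- with total dataset size rather than with the number of new rows
{{ config(materialized = 'table') }}

select
    json_extract_scalar(_airbyte_data, '$.id') as id,
    _airbyte_ab_id,
    _airbyte_emitted_at
from {{ source('airbyte_raw', '_airbyte_raw_tickets') }}
```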
Describe the solution you’d like
I would like normalization runtime to scale with the size of data written, not the total data set. I would also like normalization to take a "reasonable" amount of time to run: ~300 new records out of 200k total taking 50 minutes seems crazy. 5 minutes would be acceptable; 1 minute would be amazing.
Describe the alternative you’ve considered or used
Not using normalization
Rolling out a custom connector which implements its own normalization