[CT-137] [Bug] Snapshot job creates duplicate valid records for unique_key during 2 colliding runs #4661
Do you have any automated tests to catch this? We've added some tests on some of our snapshots just to catch odd behavior or weirdness in our source data or human error. Something like this:
We also add an
Then we have this test on top of the filter to make sure that every entry has one and only one
And then this is the macro for our custom test (there's probably something better now in dbt_expectations; we wrote this about 18 months ago):
I know the above won't solve your problem, but maybe it helps you sleep a little better knowing that you'll be warned immediately if the problem does happen again.
Thanks for sharing your approach. We don't currently have anything like this in place, but I think we'll certainly add something, using your tests as a jumping-off point.
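As an aside, the invariant those tests enforce (exactly one open record per unique key) can be sketched outside dbt as well. A minimal Python version, with illustrative column names:

```python
from collections import Counter

def open_record_violations(rows, key="org_id"):
    """Return the unique keys that do not have exactly one open record
    (dbt_valid_to is null) in a snapshot-like row set."""
    open_counts = Counter(r[key] for r in rows if r["dbt_valid_to"] is None)
    all_keys = {r[key] for r in rows}
    return {k for k in all_keys if open_counts.get(k, 0) != 1}

rows = [
    {"org_id": 1, "dbt_valid_to": None},
    {"org_id": 1, "dbt_valid_to": "2022-01-01"},  # properly closed history
    {"org_id": 2, "dbt_valid_to": None},
    {"org_id": 2, "dbt_valid_to": None},  # duplicate open record: the bug
]
print(open_record_violations(rows))  # {2}
```

Run as a scheduled check, a non-empty result is exactly the "you'll be warned immediately" signal the comment describes.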
The way I've patched this up in the meantime is to create a shim table which imputes what `dbt_valid_to` should be for each entry. The SQL is below. I've validated the code on a properly functioning snapshot, and it gives a result identical to the correct dbt behavior.

```sql
with

ranked_entries as (
    select
        org_id,
        billed_month,
        dbt_valid_from,
        rank() over (
            partition by org_id, billed_month order by dbt_valid_from
        ) as valid_from_rank,
        dbt_valid_to
    from {{ ref('billing_totals_snapshot') }}
),

synthetic_valid_to as (
    select
        ranked_entries.org_id,
        ranked_entries.billed_month,
        ranked_entries.dbt_valid_from,
        next_entries.dbt_valid_from as synthetic_dbt_valid_to
    from ranked_entries
    left join ranked_entries as next_entries
        on ranked_entries.org_id = next_entries.org_id
        and ranked_entries.billed_month = next_entries.billed_month
        and ranked_entries.valid_from_rank + 1 = next_entries.valid_from_rank
)

select
    billing_totals_snapshot.dbt_valid_from,
    synthetic_valid_to.synthetic_dbt_valid_to,
    billing_totals_snapshot.org_id,
    billing_totals_snapshot.billed_month,
    ...
from {{ ref('billing_totals_snapshot') }}
inner join synthetic_valid_to
    on billing_totals_snapshot.org_id = synthetic_valid_to.org_id
    and billing_totals_snapshot.billed_month = synthetic_valid_to.billed_month
    and billing_totals_snapshot.dbt_valid_from = synthetic_valid_to.dbt_valid_from
```
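For illustration, the same imputation can be sketched in Python. This models each snapshot row as a dict and sets every entry's valid-to to the next entry's valid-from within its (org_id, billed_month) group, mirroring the rank/self-join approach (ties in dbt_valid_from are resolved by sort order here rather than by rank()):

```python
from collections import defaultdict

def impute_valid_to(rows):
    """Sort each (org_id, billed_month) group's entries by dbt_valid_from and
    set every entry's synthetic valid-to to the next entry's valid-from
    (None for the most recent entry)."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["org_id"], r["billed_month"])].append(r)
    out = []
    for entries in groups.values():
        entries = sorted(entries, key=lambda r: r["dbt_valid_from"])
        for cur, nxt in zip(entries, entries[1:] + [None]):
            out.append({**cur,
                        "synthetic_dbt_valid_to":
                            nxt["dbt_valid_from"] if nxt else None})
    return out

rows = [
    {"org_id": 1, "billed_month": "2022-01", "dbt_valid_from": "2022-01-01"},
    {"org_id": 1, "billed_month": "2022-01", "dbt_valid_from": "2022-02-01"},
]
print(impute_valid_to(rows)[0]["synthetic_dbt_valid_to"])  # 2022-02-01
```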
Hey @rdeese, thanks for opening this. If this is a critical part of the data pipeline, we highly recommend trying to get the real event stream from some kind of change data capture system, since that's going to be the most accurate. But if that's not the case, here are a few areas you can look into:
I am having the same issue: almost daily, some (not all) of the snapshots contain duplicated rows per unique key for a valid record (i.e. `dbt_valid_to` is null).
We are also using the
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
I ran into this as well; it was happening for me when two merge statements tried to run at the same time. The sinister part is that, for me, the duplicate-record bug didn't surface until at least 2 snapshots after the async run. Reproducibility steps:
1. Create the table with one value:
2. Snapshotting code: DBT_TEST_SNAP.sql
3. Update the value:
4. Run the snapshot twice at the same time:
5. Update the record again:
6. Update the record again:
When they don't run at the same time, everything is fine, but ideally there'd be some parameterizable logic to lock the table while a snapshot is already running.
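One way to approximate that lock outside the warehouse is an advisory lock in the orchestrator. This is a hypothetical sketch (not a dbt or Snowflake feature) using an exclusive lock file; a production version would lock in the scheduler or rely on warehouse transactions:

```python
import os
import tempfile

class SnapshotLock:
    """Best-effort advisory lock: only one runner may snapshot a given table
    at a time. Uses an exclusive (O_EXCL) lock file; purely illustrative."""

    def __init__(self, table, lock_dir):
        self.path = os.path.join(lock_dir, f"snapshot_{table}.lock")
        self.fd = None

    def acquire(self):
        try:
            # O_EXCL makes creation atomic: it fails if the file exists.
            self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            return True
        except FileExistsError:
            return False

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.path)
            self.fd = None

lock_dir = tempfile.mkdtemp()
first = SnapshotLock("billing_totals", lock_dir)
second = SnapshotLock("billing_totals", lock_dir)
print(first.acquire())   # True: first run proceeds
print(second.acquire())  # False: the colliding run is refused
first.release()
print(second.acquire())  # True: retry succeeds once the first run finishes
second.release()
```

A crashed run would leave the lock file behind, so real implementations add a timeout or heartbeat; this sketch only shows the mutual-exclusion idea.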
Digging into the Snowflake documentation a bit, it sounds like Snowflake should block these simultaneous merge statements by locking the table in a transaction. But it also sounds like people have seen some wonky behavior when the statements use the same connection. The session IDs of the two simultaneous merge statements were different, but it's unclear to me whether that makes a difference and whether these transactions could bump into each other. One thought I had for testing the theory is naming the "begin" transaction block when the snowflake_dml_explicit_transaction macro is called from the Snowflake adapter. If the transactions are named, I imagine it should eliminate any transaction-melding that might be happening, if that is what's causing the issue. But at this point, it's just a theory.
Faced the same original issue. It seems dbt snapshotting doesn't read data types consistently, so the issue can be resolved by explicitly casting all columns to their intended types (e.g., cast all numeric columns to numeric; snapshotting sometimes reads them interchangeably as float and numeric). Make sure to cast when you read from the source.
We are encountering this issue as well, although I am only seeing duplicate unique records with exactly 2 null values for `dbt_valid_to`. Having looked through debug logs, I believe this is the result of 2 colliding jobs running snapshot commands for the same snapshot at (nearly) the precise same time. I'm now trying to figure out a way to avoid these collisions without having to manually adjust schedules or --exclude flags on all of our prod jobs.
Another user running into this issue: duplicate records are created when the snapshot job has 2 colliding runs. Sample data, where unique_key='cv_transaction_sid' and check_cols=['md_version_hash']:
We can see that row 1 is OK; rows 2 and 3 have the same MD_VERSION_HASH but different DBT_SCD_ID, while rows 4 and 5 have the same MD_VERSION_HASH and the same DBT_SCD_ID. How it happened:
Attached are the debug logs for colliding runs of the same job on dbt Cloud.
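A sketch of why colliding runs can produce both patterns: dbt derives `dbt_scd_id` from a hash over the unique key and a timestamp. The exact inputs and concatenation vary by adapter and dbt version, so treat this as illustrative, not dbt's actual macro:

```python
import hashlib

def scd_id(unique_key, snapshot_ts):
    """Rough sketch of a dbt_scd_id-style surrogate key: an MD5 hash over
    the unique key and the snapshot timestamp (exact format is assumed)."""
    return hashlib.md5(f"{unique_key}|{snapshot_ts}".encode()).hexdigest()

# Colliding runs that hash slightly different timestamps insert the same
# logical change under different scd_ids (like rows 2 and 3 above)...
print(scd_id("cv_1", "2024-01-01 10:00:00.123") !=
      scd_id("cv_1", "2024-01-01 10:00:00.456"))  # True
# ...while runs that hash the identical timestamp insert exact duplicates
# (like rows 4 and 5 above).
print(scd_id("cv_2", "2024-01-01 10:00:00.123") ==
      scd_id("cv_2", "2024-01-01 10:00:00.123"))  # True
```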
I also experienced this issue. Steps to replicate:
What I have observed is that if a duplicate appears for a
I've also encountered this behavior intermittently, and I suspect it's caused by some sort of network hiccup in our environment (AWS Airflow, EKS, Snowflake). We have a number of tables using the check-all-cols approach, since the source doesn't have timestamps we can use, and for now this is the only way we can get CDC. Both our dev and prod environments point to the same source, but occasionally one environment will produce duplicates while the other does not. I confirmed that Airflow only triggered the snapshots once, then went digging through the logs on Snowflake. What I eventually found was that the temp table for the merge was created only once, but the merge statement was executed twice for some reason, about 2 hours apart. As a workaround I'm thinking about creating a post-hook for these snapshots to de-dupe them. (Edit: and I should mention, we're not using dbt retry on our snapshot runs.)
@graciegoheen This issue happens with only one dbt command running at a time. I've confirmed this in both our Airflow and Snowflake logs. If the command ran twice, I would see dbt's temp table being created twice and the merge statement being run twice. Instead what I see is this:
This is the underlying cause of the behavior @elsander noted before me. Something is causing the merge statement to be triggered twice, which results in brand-new rows being inserted twice in close succession. The workaround I have for the moment is running a post-hook on the largish snapshots that seem to be affected; it's always snapshots that take multiple hours to run. It's not very efficient, but basically I run this after:
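The commenter's exact post-hook SQL isn't shown above; as a hypothetical stand-in, a de-dupe of this kind just keeps one row per (unique_key, dbt_scd_id), dropping the extra copies that a doubled merge inserted:

```python
def dedupe_snapshot(rows):
    """Keep a single row per (unique_key, dbt_scd_id); the earliest
    dbt_valid_from wins. Illustrative in-memory stand-in for a SQL de-dupe."""
    seen, out = set(), []
    for r in sorted(rows, key=lambda r: r["dbt_valid_from"]):
        k = (r["unique_key"], r["dbt_scd_id"])
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

rows = [
    {"unique_key": "a", "dbt_scd_id": "x", "dbt_valid_from": "t1"},
    {"unique_key": "a", "dbt_scd_id": "x", "dbt_valid_from": "t1"},  # doubled insert
    {"unique_key": "a", "dbt_scd_id": "y", "dbt_valid_from": "t2"},
]
print(len(dedupe_snapshot(rows)))  # 2
```

In a warehouse this maps to a delete or insert-overwrite driven by `row_number() over (partition by unique_key, dbt_scd_id order by dbt_valid_from)`.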
Would you be able to provide a reproducible example? Are there specific circumstances or code that cause this to happen (besides the snapshot table being large)?
Unfortunately not; I spent a fair bit of time trying to find a way to reproduce it but was unsuccessful. There does seem to be an element of randomness to it, e.g. for a while this was happening to us in our dev environment but not in prod, despite both being identical in every way, including the data being transformed.
From my perspective, these snapshot merge statements should be wrapped in a transaction if possible. With big queries, depending on configuration, you could have two different snapshot commands executing at the same time, especially if a run is long, and that's when this behavior has occurred for me. If each command were wrapped in a transaction, we wouldn't end up with both queries performing async updates to the same table... I think; I haven't tested. My steps above still work for me to replicate this issue.
The only thing I'll add here is that I was able to replicate this issue (though intermittently) with a small test table. So it may be easier to see this behavior with a larger table, but it's not strictly necessary.
Is there an existing issue for this?
Current Behavior
One of my running snapshots is amassing multiple valid records per unique_key. The competing valid records are produced regardless of whether a value in one of the `check_cols` has changed. Read on for details.
I have this snapshot:
The underlying table, `billing_totals`, is unique for the unique key. I wrote a query to check whether any unique keys have more than one valid record in the snapshot table. Running it, I get many results (336,755 to be precise). I would expect zero.
My first guess was that the snapshot was recording real changes to the data, but not setting `dbt_valid_to` on the invalidated records. So I wrote a query to group competing valid records by the combined values of their check columns. If every competing record represented an actual change, we should get `1` for `identical_valid_records` in all cases. I found that in most cases (330K out of 336K), `identical_valid_records` = `competing_valid_records`, i.e. all of the competing valid records are identical in the check columns.
Another thing that I find perplexing about this is that the number of competing valid records is not the same for all affected records. For a snapshot that has been run ~50 times (twice daily), the largest numbers of competing valid records are 21, 20, and 2, but 3-19 are also represented.
Expected Behavior
I expect that a snapshot table will have a single valid (i.e. `dbt_valid_to` is `NULL`) record for each `unique_key`.
Steps To Reproduce
Unfortunately I don't have steps to reproduce this, yet.
Relevant log output
No response
Environment
Link to python:3.7.9-slim image.
What database are you using dbt with?
redshift
Additional Context
I have other snapshots running that are unaffected by this issue.
There is a previous issue that appears to be similar to this one, #2607. I decided to make a new issue because that one is fairly stale (the last substantive discussion was in 2020), has a confusing history (the ticket at first claimed the opposite problem, but was later renamed), and much of its discussion involves post-hooks, which don't apply to my case.
Hopefully this new issue is an opportunity for a more focused discussion of the problem, but I'm happy to move the conversation over to that ticket if the maintainers prefer.