Duplicates seen with merge operation #1330

Closed
Kiran-G1 opened this issue Aug 11, 2022 · 4 comments

Labels
bug Something isn't working

Comments

@Kiran-G1

(deltaTable.alias("original")
    .merge(batch_of_records.alias("updates"), "original.id = updates.id")
    .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute())

The id column used in this example is unique.
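
One quick way to double-check the uniqueness claim for a batch is to count how often each id occurs. The sketch below does this in plain Python on a list of dicts; `find_duplicate_ids` and the sample rows are hypothetical and not taken from this issue. On a Spark DataFrame the analogous check would be a `groupBy("id").count()` filtered to counts above 1.

```python
from collections import Counter

def find_duplicate_ids(rows, key="id"):
    """Return {key_value: count} for every key value that occurs more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical sample batch: id 2 appears twice.
batch = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"id": 2, "value": "c"},
]
duplicates = find_duplicate_ids(batch)
print(duplicates)  # {2: 2}
```

An empty result means every id in the batch is unique, at least for the rows that were checked.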

Kiran-G1 added the bug label Aug 11, 2022
nkarpov self-assigned this Aug 11, 2022
@nkarpov
Collaborator

nkarpov commented Aug 11, 2022

Hi @Kiran-G1, can you share a more complete example with sample data? A full reproduction will help us confirm and track this down. Here is a good example: #1279

@tdas
Contributor

tdas commented Aug 16, 2022

There is a lot of very relevant discussion on duplicates in merge in #527.

@Kiran-G1
Author

Kiran-G1 commented Sep 21, 2022

I partially replicated this issue with a different use case.

The use case: when a new record arrives, a merge into the Delta table should set the old record's datetime to the current time and the latest record's datetime to 2050.

But all of the records were treated as new records...

image

I noticed that all those records were inserted into the Delta table at the same time (see the load time).

My hunch is that, because Spark writes to the Delta table in parallel and all these records were inserted at the same time, a race condition kept the merge condition from being satisfied.
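
To make the hunch concrete, here is a toy model in plain Python, not Delta Lake code. It assumes that several rows for the same id arrive in one batch and that every source row is matched against the target as it looked before the merge started, so rows within one batch cannot see each other. `merge_batch`, `open_count`, the field names, and the sample data are all hypothetical.

```python
from datetime import datetime

FAR_FUTURE = datetime(2050, 1, 1)  # assumed marker for the latest (open) record

def merge_batch(target, batch, now):
    """Toy merge against a snapshot of `target`.

    Open target rows whose id appears in the batch are closed (end = now).
    Every source row is then added as an open row; rows inside the same
    batch are never matched against each other.
    """
    batch_ids = {src["id"] for src in batch}
    merged = []
    for row in target:
        row = dict(row)  # copy so the input list is left untouched
        if row["end"] == FAR_FUTURE and row["id"] in batch_ids:
            row["end"] = now  # matched: close the old record
        merged.append(row)
    for src in batch:
        merged.append({"id": src["id"], "value": src["value"], "end": FAR_FUTURE})
    return merged

def open_count(rows, key):
    """Number of rows for `key` that are still marked as the latest record."""
    return sum(1 for r in rows if r["id"] == key and r["end"] == FAR_FUTURE)

now = datetime(2022, 9, 21)
batch = [{"id": 7, "value": v} for v in ("v1", "v2", "v3")]

# All three versions arrive in one merge: none finds a match in the target.
all_at_once = merge_batch([], batch, now)

# The same versions merged one at a time: each closes its predecessor.
one_by_one = []
for src in batch:
    one_by_one = merge_batch(one_by_one, [src], now)

print(open_count(all_at_once, 7))  # 3
print(open_count(one_by_one, 7))   # 1
```

In this model the "everything is new" symptom shows up whenever rows that share an id land in the same merge, without any timing race between writers. Whether that is what happened here would need the actual code and data.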

@tdas @nkarpov

@allisonport-db
Collaborator

Can we focus this discussion on the original issue #527? This is the same problem. Can you add more information on this replication there (code, source/target details, etc.)?
