Handled duplicates during bulk_create calls #1717
Conversation
Multiple processes inserting the same objects into the DB can cause unexpected duplicates. Now that we're using Django 3+, we can use the `ignore_conflicts=True` param for `bulk_create`. Closes #8967.
This was inspired by this other PR. Note: this does not adjust this call to
Attached issue: https://pulp.plan.io/issues/8967
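For context on what the flag changes, here is a minimal sketch against a hypothetical Widget model with a unique name field (not the actual Pulp call sites touched below): without ignore_conflicts=True a conflicting row fails the whole batch with an IntegrityError, while with it the conflicting row is silently skipped.

# Minimal sketch; Widget is a hypothetical model with a unique "name" field.
from django.db import IntegrityError
from myapp.models import Widget  # hypothetical app/model, for illustration only

Widget.objects.create(name="duplicate")
rows = [Widget(name="duplicate"), Widget(name="unique")]

try:
    # Without the flag, the conflicting row aborts the entire batch.
    Widget.objects.bulk_create(rows)
except IntegrityError:
    pass

# With the flag, the conflicting row is skipped and the other row is inserted;
# Django neither fetches the existing row nor reports which rows were skipped.
Widget.objects.bulk_create(rows, ignore_conflicts=True)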
@@ -769,7 +769,7 @@ def add_content(self, content):
                )
            )

-        RepositoryContent.objects.bulk_create(repo_content)
+        RepositoryContent.objects.bulk_create(repo_content, ignore_conflicts=True)
I find it really confusing that we could be seeing duplicates on these two tables (this one and the one below) when only one task ought to have an exclusive lock on the repository table?
I agree. A conflict here would indicate a real bug elsewhere.
See the big comment below, but a) this doesn't help resolve the problem and b) due to the locking this is unnecessary. We shouldn't merge this line change.
@@ -396,7 +396,9 @@ async def _handle_remote_artifacts(self, batch):
             # Artifact sha256 for our ordering.
             if ras_to_create:
                 ras_to_create_ordered = sorted(list(ras_to_create.values()), key=lambda x: x.sha256)
-                await sync_to_async(RemoteArtifact.objects.bulk_create)(ras_to_create_ordered)
+                await sync_to_async(RemoteArtifact.objects.bulk_create)(
+                    ras_to_create_ordered, ignore_conflicts=True
+                )
If we didn't add it here, would it be broken with respect to concurrency?
I am wondering why we would not have had a workaround in place before this feature was added to Django.
Maybe this is the very place where issue #8967 happens. XD
I think this is a valuable change that we just haven't hit due to the race condition being rare. I am +1 to making this change.
Do you still intend to make this single change? I think we agreed upon it.
I was on the fence. We agreed it seems valuable, but also, is it right in 100% of cases with no side effects? If you're sure, I don't mind opening a PR. Also, it needs an issue. Wdyt?
It's a tough call between "if it ain't broken..." and "will we remember, once we get a ticket filed".
Does Django still populate the in-memory representations for the ignored entries? Is it like get_or_create in that respect?
I don't know and I've been wondering this for a while actually. Maybe I should do a few minutes of testing...
Here's a simple test example. It confirms that we do not re-hydrate data from the database in cases where there would be database conflicts. It also confirms that we do not overwrite the database data with the in-memory data we attempted to save. This is partly what worries me about broad use of ignore_conflicts.
import os
os.environ["DJANGO_SETTINGS_MODULE"] = "pulpcore.app.settings"
import django
django.setup()
from pulpcore.app.models import Task

# Start from a clean slate and save one Task that will conflict on pk.
Task.objects.all().delete()
new_task = Task(state='completed')
new_task.save()

# The first object reuses the existing pk (a conflict); the second is new.
bulk_tasks = [
    Task(pk=new_task.pk, state='canceled'),
    Task()
]
objs = Task.objects.bulk_create(bulk_tasks, ignore_conflicts=True)
print(objs)  # Shows `[Task: [canceled], Task: []]`, i.e. the in-memory objects come back unchanged
print(Task.objects.get(pk=new_task.pk))  # Prints `Task: [completed]`, so the conflicting row was not overwritten
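To answer the get_or_create comparison from earlier, a hedged follow-up under the same assumptions as the test above: get_or_create does hand back the row that is already in the database, which is exactly what bulk_create with ignore_conflicts=True does not do.

# Continuing the sketch above: get_or_create re-hydrates the existing row,
# while bulk_create(ignore_conflicts=True) only skips the insert.
existing, created = Task.objects.get_or_create(
    pk=new_task.pk, defaults={'state': 'canceled'}
)
print(created)         # False -- the row already existed, so nothing was inserted
print(existing.state)  # 'completed' -- the database row comes back, not our defaults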
So we leave it until we know it's a bug.
Agreed. 👍
OK, so I've done more work on this and I've learned a few things:
In reading the traceback more closely, it's not that there is a duplicate key error. The error we actually get is more like
When I look at the repo in memory backing that missing repositoryversion, its name is:
The root cause of this is not clear to me, but I think it has to do with the creation of hidden repos, maybe? For example, when I use the same reproducer to sync a different repo (which doesn't produce a hidden repo), it works just fine with no pulpcore changes. That sync URL is
@@ -15,7 +15,7 @@ def setUp(self):
        for _ in range(0, 5):
            contents.append(Content(pulp_type="core.content"))

-        Content.objects.bulk_create(contents)
+        Content.objects.bulk_create(contents, ignore_conflicts=True)
This is in test code, so I don't think we need this change.
Is it possible that the hidden repo version you are referring to is not yet in the database, because it may have ended up in a transaction it wasn't meant to be in? I once had a theory that the Django ORM in combination with async Python may end up doing such things, because code is interleaved in an unpredictable manner.
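To illustrate that theory with a sketch (reusing the Task model from the test above rather than the real sync code): writes issued from separate sync_to_async hops run as separate autocommit statements unless something wraps them in a transaction, so interleaved coroutines can observe partially written state; bundling the related writes into one synchronous function wrapped in transaction.atomic keeps them in a single transaction.

# Sketch only; the model and the unit of work are stand-ins, not the real sync code.
from asgiref.sync import sync_to_async
from django.db import transaction
from pulpcore.app.models import Task

def _create_related_rows():
    # Both writes run in one synchronous function, on one connection, inside one
    # transaction, so interleaved coroutines cannot split them across transactions.
    with transaction.atomic():
        first = Task.objects.create(state='completed')
        second = Task.objects.create(state='completed')
        return first, second

async def handle_batch():
    # One sync_to_async hop for the whole unit of work, instead of one hop per
    # ORM call whose transaction boundaries may not line up.
    return await sync_to_async(_create_related_rows)()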
I'm closing and moving the issue back to NEW for the pulp_rpm team to comment.