
Handled duplicates during bulk_create calls #1717

Closed
bmbouter wants to merge 1 commit

Conversation

@bmbouter (Member) commented Nov 9, 2021

Multiple processes inserting the same objects into the DB can cause unexpected duplicates. Now that we're using Django 3+, we can use the `ignore_conflicts=True` param for bulk_create.

closes #8967

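For background, a rough sketch of what the parameter changes (placeholder model and objects, not code from this PR): Django's bulk_create normally issues a plain multi-row INSERT, so a unique-constraint violation raises IntegrityError and the batch fails, while ignore_conflicts=True asks the database to skip conflicting rows instead (on PostgreSQL the INSERT gets an ON CONFLICT DO NOTHING clause).

# Placeholder sketch; `MyModel` and `objs` stand in for the real querysets touched by this PR.

# Default behavior: any row that violates a unique constraint raises IntegrityError
# and the whole batch is rolled back.
MyModel.objects.bulk_create(objs)

# With ignore_conflicts=True, conflicting rows are silently skipped and the remaining
# rows are inserted, so a concurrent writer no longer blows up the task.
MyModel.objects.bulk_create(objs, ignore_conflicts=True)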
@bmbouter (Member Author) commented Nov 9, 2021

This was inspired by this other PR.

Note: this does not adjust this call to bulk_create(...) because I wasn't sure whether ignore_conflicts=True would return the data as expected for those duplicates, so my changes here just leave it alone.
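A rough sketch of that concern (placeholder names; the behavior matches what the test script further down confirms): the list returned by bulk_create(..., ignore_conflicts=True) is just the in-memory objects that were passed in, so for rows that conflicted it does not reflect what is actually stored in the database.

# Placeholder sketch of why the other bulk_create call was left untouched.
created = SomeModel.objects.bulk_create(candidates, ignore_conflicts=True)

# `created` simply echoes back `candidates`; entries that conflicted are NOT
# replaced by the existing database rows, so downstream code that expects the
# persisted values could silently operate on stale or wrong data.
for obj in created:
    handle(obj)  # hypothetical downstream use of the "created" objects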

@pulpbot (Member) commented Nov 9, 2021

Attached issue: https://pulp.plan.io/issues/8967

@@ -769,7 +769,7 @@ def add_content(self, content):
                )
            )

-        RepositoryContent.objects.bulk_create(repo_content)
+        RepositoryContent.objects.bulk_create(repo_content, ignore_conflicts=True)
Contributor:

I find it really confusing that we could be seeing duplicates on these two tables (this one and the one below) when only one task ought to hold an exclusive lock on the repository table.

Member:

I agree. A conflict here would indicate a real bug elsewhere.

bmbouter (Member Author) commented Nov 10, 2021:

See the big comment below, but a) this doesn't help resolve the problem and b) due to the locking this is unnecessary. We shouldn't merge this line change.

@@ -396,7 +396,9 @@ async def _handle_remote_artifacts(self, batch):
         # Artifact sha256 for our ordering.
         if ras_to_create:
             ras_to_create_ordered = sorted(list(ras_to_create.values()), key=lambda x: x.sha256)
-            await sync_to_async(RemoteArtifact.objects.bulk_create)(ras_to_create_ordered)
+            await sync_to_async(RemoteArtifact.objects.bulk_create)(
+                ras_to_create_ordered, ignore_conflicts=True
+            )
Member:

If we didn't add it here, would it be broken with respect to concurrency?
I am wondering why we wouldn't have had a workaround in place before this feature was added to Django.

Member:

Maybe this is the very place where issue #8967 happens. XD

bmbouter (Member Author):

I think this is a valuable change that we just haven't hit due to the race condition being rare. I am +1 to making this change.

Member:

Do you still intend to make this single change? I think we agreed on it.

bmbouter (Member Author):

I was on the fence. We agreed it seems valuable, but also, is it right in 100% of cases with no side effects? If you're sure, I don't mind opening a PR. It also needs an issue. Wdyt?

Member:

It's a tough call between "if it ain't broken..." and "will we remember, once we get a ticket filed".
Does Django still populate the in-memory representations for the ignored entries? Is it like get_or_create in that respect?

bmbouter (Member Author):

I don't know, and I've actually been wondering this for a while. Maybe I should do a few minutes of testing...

bmbouter (Member Author):

Here's a simple test example. It confirms that we do not re-hydrate data from the database in cases where there would be database conflicts. It also confirms that we do not overwrite the database data with the in-memory data we attempted to save. This is part of what worries me about broad use of ignore_conflicts.

import os

os.environ["DJANGO_SETTINGS_MODULE"] = "pulpcore.app.settings"
import django
django.setup()


from pulpcore.app.models import Task

Task.objects.all().delete()

# Create a row that already exists in the DB.
new_task = Task(state='completed')
new_task.save()

# The first object conflicts with the existing row on pk; the second is brand new.
bulk_tasks = [
    Task(pk=new_task.pk, state='canceled'),
    Task()
]

objs = Task.objects.bulk_create(bulk_tasks, ignore_conflicts=True)

print(objs)  # Shows `[Task:  [canceled], Task:  []]`, i.e. the in-memory objects, not data re-hydrated from the DB

print(Task.objects.get(pk=new_task.pk))  # Prints `Task:  [completed]`, so the conflicting in-memory obj wasn't saved

Member:

So we leave it until we know it's a bug.

bmbouter (Member Author):

Agreed. 👍

@bmbouter (Member Author) commented:

OK so I've done more work with this and I've learned a few things:

  1. The reproducer script here does cause the traceback in the same comment 100% of the time.
  2. With this patch applied, the script still causes the same exception 100% of the time.

Reading the traceback more closely, it's not a duplicate key error. The error we actually get is a foreign key violation like Key (version_added_id)=(a5c43989-e695-4f07-9bdb-0f879b9cdd31) is not present in table "core_repositoryversion", which I also get locally. When I enter the debugger after that bulk_create raises, I can look at the core_repositoryversion table and indeed there is no RepositoryVersion with a pk starting with a5c4... Basically that repo version doesn't exist except in memory?

When I look at the repo in memory backing that missing RepositoryVersion, its name is: '../../../AppStream/x86_64/os/-31410651d4c07a347a6dbba79cfb4df7c6418e39b0a19c70a84eff2c8b3b8953'.

The root cause isn't clear to me, but I think it may have to do with the creation of hidden repos. For example, when I use the same reproducer to sync a different repo (one that doesn't produce a hidden repo), it works just fine with no pulpcore changes. That sync URL is http://mirror.centos.org/centos-7/7/os/x86_64/.

@@ -15,7 +15,7 @@ def setUp(self):
        for _ in range(0, 5):
            contents.append(Content(pulp_type="core.content"))

-        Content.objects.bulk_create(contents)
+        Content.objects.bulk_create(contents, ignore_conflicts=True)
bmbouter (Member Author):

This is in test code; I don't think we need this change.

@mdellweg (Member) commented:

Is it possible that the hidden repo version you are referring to is not yet in the database, because it may have ended up in a transaction it wasn't meant to be in? I once had a theory that the Django ORM in combination with async Python may end up doing such things, because code is interleaved in an unpredictable manner.
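A small sketch of that theory (it reuses the DJANGO_SETTINGS_MODULE pattern from the test script above and assumes PostgreSQL's default READ COMMITTED isolation): a row saved inside a transaction that hasn't committed yet is invisible to every other database connection, so an insert from another connection that references it fails with exactly this kind of "is not present in table" foreign key error.

import os
import threading

os.environ["DJANGO_SETTINGS_MODULE"] = "pulpcore.app.settings"
import django
django.setup()

from django.db import transaction
from pulpcore.app.models import Task

def check_from_other_connection(pk, result):
    # Each thread gets its own database connection in Django.
    result.append(Task.objects.filter(pk=pk).exists())

result = []
with transaction.atomic():
    task = Task(state='completed')
    task.save()  # saved, but only inside this still-open transaction

    t = threading.Thread(target=check_from_other_connection, args=(task.pk, result))
    t.start()
    t.join()

print(result)  # [False]: the other connection cannot see the uncommitted row
print(Task.objects.filter(pk=task.pk).exists())  # True, now that the transaction has committed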

bmbouter closed this Nov 16, 2021
bmbouter deleted the add-ignore-duplicates branch November 16, 2021 16:36
@bmbouter (Member Author) commented:

I'm closing this and moving the issue back to NEW for the pulp_rpm team to comment.
