Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent double-publishing #1006

Merged
merged 5 commits into from
Oct 6, 2022
Merged

Prevent double-publishing #1006

merged 5 commits into from
Oct 6, 2022

Conversation

mvandenburgh
Copy link
Member

@mvandenburgh mvandenburgh commented Apr 6, 2022

Closes #956
Closes #955

Currently, there is a race condition where two or more simultaneous calls to the /publish endpoint may result in multiple publishes occurring. This PR fixes this in two parts:

  1. On the backend, the publishing process is now a celery task.
  2. On the frontend, when the user clicks the publish button it checks if a publish is ongoing, and if one is then it polls the server continuously until its finished and then displays a toast message linking to the new version.

dandiapi/api/views/version.py Outdated Show resolved Hide resolved
dandiapi/api/views/version.py Outdated Show resolved Hide resolved
web/src/views/DandisetLandingView/DandisetPublish.vue Outdated Show resolved Hide resolved
@mvandenburgh mvandenburgh marked this pull request as draft April 6, 2022 20:36
@mvandenburgh mvandenburgh force-pushed the prevent-double-publishing branch 2 times, most recently from 54651e6 to 727484c Compare April 20, 2022 19:51
@mvandenburgh
Copy link
Member Author

mvandenburgh commented Apr 20, 2022

Why not just disable the button instead, then poll to find out when the publish is finish to re-enable?

I guess the juicier question is what to expect if a bunch of people happen to all click publish on the same Dandiset at the same time. It seems like one of them would "win" the race, and the rest would receive an error message that needs to be handled?

@waxlamp I implemented this. This is what happens when two people try to publish at the same time (I added some sleep() calls to the backend to make the publish take a long time, I don't think it would ever take this long in a real publish)

dandipublish

@mvandenburgh mvandenburgh marked this pull request as ready for review April 20, 2022 20:07
@yarikoptic
Copy link
Member

@djarecka would you be so kind to give this PR a test , in particular in relation to #955 -- may be we should merge it and see in staging but I would love to see at least some test-drive by someone else. I will try to test drive it as well in upcoming days, so far haven't spotted anything questionable in the diff

@djarecka
Copy link
Member

@yarikoptic - you mean adding tests to the changes in vue code?

@mvandenburgh
Copy link
Member Author

mvandenburgh commented Jun 8, 2022

After coming back to this with fresh eyes after 2 months, I think my approach here is fundamentally flawed. The implementation here does technically work, but it relies on the assumption that a locked Version row means it is being published. This may be a safe assumption for now, but if we need to lock Versions for other purposes in the future this will no longer work.

I believe the ideal solution would be to introduce a PUBLISHING status to the Version.status field + add a check to the start of the publish endpoint that returns a 429 if version.status == PUBLISHING. However, this introduces a potential race condition if another request occurs after checking version.status but before setting it to PUBLISHING. So, we need to lock the Version using select_for_update. This in turn creates another problem, which is if a version takes a long time to be published, the second request thread will be stuck waiting to acquire the lock.

Taking a step back, I think publishing would be more appropriate as a celery task. iirc, this issue was motivated by the fact that dandisets with tens of thousands of assets can take a few seconds (or more) to publish. That reason alone is enough evidence that this should be a celery task, I think. Plus, the concurrency logic becomes a lot simpler than what I have currently-

@transaction.atomic
@action(detail=True, methods=['POST'])
def publish(self, request, **kwargs):
  version: Version = self.get_queryset().select_for_update().filter(**kwargs).first()
  if version.status == "PUBLISHING":
    return 429
  elif version.version != 'draft':
    return 400
  # other checks.....
  .....
  version.status = 'PUBLISHING'
  version.save()
  publish_task.delay(version.pk) # version.status would be set to PUBLISHED at the end of this task
  return 200

@mvandenburgh mvandenburgh marked this pull request as draft June 8, 2022 18:42
@mvandenburgh mvandenburgh removed the request for review from waxlamp June 8, 2022 18:42
@waxlamp
Copy link
Member

waxlamp commented Jun 9, 2022

I think this is a good design overall. Some thoughts:

  1. I think 409 Conflict might be a better error code than 429 Too Many Requests. The latter is meant as a rate limiting condition, whereas this is straight up a disallowed action based on the state of the resource.
  2. One final edge case to consider: after the Version has moved into the PUBLISHING state, what do we do if the publishing task crashes or otherwise doesn't complete? One idea is to give the user a way to cancel the publish (thus manually returning the state to PUBLISHED and letting them try again).
  3. I think you've solved this problem through design, but if "the second request thread [is] stuck waiting to acquire the lock", I think you could do a non-blocking check instead of a blocking one. Essentially, after a short timeout of a few seconds, the second thread could receive a 408 Request Timeout indicating that it should try again later.

After coming back to this with fresh eyes after 2 months, I think my approach here is fundamentally flawed. The implementation here does technically work, but it relies on the assumption that a locked Version row means it is being published. This may be a safe assumption for now, but if we need to lock Versions for other purposes in the future this will no longer work.

I believe the ideal solution would be to introduce a PUBLISHING status to the Version.status field + add a check to the start of the publish endpoint that returns a 429 if version.status == PUBLISHING. However, this introduces a potential race condition if another request occurs after checking version.status but before setting it to PUBLISHING. So, we need to lock the Version using select_for_update. This in turn creates another problem, which is if a version takes a long time to be published, the second request thread will be stuck waiting to acquire the lock.

Taking a step back, I think publishing would be more appropriate as a celery task. iirc, this issue was motivated by the fact that dandisets with tens of thousands of assets can take a few seconds (or more) to publish. That reason alone is enough evidence that this should be a celery task, I think. Plus, the concurrency logic becomes a lot simpler than what I have currently-

@transaction.atomic
@action(detail=True, methods=['POST'])
def publish(self, request, **kwargs):
  version: Version = self.get_queryset().select_for_update().filter(**kwargs).first()
  if version.status == "PUBLISHING":
    return 429
  elif version.version != 'draft':
    return 400
  # other checks.....
  .....
  version.status = 'PUBLISHING'
  version.save()
  publish_task.delay(version.pk) # version.status would be set to PUBLISHED at the end of this task
  return 200

@mvandenburgh
Copy link
Member Author

I think this is a good design overall. Some thoughts:

  1. I think 409 Conflict might be a better error code than 429 Too Many Requests. The latter is meant as a rate limiting condition, whereas this is straight up a disallowed action based on the state of the resource.

Whoops, I meant to put 423 Locked instead of 429 Too Many Requests. "Locked" seems more descriptive for this scenario to me, what do you think @waxlamp ?

  1. One final edge case to consider: after the Version has moved into the PUBLISHING state, what do we do if the publishing task crashes or otherwise doesn't complete? One idea is to give the user a way to cancel the publish (thus manually returning the state to PUBLISHED and letting them try again).

I like that idea as well 👍

  1. I think you've solved this problem through design, but if "the second request thread [is] stuck waiting to acquire the lock", I think you could do a non-blocking check instead of a blocking one. Essentially, after a short timeout of a few seconds, the second thread could receive a 408 Request Timeout indicating that it should try again later.

Right, since publishing would now be done as a task, this is no longer a concern. A request would only hold the lock while checking the status field of the Version, after which it would be immediately released.

@waxlamp
Copy link
Member

waxlamp commented Jun 13, 2022

I think this is a good design overall. Some thoughts:

  1. I think 409 Conflict might be a better error code than 429 Too Many Requests. The latter is meant as a rate limiting condition, whereas this is straight up a disallowed action based on the state of the resource.

Whoops, I meant to put 423 Locked instead of 429 Too Many Requests. "Locked" seems more descriptive for this scenario to me, what do you think @waxlamp ?

I think that code would be fine. It's probably more suitable than 409, but I don't think it matters much which 4xx code we pick for this.

@mvandenburgh
Copy link
Member Author

One idea is to give the user a way to cancel the publish (thus manually returning the state to PUBLISHED and letting them try again).

UI-wise, I think using the cancel button introduced in #1122 would be perfect here.

@mvandenburgh mvandenburgh force-pushed the prevent-double-publishing branch 2 times, most recently from 6946c19 to 3bb4f88 Compare June 21, 2022 20:19
@mvandenburgh mvandenburgh changed the base branch from master to publish-dialog June 21, 2022 20:19
@mvandenburgh mvandenburgh force-pushed the prevent-double-publishing branch 2 times, most recently from c032eaf to 0a04f30 Compare June 21, 2022 20:25
Base automatically changed from publish-dialog to master July 22, 2022 20:01
@waxlamp
Copy link
Member

waxlamp commented Aug 3, 2022

@mvandenburgh, status update on this PR? AFAIK, the design discussions we've had here are still valid. Confirm?

@mvandenburgh
Copy link
Member Author

@mvandenburgh, status update on this PR? AFAIK, the design discussions we've had here are still valid. Confirm?

Yes, everything here is still valid, I just have to implement the design. This PR sort of fell off my radar in favor of higher priority bugs, but those seem to be mostly resolved so I'll rebase this and start working on it again

Comment on lines +204 to +219
AssetVersions.objects.bulk_create(
[
AssetVersions(asset_id=asset['id'], version_id=new_version.id)
for asset in already_published_assets.values('id')
]
)

# Publish any draft assets
draft_assets = old_version.assets.filter(published=False).all()
for draft_asset in draft_assets:
draft_asset.publish()
Asset.objects.bulk_update(draft_assets, ['metadata', 'published'])

AssetVersions.objects.bulk_create(
[AssetVersions(asset_id=asset.id, version_id=new_version.id) for asset in draft_assets]
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these loops that iterate over every asset in a dandiset are probably going to lead to memory issues when publishing dandisets with large numbers of assets. These are copied 1-1 from the original publishing endpoint code, so I'm not going to attempt to fix them here; I'll do some memory profiling and make a follow up PR once I come up with a solution to this

old_version: Version = Version.objects.get(id=version_id)
new_version: Version = old_version.publish_version

new_version.doi = doi.create_doi(new_version)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unrelated, but could you make a follow on PR moving this into on_commit? This is a bug waiting to happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain further about the bug risk?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@waxlamp as it is now, if something goes wrong and the transaction gets rolled back, the DOI is still created.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah makes sense, thanks!

@mvandenburgh mvandenburgh force-pushed the prevent-double-publishing branch 3 times, most recently from 25f8c3c to 0272d91 Compare September 26, 2022 13:51
@mvandenburgh mvandenburgh force-pushed the convert-publish-component-script-setup branch from acb3aff to df664b3 Compare September 26, 2022 13:51
Base automatically changed from convert-publish-component-script-setup to master September 26, 2022 14:45
@mvandenburgh mvandenburgh added patch Increment the patch version when merged release Create a release when this pr is merged labels Sep 26, 2022
@mvandenburgh
Copy link
Member Author

mvandenburgh commented Sep 28, 2022

@jwodder @yarikoptic the CLI tests are failing here. It appears to be happening because of the changes to the semantics of the publishing process in this PR, so we'll need some updates on the CLI side to unblock this.

Currently, the publishing process is like this -

  • User sends a POST request to the /publish endpoint
  • The server receives the request, performs the publish operation, sets the status of the draft version to PUBLISHED, and responds with a 200 OK and info about the newly published version.

This PR updates it to do the following -

  • User sends a POST request to the /publish endpoint
  • The server receives the request, sets the status of the draft version to PUBLISHING, queues up an async worker task to publish the dandiset, and responds with a 202 ACCEPTED.
  • Asynchronously, a worker picks up the publishing task off the queue, performs the publish, and sets the status of the draft version to PUBLISHED.

To tell when a version has finished publishing, you can poll the /api/dandisets/<identifier>/versions/draft/info/ endpoint, and when the status field in that response is "PUBLISHED", you can call /api/dandisets/<identifier>/versions to get the newly published version (they are sorted by creation time so the first version with a status of "PUBLISHED" in that response is guaranteed to be the most recently published version) EDIT(10/5): this statement isn't correct. The actual semantics are, "if the draft version has a status of Published, then the first version returned in the response from /api/dandisets/<identifier>/versions is guaranteed to be the most recently published version.

Can you make the necessary changes to the CLI to account for this? Please let me know if anything is unclear or if you need any more information.

- After clicking the publish button, display a loading indicator and poll the server to determine when the publishing is complete
- Once the publishing is complete, display a snackbar notification with a link to the newly published version
- If the publish button is clicked while a publish is ongoing, display an error message and begin polling for the publish to complete. Once completed, display a link to it.
- Add new test for publish task
- Modify publish rest test to only test for PUBLISHING status, not entire publish workflow
- Consolidate two publish REST endpoint tests into one
Add specific error messages instead of returing "Dandiset metadata or asset metadata is not valid" everytime a publish can't occur
@mvandenburgh mvandenburgh force-pushed the prevent-double-publishing branch 2 times, most recently from 34318b8 to e0d8267 Compare October 6, 2022 16:51
@mvandenburgh mvandenburgh merged commit 71961c8 into master Oct 6, 2022
@mvandenburgh mvandenburgh deleted the prevent-double-publishing branch October 6, 2022 17:10
@dandibot
Copy link
Member

dandibot commented Oct 6, 2022

🚀 PR was released in v0.2.53 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
patch Increment the patch version when merged release Create a release when this pr is merged released This issue/pull request has been released.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prevent double publishing of dandisets Display feedback during publish process
7 participants