Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: race condition between heartbeat and releasing lock #35

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

bfraterman-tkhsecurity
Copy link

Releasing of a lock could fail (putItem would fail the condition) when
there was a heartbeat running at the same time.

When the 2 calls to dynamoDB happen synchronously, the GUID used by the heartbeat would be put into dynamodb, while the previous GUID would be used to release the lock. Releasing of the lock would therefore fail. I've only tested this for FailOpen locks, where there were longer waits because of this.
Perhaps with FailClosed locks the problem would be even worse.

The release of a lock now waits until the heartbeat is finished, using a Promise.

Since it relies on very specific timing it's hard to write a unit test for this.

Releasing of a lock could fail (putItem would fail the condition) when
there was a heartbeat running at the same time.

The release of a lock now waits until the heartbeat is finished.
@tristanls
Copy link
Owner

tristanls commented Mar 12, 2022

Thank you @bfraterman-tkhsecurity . What you described does look like a valid race condition for FailOpen lock, which can cause longer wait times (maximum leaseDurationMs) for released locks.

There is no problem with FailClosed lock because there are no heartbeats on those.

I think what we need is a semaphore. Now that you've pointed out the race condition, it is insufficient for release to wait for heartbeat, release and heartbeat can only happen one at a time.

Failure case interleaving with release only waiting for heartbeat to start:

  • heartbeat_start
  • release_start (wait)
  • heartbeat_commit
  • release_start (resume)
  • heartbeat_start
  • heartbeat_commit
  • release_commit (fail)

@bfraterman-tkhsecurity
Copy link
Author

bfraterman-tkhsecurity commented Mar 16, 2022

Basically, the self._heartbeatPromise is used as semaphore. If it exists, releasing the lock will wait until the heartbeat is finished. The other way around is safeguarded by self._released.

The failure scenario that you describe won't happen in this PR. Node is single-threaded, so no 2 pieces of code will run simultaneous. But whenever a thenable is used, it will be scheduled in node when doing something asynchronously, giving the possibility to yield to another function or thenable that is scheduled.

So the problem in the original code was:

  • heartbeat_start. Runs until self._config.dynamodb.put with a new GUID is called.
  • release_start. self._guid still contains the old GUID. Runs until self._config.dynamodb.put is called (with the old GUID).
  • heartbeat_finished. The put is done and the rest of the heartbeat code (in the put callback) runs. self._guid is set with the new GUID.
  • release_finish (fail). The put of the release method fails because of the conditional check on GUID mismatch in dynamoDB.

In the PR this is fixed by the following:

  • heartbeat_start. self._heartbeatPromise is set. All exit paths out of refreshLock resolve the Promise. self._released is checked so we're certain release_start didn't run yet. Runs until self._config.dynamodb.put is called.
  • release_start. Sets self._released. Checks if self._heartbeatPromise is set. If it is, it will wait for the promise to resolve.
  • heartbeat_finished. The put is done and the rest of the heartbeat code (in the put callback) runs. self._guid is set with the new GUID. The self._heartbeatPromise is resolved.
  • release_continue. The self._guid contains the new GUID. Runs until self._config.dynamodb.put is called (with the new GUID).
  • release_finish. The put of the release succeeds.

But maybe the code would become better readable when using something like https://www.npmjs.com/package/async-mutex. What do you think?

And do you want me to refactor the code to be able to test this in a unit test? This will grow the PR considerably I think, since all heartbeat and releasing code will have to be split up and moved around.

@bfraterman-tkhsecurity
Copy link
Author

Now that AWS is deprecating the V2 SDK, we're looking at migrating to the V3 SDK.

I wonder if/how this PR can be merged? Otherwise we'll have to keep using our fork for our project.

@tristanls
Copy link
Owner

Hi @bfraterman-tkhsecurity ,

I got distracted and haven't come back to this repo since my last comment.

With a migration to v3, it probably makes sense to use your fork. I don't intend to spend more time on this module in the near future, so it may be worth publishing your v3 migrated fork for the community.

Thank you again for the PR.

Cheers

@gjhommersom gjhommersom mentioned this pull request Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants