Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition with aws_s3_endpoint acquire/release #209

Merged
merged 3 commits into from
Sep 14, 2022
Merged

Conversation

graebm
Copy link
Contributor

@graebm graebm commented Sep 9, 2022

Issue:
#202

There are race conditions around the removal of endpoints from the client's hashtable

Description of changes:

  1. Always hold the client's lock while acquiring and releasing an endpoint's refcount.
    • The old bugs were due to an unholy mix of atomic operations and locks. But when you have two related bits of data touched from multiple threads (the endpoint, and the hashtable it's stored in) atomics don't cut it anymore. You must always be holding the lock when touching either bit of data.
  2. Add a vtable for aws_s3_endpoint
    • This lets us separate the "real" implementation from the "mocked testing" implementation. The existing code had a lot of "if owned by client" (aka "real world") branches and made the lifetime/ownership hard to follow. Now the "real" implementation is only the "real" implementation.

Rambling:
This isn't our first time trying to fix these race conditions. See:

We also considered moving the endpoint hashtable out of synced_data and onto thread_data, and changing aws_s3_acquire()/release() calls so they become scheduled tasks under the hood. But this would have been a significant refactor. Once I'd spent 2 days staring at this code I understood it well enough to (hopefully) fix it with a small change.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@graebm graebm marked this pull request as ready for review September 9, 2022 23:31
@graebm graebm mentioned this pull request Sep 9, 2022

/* not calling aws_s3_endpoint_acquire() because that would grab the client lock */
AWS_ASSERT(endpoint->client_synced_data.ref_count > 0);
++endpoint->client_synced_data.ref_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to contradict pr description that we should always hold lock on refcount changes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change the comment to be more clear

we already HAVE the lock here in this code, is why we're not calling that other function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, I did something else
took away comment and made the code self-documenting
we DO call the function now, but pass in an arg to say that we're holding the lock

source/s3_endpoint.c Outdated Show resolved Hide resolved
Copy link
Contributor

@TingDaoK TingDaoK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix & ship.

struct aws_ref_count ref_count;
struct {
/* This is NOT an atomic ref-count.
* The endpoint lives in hashtable: `aws_s3_client.threaded_data.endpoints`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trivial: aws_s3_client.threaded_data.endpoints is it out of date?

@graebm graebm enabled auto-merge (squash) September 14, 2022 20:13
@graebm graebm merged commit 3530a06 into main Sep 14, 2022
@graebm graebm deleted the endpoints3 branch September 14, 2022 20:22
graebm added a commit to awslabs/aws-crt-java that referenced this pull request Sep 14, 2022
see: awslabs/aws-c-s3#209

Update submodules to latest:
aws-c-io: v0.13.2 -> v0.13.4
aws-c-s3: v0.1.46 -> v0.1.48
aws-lc: v1.1.0 -> v1.2.0
s2n: v1.3.20 -> v1.3.21
graebm added a commit to awslabs/aws-crt-java that referenced this pull request Sep 14, 2022
see: awslabs/aws-c-s3#209

Update submodules to latest:
aws-c-io: v0.13.2 -> v0.13.4
aws-c-s3: v0.1.46 -> v0.1.48
aws-lc: v1.1.0 -> v1.2.0
s2n: v1.3.20 -> v1.3.21
graebm added a commit to awslabs/aws-crt-cpp that referenced this pull request Sep 15, 2022
*Issue #:* awslabs/aws-c-s3#202

*Description of changes:* update aws-c-s3 submodule, containing fix from PR awslabs/aws-c-s3#209
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants