
Fixes for new partition management #717

Merged
merged 6 commits into from
Apr 28, 2022

Conversation

sebastianburckhardt
Collaborator

This fixes a bug in the partition management logic for ownership leases: all available ownership leases (for partitions whose ownership lease has expired and whose intent lease is owned) must be acquired, regardless of how many nodes there are or how many partitions have already been acquired. The balancing is already taken care of by the intent leases.

Also, this PR adds an exclusion so that lease renewals are not subject to request throttling, since throttling lease renewals can be counterproductive in heavily loaded situations.
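In other words, the acquisition decision depends only on lease state, not on any per-node quota. A minimal sketch of that rule (the names below are hypothetical illustrations for this description, not the actual partition manager API):

```csharp
using System;
using System.Collections.Generic;

public static class OwnershipAcquisitionSketch
{
    // Returns every partition whose ownership lease should be acquired:
    // the intent lease is owned and the current ownership lease has expired.
    // Note the absence of any cap based on node count or partitions already
    // held; balancing is handled entirely by the intent leases.
    public static List<int> SelectLeasesToAcquire(
        IEnumerable<int> allPartitions,
        ISet<int> ownedIntentLeases,
        IReadOnlyDictionary<int, DateTime> ownershipLeaseExpiry,
        DateTime now)
    {
        var toAcquire = new List<int>();
        foreach (int partition in allPartitions)
        {
            if (ownedIntentLeases.Contains(partition)
                && ownershipLeaseExpiry[partition] <= now)
            {
                toAcquire.Add(partition);
            }
        }
        return toAcquire;
    }
}
```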

Member

@cgillum cgillum left a comment


Thanks for contributing this fix!

@@ -115,18 +115,21 @@ public Table GetTableReference(string tableName)
public Task<T> MakeTableStorageRequest<T>(Func<OperationContext, CancellationToken, Task<T>> storageRequest, string operationName, string? clientRequestId = null) =>
this.MakeStorageRequest<T>(storageRequest, TableAccountName, operationName, clientRequestId);

public Task MakeBlobStorageRequest(Func<OperationContext, CancellationToken, Task> storageRequest, string operationName, string? clientRequestId = null) =>
this.MakeStorageRequest(storageRequest, BlobAccountName, operationName, clientRequestId);
public Task MakeBlobStorageRequest(Func<OperationContext, CancellationToken, Task> storageRequest, string operationName, string? clientRequestId = null, bool throttle = true) =>
Member


Non-blocking suggestion: I think it's more common in .NET API design to name a variable like this as force with a default value of false.

@saguiitay
Contributor

Can this be completed and released, so that I can test it?

@sebastianburckhardt
Collaborator Author

sebastianburckhardt commented Apr 27, 2022

Ideally, you could help us test these changes before they go into the next release; it would give us confidence that we are not breaking anything. Is this possible? @cgillum, what do you think?

I also have two more modifications to the partition manager I wanted to add and test. But perhaps I should use a separate PR.

@cgillum
Member

cgillum commented Apr 27, 2022

I would definitely prefer having some validation before doing a release. @saguiitay are you able to test a private version of this deployed to myget.org?

@sebastianburckhardt
Collaborator Author

I have added a commit that changes the renew loop to make its timing more consistent:

  • now uses a dedicated thread instead of an async task, to avoid observed issues with high variability of await Task.Delay(x)
  • now uses a stopwatch to time successive iterations of the renew loop
  • has a trace statement so we can observe timing anomalies, such as excessive delays between iterations
  • uses async local tasks and a single task collection, instead of Task.ContinueWith and multiple collections, to make the code easier to read
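A rough sketch of what such a loop could look like (illustrative only; the interval, the trace output, and the renewAllLeases placeholder are hypothetical stand-ins, not the actual code in this commit):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class RenewLoopSketch
{
    // Hypothetical renewal cadence; the real interval is configuration-dependent.
    public static readonly TimeSpan RenewInterval = TimeSpan.FromSeconds(10);

    public static Thread StartRenewLoop(Action renewAllLeases, CancellationToken token)
    {
        var thread = new Thread(() =>
        {
            // A stopwatch times successive iterations, so drift does not accumulate.
            var stopwatch = Stopwatch.StartNew();
            var nextDue = TimeSpan.Zero;

            while (!token.IsCancellationRequested)
            {
                TimeSpan late = stopwatch.Elapsed - nextDue;
                if (late > TimeSpan.FromSeconds(1))
                {
                    // Trace statement to surface timing anomalies such as
                    // excessive delays between iterations.
                    Console.WriteLine($"renew loop running {late.TotalSeconds:F1}s behind schedule");
                }

                renewAllLeases();

                nextDue += RenewInterval;
                TimeSpan sleep = nextDue - stopwatch.Elapsed;
                if (sleep > TimeSpan.Zero)
                {
                    // A plain sleep on the dedicated thread avoids the observed
                    // variability of await Task.Delay under heavy load.
                    Thread.Sleep(sleep);
                }
            }
        });
        thread.IsBackground = true;
        thread.Start();
        return thread;
    }
}
```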

Member

@cgillum cgillum left a comment


Just a couple comments on this latest iteration.

leaseTasks.Add(RenewShutdownLease(shutdownLease));
}

Task.WhenAll(leaseTasks).Wait();
Member


If there's an exception, the .Wait() method throws an AggregateException instead of the original exception. I worry that this might cause unexpected regressions or make debugging more challenging. If you want to preserve the previous exception behavior, you should use .GetAwaiter().GetResult() instead.
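For reference, the difference in unwrapping behavior can be seen in a small self-contained sketch (the Fail task below is a hypothetical stand-in for a faulting lease task; it is not code from this PR):

```csharp
using System;
using System.Threading.Tasks;

public static class WaitVsGetAwaiter
{
    // A task that always faults, standing in for a failing lease renewal.
    public static async Task Fail()
    {
        await Task.Yield();
        throw new InvalidOperationException("lease renewal failed");
    }

    public static void Main()
    {
        try
        {
            // Wait() wraps any failure in an AggregateException.
            Task.WhenAll(Fail()).Wait();
        }
        catch (AggregateException ex)
        {
            Console.WriteLine($"Wait() threw {ex.GetType().Name} wrapping {ex.InnerException?.GetType().Name}");
        }

        try
        {
            // GetAwaiter().GetResult() rethrows the original exception,
            // preserving the pre-change behavior.
            Task.WhenAll(Fail()).GetAwaiter().GetResult();
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine($"GetAwaiter().GetResult() threw {ex.GetType().Name}");
        }
    }
}
```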

Collaborator Author


Good catch. Indeed, we want to preserve the previous exceptions.

{
if (!renewResult.Result)
if (!await this.RenewLeaseAsync(lease))
Member


In your comments you specifically mentioned improving the reliability of resuming from delays by using a dedicated thread. However, we will temporarily jump off the dedicated thread and use worker pool threads here and everywhere else that we fall back to async/await. Is that going to be acceptable in terms of the goals of this latest reliability iteration?

Collaborator Author


I understand that we may still depend on the worker thread pool here, which is not ideal. But I was reluctant to make the code more complicated, given that I am not completely sure that thread pool starvation is really the reason why Task.Delay has such a high variance in the first place.

Member


That's fine. I don't think we need to block on this, and I agree that addressing it would increase complexity, so long as you still feel comfortable with the value of this approach in general.

Collaborator Author


This may not be the final iteration of tuning the lease renewal loop, but we have to start somewhere, and at the moment the code looks reasonably simple to me.

@saguiitay
Contributor

saguiitay commented Apr 28, 2022

I would suggest doing the refactoring in a separate PR; it looks like an unnecessary risk here. Please consider first fixing the bug, and then improving the code (for readability/stability/performance).
That being said, I'll see if I can run it a few times to test...

@sebastianburckhardt
Collaborator Author

Given the unfavorable risk/benefit ratio, I have reverted the last commit (the refactoring of the renewal loop) as you suggested, @saguiitay.

I will pursue this in a separate PR at a later point in time.

@sebastianburckhardt sebastianburckhardt merged commit 516f999 into main Apr 28, 2022
@saguiitay
Contributor

@sebastianburckhardt I wasn't able to fully test the change, but I did some testing. I manually deployed the DLLs with the change to a 3-node cluster:

  • I then ran MANY orchestrations on it, which all worked as expected
  • I then stopped one of the nodes, "forcing" the 2 remaining nodes to take over its leases
  • I then ran MANY orchestrations on it, which all worked as expected
  • I then brought the stopped node back to life, "forcing" it to steal leases back from the other nodes
  • I then ran MANY orchestrations on it, which all worked as expected

So it does look like things are working, but it's not 100% test coverage.

@cgillum if you can release an updated version, I'll run it through our pipelines, which usually encounter the issue after 4-5 runs (it usually takes less than 2 days to appear).

@davidmrdavid davidmrdavid deleted the sebastian/fix-partition-management branch May 3, 2022 17:46
@cgillum
Member

cgillum commented May 4, 2022

Thanks @saguiitay for helping with this validation! This gives me pretty good confidence that we can make a release to nuget.org with this fix. I'll try to do this in the next couple days.
