Fixes for new partition management #717
Conversation
Thanks for contributing this fix!
```diff
@@ -115,18 +115,21 @@ public Table GetTableReference(string tableName)
     public Task<T> MakeTableStorageRequest<T>(Func<OperationContext, CancellationToken, Task<T>> storageRequest, string operationName, string? clientRequestId = null) =>
         this.MakeStorageRequest<T>(storageRequest, TableAccountName, operationName, clientRequestId);

-    public Task MakeBlobStorageRequest(Func<OperationContext, CancellationToken, Task> storageRequest, string operationName, string? clientRequestId = null) =>
-        this.MakeStorageRequest(storageRequest, BlobAccountName, operationName, clientRequestId);
+    public Task MakeBlobStorageRequest(Func<OperationContext, CancellationToken, Task> storageRequest, string operationName, string? clientRequestId = null, bool throttle = true) =>
```
Non-blocking suggestion: I think it's more common in .NET API design to name a parameter like this `force`, with a default value of `false`.
Can this be completed and released, so that I can test it?
Ideally, you could help us test these changes before they go into the next release; it would give us confidence that we are not breaking anything. Is this possible? @cgillum, what do you think? I also have two more modifications to the partition manager that I wanted to add and test, but perhaps I should use a separate PR.
I would definitely prefer having some validation before doing a release. @saguiitay, are you able to test a private version of this deployed to myget.org?
I have added a commit that changes the renew loop to make its timing more consistent.
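The general technique for a steadier renewal cadence can be sketched as follows (this is an illustrative sketch under my own assumptions, not the commit's actual code): sleep until the next slot of a fixed schedule rather than delaying a constant interval after each renewal, so the time each renewal takes does not accumulate as drift.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Renew on a fixed schedule: each sleep is computed from the loop's
// start time, so a slow renewal shortens the following sleep instead
// of pushing every later renewal out by its own duration.
static async Task RenewLoopAsync(Func<Task> renew, TimeSpan period, CancellationToken ct)
{
    var clock = Stopwatch.StartNew();
    var nextDue = period;
    while (!ct.IsCancellationRequested)
    {
        await renew();
        var sleep = nextDue - clock.Elapsed; // remaining time in this slot
        nextDue += period;                   // fixed schedule, independent of renew duration
        if (sleep > TimeSpan.Zero)
            await Task.Delay(sleep, ct);     // throws on cancellation; fine for a sketch
    }
}
```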
Just a couple comments on this latest iteration.
```csharp
    leaseTasks.Add(RenewShutdownLease(shutdownLease));
}

Task.WhenAll(leaseTasks).Wait();
```
If there's an exception, the `.Wait()` method throws an `AggregateException` instead of throwing the original exception. I worry that this might cause unexpected regressions or make debugging more challenging. If you want to preserve the previous exception behavior, you should instead use `.GetAwaiter().GetResult()`.
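The difference can be demonstrated with a small standalone example (a sketch to illustrate the point above, not the repository's code):

```csharp
using System;
using System.Threading.Tasks;

// A task that has already faulted with a specific exception type.
var faulted = Task.FromException(new InvalidOperationException("boom"));

try
{
    faulted.Wait();
}
catch (AggregateException ex)
{
    // .Wait() wraps the failure; callers catching the original
    // exception type no longer match.
    Console.WriteLine(ex.InnerException!.GetType().Name); // InvalidOperationException
}

try
{
    faulted.GetAwaiter().GetResult();
}
catch (InvalidOperationException)
{
    // .GetAwaiter().GetResult() rethrows the original exception
    // unwrapped, preserving the previous catch behavior.
    Console.WriteLine("original exception type preserved");
}
```

One caveat worth noting: for `Task.WhenAll`, `.GetAwaiter().GetResult()` rethrows only the first of multiple failures, whereas the `AggregateException` from `.Wait()` carries all of them.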
Good catch. Indeed, we want to preserve the previous exceptions.
```diff
 {
-    if (!renewResult.Result)
+    if (!await this.RenewLeaseAsync(lease))
```
In your comments you specifically mentioned improving the reliability of resuming from delays by using a dedicated thread. However, we will temporarily jump off the dedicated thread and use worker pool threads here and everywhere else that we fall back to async/await. Is that going to be acceptable in terms of the goals of this latest reliability iteration?
I understand that we may still depend on the worker thread pool here, which is not ideal. But I was reluctant to make the code more complicated, given that I am not completely sure in the first place that thread pool starvation is really the reason why `Task.Delay` has such a high variance.
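The behavior being discussed can be observed directly: even when a loop starts on a dedicated thread, the continuation after an `await` typically resumes on a thread-pool thread when there is no `SynchronizationContext`, which is the usual case in library code. A minimal demonstration (my own sketch, not the PR's code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Start async work on a dedicated (non-pool) thread, then observe
// which thread the post-await continuation runs on.
var thread = new Thread(() => RunAsync().GetAwaiter().GetResult());
thread.Start();
thread.Join();

static async Task RunAsync()
{
    // On the dedicated thread here.
    Console.WriteLine($"before await: IsThreadPoolThread={Thread.CurrentThread.IsThreadPoolThread}"); // False
    await Task.Delay(10);
    // With no SynchronizationContext, the continuation is scheduled
    // on the thread pool, not back on the dedicated thread.
    Console.WriteLine($"after await: IsThreadPoolThread={Thread.CurrentThread.IsThreadPoolThread}"); // usually True
}
```

So a dedicated thread only helps up to the first `await`; keeping work off the pool entirely would require avoiding `await` (or a custom scheduler), which is the added complexity being weighed here.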
That's fine. I don't think we need to block on this, and I agree that it would increase complexity, so long as you still feel comfortable with the value of this approach in general.
This may not be the final iteration of tuning the lease renewal loop. But we have to start somewhere and at the moment the code looks reasonably simple to me.
I would suggest doing the refactoring in a separate PR; it looks like an unnecessary risk. Please consider first fixing the bug, and then improving the code (for readability/stability/performance).
This reverts commit c86d432.
Given the unfavorable risk/benefit ratio, I have reverted the last commit (the refactoring of the renewal loop) as you suggested, @saguiitay. I will pursue this in a separate PR at a later point in time.
@sebastianburckhardt I wasn't able to fully test the change, but I did some testing. I manually deployed the DLLs with the change to a 3-node cluster:
So it does look like things are working, but it's not 100% test coverage. @cgillum, if you can release an updated version, I'll run it through our pipelines, which usually encounter the issue after 4-5 runs (it usually takes less than 2 days to appear).
Thanks @saguiitay for helping with this validation! This gives me pretty good confidence that we can make a release to nuget.org with this fix. I'll try to do this in the next couple of days.
This fixes a bug in the partition management logic for ownership leases: all available ownership leases (for partitions that are expired and for which intent leases are owned) must be acquired, regardless of how many nodes there are and how many partitions have already been acquired. The balancing is already taken care of by the intent leases.
Also, this PR adds an exclusion so that lease renewals are not subject to request throttling, because throttling lease renewals can be counterproductive in heavily loaded situations.
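The shape of that exclusion can be sketched as follows (hypothetical names and an illustrative concurrency limit; the real implementation lives in the storage request layer):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Ordinary storage requests pass through a concurrency gate, but lease
// renewals bypass it: a heavily loaded node must not starve its own
// renewals behind queued work, or it loses its partitions.
class RequestGate
{
    readonly SemaphoreSlim gate = new SemaphoreSlim(100); // illustrative limit

    public async Task<T> SendAsync<T>(Func<Task<T>> request, bool throttle = true)
    {
        if (!throttle)
        {
            return await request(); // lease renewals take this path
        }

        await this.gate.WaitAsync();
        try
        {
            return await request();
        }
        finally
        {
            this.gate.Release();
        }
    }
}
```

This matches the `bool throttle = true` default in the diff above: existing callers keep the throttled behavior, and only the renewal path opts out explicitly.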