
Implement 'batch mode' for persisting allocations on the client. #9093

Merged
merged 2 commits into hashicorp:master on Oct 20, 2020

Conversation

@ashtuchkin (Contributor)

Fixes #9047, see problem details there.

As a solution, we use BoltDB's 'Batch' mode that opportunistically combines multiple parallel writes into a small number of transactions. See https://github.com/boltdb/bolt#batch-read-write-transactions for more information.
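
For readers unfamiliar with the API, here is a minimal standalone sketch of Update versus Batch against bolt directly (Nomad's change goes through the boltdd wrapper instead; the file path, bucket, and key below are made up for illustration):

package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	// Open (or create) a bolt database file; the path is illustrative only.
	db, err := bolt.Open("example-state.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// db.Update runs one read-write transaction per call: every call pays
	// the full commit cost, including boltdb's fsync calls.
	//
	// db.Batch has the same signature, but when several goroutines call it
	// concurrently, bolt opportunistically coalesces their functions into a
	// single transaction, so many writers can share one commit (and one
	// round of fsyncs).
	err = db.Batch(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("allocations"))
		if err != nil {
			return err
		}
		return b.Put([]byte("alloc-1"), []byte("serialized allocation state"))
	})
	if err != nil {
		log.Fatal(err)
	}
}

One caveat from the bolt documentation: the function passed to Batch may be invoked more than once if another writer in the same batch fails, so it needs to be idempotent.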

@hashicorp-cla commented Oct 14, 2020

CLA assistant check
All committers have signed the CLA.

@tgross (Member) left a comment

Hi @ashtuchkin!

The concept of this change looks good. From an implementation perspective we generally try to avoid having boolean flags, especially in a case like this where the flag ends up having to be passed down through a few layers of method calls before it is actually used. I've marked the specific changes I'd recommend in the review.

(As an aside, I was wondering why we don't see this kind of problem in the server and of course it turns out that raft is batching changes. So the concept here is certainly solid and has precedent in Nomad.)

I'd also like to flag this change for @schmichael and @nickethier because I know they're working on an experiment that ran into client performance issues, so it might be of interest to them.

@@ -877,36 +877,37 @@ func (ar *allocRunner) destroyImpl() {
 	ar.destroyedLock.Unlock()
 }
 
-func (ar *allocRunner) PersistState() error {
+func (ar *allocRunner) PersistState(inBatch bool) error {
Member:

There's only one caller for PersistState, so this argument is always true. Let's remove this flag.

-func (s *BoltStateDB) PutAllocation(alloc *structs.Allocation) error {
-	return s.db.Update(func(tx *boltdd.Tx) error {
+func (s *BoltStateDB) PutAllocation(alloc *structs.Allocation, inBatch bool) error {
+	return s.updateOrBatch(inBatch, func(tx *boltdd.Tx) error {
Member:

Rather than having the boolean behavior flag and this function indirection, let's make the signature really super clear by splitting out a PutAllocation and a PutAllocationBatch, both of which call into a putAllocationImpl with the appropriate boltdb function.

Same for PutNetworkStatus and DeleteAllocationBucket

Contributor Author (@ashtuchkin):

That makes sense. The naming here is tricky; I'd definitely expect PutAllocationBatch to have a slice of allocations as the argument, not a single allocation.

That makes me think - if we add a new method to the interface anyway, maybe we can make it an actual "batch" variant that explicitly saves multiple allocations, rather than relying on the hidden state of boltdb's Batch method? Wdyt?
This would require changing PersistState to PersistStateBatch too, but that shouldn't be hard to do.

Member:

maybe we can make it an actual "batch" variant that explicitly saves multiple allocations

I'm open to the idea of this, but given how we run PersistState by iterating over allocation runners (which requires setting and releasing locks on those runners), I suspect that's not going to be very nice for the callers. Give it a go if you'd like though!

Contributor Author (@ashtuchkin):

Yeah, handling destroyedLock and stateLock might be a problem with real batching.

I've fleshed out the two-method approach you suggested and it doesn't look tidy enough for my taste, so I wanted to circle back for advice. Conceptually, that boolean flag is an adjustment in behavior (basically an opt-in performance optimization), not a new behavior, so it feels weird to have to duplicate more or less the same code in several places. It looks like this:

	// PutAllocation stores an allocation or returns an error if it could
	// not be stored.
	PutAllocation(*structs.Allocation) error

	// PutAllocationInBatch has exactly the same semantics as PutAllocation above,
	// with the addition that the write transactions can be shared with other concurrent
	// writes, increasing performance in batch scenarios.
	PutAllocationInBatch(*structs.Allocation) error

	// Get/Put NetworkStatus get and put the allocation's network
	// status. It may be nil.
	GetNetworkStatus(allocID string) (*structs.AllocNetworkStatus, error)
	PutNetworkStatus(allocID string, ns *structs.AllocNetworkStatus) error

	// PutNetworkStatusInBatch has exactly the same semantics as PutNetworkStatus above,
	// with the addition that the write transactions can be shared with other concurrent
	// writes, increasing performance in batch scenarios.
	PutNetworkStatusInBatch(allocID string, ns *structs.AllocNetworkStatus) error

So while I agree that using booleans to change behavior is a red flag, in this particular case it's arguably not changing semantics, just performance in particular scenarios.

I saw another approach to implementing such behavior adjustments using variadic options, something like this:

PutAllocation(*structs.Allocation, ...WriteOption) error

(see e.g. https://medium.com/soon-london/variadic-configuration-functions-in-go-8cef1c97ce99 for details)

That would avoid polluting cases where no such option is needed, while allowing these adjustments where needed. What do you think?
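
For concreteness, a minimal self-contained sketch of that variadic-option pattern follows; StateDB, WriteOption, and WithBatchMode are illustrative names, not the ones that ended up in Nomad:

package main

import "fmt"

// writeOptions collects opt-in behavior adjustments for a single write.
type writeOptions struct {
	batchMode bool
}

// WriteOption is a functional option that tweaks writeOptions.
type WriteOption func(*writeOptions)

// WithBatchMode asks the store to share its write transaction with other
// concurrent writers (boltdb's Batch mode) instead of committing on its own.
func WithBatchMode() WriteOption {
	return func(o *writeOptions) { o.batchMode = true }
}

// StateDB is a stand-in for the client state store.
type StateDB struct{}

// PutAllocation keeps its usual semantics; options only adjust performance.
func (s *StateDB) PutAllocation(allocID string, opts ...WriteOption) error {
	o := &writeOptions{}
	for _, opt := range opts {
		opt(o)
	}
	if o.batchMode {
		fmt.Println("writing", allocID, "via a shared batch transaction")
	} else {
		fmt.Println("writing", allocID, "via its own transaction")
	}
	return nil
}

func main() {
	db := &StateDB{}
	_ = db.PutAllocation("alloc-1")                  // default: one transaction per write
	_ = db.PutAllocation("alloc-2", WithBatchMode()) // opt in to transaction sharing
}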

Member:

That would avoid polluting cases where no such option is needed, while allowing these adjustments where needed. What do you think?

Oh, that's super nice. Go for it.

Contributor Author (@ashtuchkin):

Ok I think it's ready for review.

@ashtuchkin (Contributor Author)

Thanks for the review Tim! Will make the changes soon.

@schmichael (Member) left a comment

Looks great, thanks for submitting this @ashtuchkin. I'll let @tgross take care of reviewing and merging.

Have you confirmed this fixes your IO issue @ashtuchkin?

I'm pretty concerned that the de-duplication in boltdd isn't helping. It seems like the transactional overhead may cancel out the IO benefit of the de-duplication, in which case we may just be wasting CPU/memory doing it!

Temporary metrics or logging around de-duplication hits/misses may help tell whether de-duping is actually happening. Sadly, at a glance, it seems that if the transaction overhead does destroy the theoretical IO savings, it would take quite a bit of reworking to create the transaction only when something has actually changed.

In hindsight, doing the de-duplication at the StateDB level may have made more sense.

@ashtuchkin (Contributor Author)

Thanks Michael!

Have you confirmed this fixes your IO issue @ashtuchkin?

Yes, I've deployed this version on my cluster and IOPS do return to normal when the cluster becomes idle.

I'm pretty concerned the de-duplication in boltdd isn't helping. Seems like the transactional overhead may remove the IO benefit of the de-duplication in which case we may just be wasting CPU/memory doing it!

I traced the commit path in boltdb and it performs two unconditional fdatasync() calls per commit (unless NoSync=true); see Tx.write and Tx.writeMeta, called directly from Tx.Commit in https://github.com/boltdb/bolt/blob/master/tx.go. No writes happen if the transaction is discarded, though, so one option would be to discard empty transactions (I wonder if we can do it automatically at the boltdd level?).

@tgross (Member) left a comment

This is looking great @ashtuchkin! Thanks so much for your patience in getting this into shape.

I've left one question about the test assertion.

	// and MaxBatchSize parameters).
	if getTxID != nil {
		numTransactions := getTxID() - prevTxID
		require.Less(numTransactions, 10)
Member:

Was this "< 10 transactions" determined empirically? The default MaxBatchSize is 1000, which implies we can assert "at least 2" but I'm less certain about the "at most" we can assert. Can we use a timer to figure out how many 10ms periods have passed to make a stronger assertion?

I just want to make sure we're not introducing a potential test flake (this has been a pain point), especially for full unit test runs with parallelism on the VMs used for CI.

Contributor Author (@ashtuchkin):

Yep, flaky tests are a common pain point - I've wasted so much time on them too.

You're right, most of the time when running this test we get 2 transactions, each with 1000 writes. Time is usually not a limiting factor here – spawning goroutines is pretty fast and writes are all in-memory (the only I/O we have is when these 2 transactions are committed). I used 10 as a "safe" threshold that should hold even under load, but that's of course arbitrary.

I initially opted against adding a more specific "expected number of transactions" calculation here because that would add coupling to the specific algorithm that boltdb uses, in addition to adding complexity. Now that I'm thinking about it, the first concern is probably not relevant since boltdb is frozen, so ok, let's do it.
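
As a rough sketch of what that calculation might look like (not the assertion that was actually committed; numWrites is assumed to be the test's write count, and 1000 is bolt's default MaxBatchSize):

// expectedMinTransactions returns the smallest number of boltdb transactions
// that numWrites concurrent Batch() calls could be coalesced into, given the
// configured MaxBatchSize. Sketch only, using integer ceiling division:
// 2000 writes with MaxBatchSize=1000 need at least 2 transactions.
func expectedMinTransactions(numWrites, maxBatchSize int) int {
	return (numWrites + maxBatchSize - 1) / maxBatchSize
}

The test could then assert that numTransactions is at least expectedMinTransactions(numWrites, bolt.DefaultMaxBatchSize) while keeping a loose upper bound, since the 10ms MaxBatchDelay timer can flush partially filled batches early on a loaded CI machine.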

Contributor Author (@ashtuchkin):

Ok, I think it's ready for review. I've added a separate commit for this change so that it's easier to see what exactly changed.

Member:

LGTM

Fixes hashicorp#9047, see problem details there.

As a solution, we use BoltDB's 'Batch' mode that combines multiple
parallel writes into a small number of transactions. See
https://github.com/boltdb/bolt#batch-read-write-transactions for
more information.
@schmichael (Member)

I traced the commit path in boltdb and it performs two unconditional fdatasync() calls per commit (unless NoSync=true); see Tx.write and Tx.writeMeta, called directly from Tx.Commit in https://github.com/boltdb/bolt/blob/master/tx.go. No writes happen if the transaction is discarded, though, so one option would be to discard empty transactions (I wonder if we can do it automatically at the boltdd level?).

@ashtuchkin That makes complete sense, and yes! Boltdd can definitely do that tracking. It should be as easy as adding a dirty bool to boltdd.Tx and making every mutation method that isn't de-duped set tx.dirty = true; then boltdd.Tx can roll back instead of committing when nothing was actually written. That also requires passing boltdd.Tx to the boltdd.Bucket. Since transactions aren't safe for concurrent use, there's no need to worry about locking around the dirty flag.

Probably worth its own PR since this one is so close to merging. If you don't have time to do that PR, feel free to file an issue, or let me know and I can file one.
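
A minimal sketch of that idea, assuming a boltdd-style wrapper; the type and method shapes here are illustrative, not boltdd's actual code:

package boltddsketch

import "github.com/boltdb/bolt"

// Tx wraps a bolt transaction and remembers whether anything was written.
// No locking is needed: a transaction is never used concurrently.
type Tx struct {
	btx   *bolt.Tx
	dirty bool // set by any mutation that isn't de-duplicated away
}

// Put writes key/val into bucket and marks the transaction dirty. In boltdd
// the de-duplication check (comparing against the cached hash of the stored
// value) would run before the write and return early without setting dirty.
func (tx *Tx) Put(bucket, key, val []byte) error {
	b, err := tx.btx.CreateBucketIfNotExists(bucket)
	if err != nil {
		return err
	}
	tx.dirty = true
	return b.Put(key, val)
}

// Commit rolls back instead of committing when nothing changed, avoiding the
// two fdatasync() calls that even an empty bolt commit pays for.
func (tx *Tx) Commit() error {
	if !tx.dirty {
		return tx.btx.Rollback()
	}
	return tx.btx.Commit()
}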

@tgross tgross merged commit 1be5243 into hashicorp:master Oct 20, 2020
tgross added a commit that referenced this pull request Oct 20, 2020
@tgross tgross added this to the 1.0 milestone Oct 20, 2020
@tgross (Member) commented Oct 20, 2020

@ashtuchkin I've merged this and got it onto the changelog in #9132. This will ship in Nomad 1.0! Thanks again!

@ashtuchkin (Contributor Author)

Awesome, thank you Tim!

tgross added a commit that referenced this pull request Oct 20, 2020
fredrikhgrelland pushed a commit to fredrikhgrelland/nomad that referenced this pull request Oct 22, 2020
fredrikhgrelland pushed a commit to fredrikhgrelland/nomad that referenced this pull request Oct 22, 2020
@github-actions (bot)

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 14, 2022

Successfully merging this pull request may close these issues.

Nomad Client generates lots of IOPS when idle, saturating HDD (#9047)
4 participants