Fix GC'd alloc tracking #3445

schmichael · 2017-10-25T21:14:24Z

The two important changes:

The Client.allocs map now contains all AllocRunners again, not just un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs allocs.
Client.NumAllocs, which the GC calls to determine whether the node is over the max allocs, now properly counts all non-GC'd allocs.

Removes error returns from lots of methods that never actually returned errors. Actual GC'ing by the AllocRunner is best-effort, so there's rarely a meaningful error to return in those call chains.

Testing the GC in a meaningful way is very difficult. A couple of the existing unit tests made incorrect assumptions about how the units they were testing (the GC's max allocs calculation) behaved, and therefore weren't actually testing any realistic behavior. This is what allowed bugs to persist in the face of what would appear to be reasonable test coverage. I replaced a couple of those small unit tests with a larger integration test (integration in that it spins up a server & client and black-box tests the GC).

Also stops logging "marked for GC" twice.

preetapan · 2017-10-26T15:04:02Z

client/alloc_runner.go

-	// runner shutting down. Since handleDestroy can be called by Run() we
-	// can't block shutdown here as it would cause a deadlock.
-	go r.updater(alloc)
+	r.updater(alloc)


Can you explain why the updater is now a blocking call, given the deleted comment about it causing a deadlock if it's a blocking call?

A great question that took me a long time to figure out!

The short answer is: because I couldn't find a reason for it, and it works without it being in a goroutine.

The longer answer is much more fulfilling though:

When I first made this hack in #2852, the Client.updateAllocStatus method that this line calls could trigger a blocking GC of this alloc. However that was fixed in #3007 -- Client.updateAllocStatus was greatly simplified and can no longer trigger a blocking GC.

It can still, however, cause a concurrent GC, so I should actually move this update call after the final saveAllocRunnerState() call to ensure we don't try to save state after a GC. Thanks to proper locking the worst-case scenario should be a no-op, but in the past state-saving-after-GC was a source of state corruption errors. So the more defensive we can be the better!

tl;dr - minor updates incoming

dadgar · 2017-10-26T20:42:15Z

client/client.go

@@ -1205,6 +1211,7 @@ func (c *Client) updateNodeStatus() error {
 	for _, s := range resp.Servers {
 		addr, err := resolveServer(s.RPCAdvertiseAddr)
 		if err != nil {
+			c.logger.Printf("[DEBUG] client: ignoring invalid server %q: %v", s.RPCAdvertiseAddr, err)


dadgar · 2017-10-26T20:49:08Z

client/client.go

@@ -1234,8 +1241,14 @@ func (c *Client) updateNodeStatus() error {
 // updateAllocStatus is used to update the status of an allocation
 func (c *Client) updateAllocStatus(alloc *structs.Allocation) {
 	if alloc.Terminated() {
-		// Terminated, mark for GC
-		if ar, ok := c.getAllocRunners()[alloc.ID]; ok {
+		// Terminated, mark for GC iff we're still tracking this alloc


dadgar · 2017-10-26T20:50:22Z

client/client.go

 	go c.garbageCollector.Collect(alloc.ID)

-	return nil
+	return


Remove this?

dadgar · 2017-10-26T20:54:43Z

client/gc.go

-	a.destroyAllocRunner(gcAlloc.allocRunner, "forced collection")
-	return nil
+
+	a.logger.Printf("[DEBUG] client.gc: alloc %s already garbage collected", allocID)


This isn't necessarily true, right? I could do Collect("hello-world") and this would log that it was already collected. I would log that it isn't being tracked by the garbage collector.

Ah, true. This is in the call chain of the GC Alloc API so users could pass in arbitrary data. Will clarify.
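
One way to express that distinction, as a sketch only: the `collector` type, its `tracked` map, and the error text below are hypothetical and not the wording that ended up in the PR. An unknown ID may have been collected earlier or may never have existed, so the message claims only that the alloc is not tracked.

```go
package main

import "fmt"

// collector is a hypothetical stand-in for the client's alloc
// garbage collector.
type collector struct {
	tracked map[string]func() // alloc ID -> destroy callback
}

// Collect destroys a tracked alloc. For an unknown ID it cannot tell
// "already collected" from "never existed", so it reports only that
// the alloc is not tracked.
func (c *collector) Collect(allocID string) error {
	destroy, ok := c.tracked[allocID]
	if !ok {
		return fmt.Errorf("alloc %q is not tracked by the garbage collector", allocID)
	}
	delete(c.tracked, allocID)
	destroy()
	return nil
}

func main() {
	destroyed := 0
	c := &collector{tracked: map[string]func(){
		"alloc-1": func() { destroyed++ },
	}}
	fmt.Println(c.Collect("alloc-1"))     // tracked: collected
	fmt.Println(c.Collect("alloc-1"))     // no longer tracked
	fmt.Println(c.Collect("hello-world")) // never tracked
	fmt.Println("destroyed:", destroyed)
}
```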

dadgar · 2017-10-26T21:01:18Z

client/gc.go

 func (a *AllocGarbageCollector) numAllocs() int {
-	return a.allocRunners.Length() + a.allocCounter.NumAllocs()
+	return a.allocCounter.NumAllocs()


Just delete method and use the allocCounter?

dadgar · 2017-10-26T21:05:52Z

client/gc_test.go

+	job.TaskGroups[0].Tasks[0].Driver = "mock_driver"
+	job.TaskGroups[0].Tasks[0].Config["run_for"] = "30s"
+	nodeID := client.Node().ID
+	if err := state.UpsertJob(98, job); err != nil {


Use the assert library

I did at first and then got really confusing test failures. I didn't realize the assert library uses t.Error instead of t.Fatal, so you need to check its return bool to know whether or not to fail!

So I just skipped the assert library as it didn't seem to add anything.

dadgar · 2017-10-27T21:10:36Z

@schmichael Actually the test appears to be flaky:

--- FAIL: TestAllocGarbageCollector_MakeRoomForAllocations_MaxAllocs (140.67s)

	gc_test.go:335: 2 alloc state: expected 7 allocs (found 7); expected 1 destroy (found 4)

dadgar · 2017-10-27T21:43:03Z

TestHTTP_AllocGC_ACL seems to be broken in this branch only as well

schmichael · 2017-10-30T22:20:57Z

Fixed TestHTTP_AllocGC_ACL, and TestAllocGarbageCollector_MakeRoomForAllocations_MaxAllocs has remained passing for at least 3 Travis builds after I made some tweaks.

Merging to 0.7.1. Yell at me if it proves flaky.

github-actions · 2023-03-19T02:18:14Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

schmichael mentioned this pull request Oct 25, 2017

Feature request: detailed allocation GC docs and logging of allocation state before GC #3421

Open

preetapan reviewed Oct 26, 2017

dadgar requested changes Oct 26, 2017

schmichael force-pushed the b-gc branch from 5211ce8 to 45080ab October 26, 2017 21:46

schmichael changed the title ~~Fix GC'd alloc tracking~~ [WIP] Fix GC'd alloc tracking Oct 26, 2017

schmichael changed the base branch from b-nomad-0.7.1 to master October 26, 2017 21:47

schmichael force-pushed the b-gc branch 2 times, most recently from 0d4ec58 to 307a1e5 October 26, 2017 23:41

schmichael changed the title ~~[WIP] Fix GC'd alloc tracking~~ Fix GC'd alloc tracking Oct 26, 2017

schmichael force-pushed the b-gc branch from 307a1e5 to c7a654d October 27, 2017 00:28

dadgar approved these changes Oct 27, 2017

schmichael force-pushed the b-gc branch 3 times, most recently from 54b3c03 to 4eeb673 October 30, 2017 16:43

schmichael added 4 commits October 30, 2017 11:22

Fix GC'd alloc tracking

62d103a

The Client.allocs map now contains all AllocRunners again, not just un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs allocs. Also stops logging "marked for GC" twice.

Trigger GCs after alloc changes

fefa4d0

GC much more aggressively by triggering GCs when allocations become terminal as well as after new allocations are added.

Fix race in test

260caf2

Fix regression by returning error on unknown alloc

6c36f76

schmichael force-pushed the b-gc branch from 4eeb673 to 6c36f76 October 30, 2017 18:22

schmichael changed the base branch from master to b-nomad-0.7.1 October 30, 2017 22:19

schmichael merged commit 84b9c3e into b-nomad-0.7.1 Oct 30, 2017

schmichael deleted the b-gc branch October 30, 2017 22:21

schmichael added a commit that referenced this pull request Nov 1, 2017

Add #3445 to CHANGELOG

14b601b

schmichael added a commit that referenced this pull request Nov 2, 2017

Add #3445 to changelog

350c043

schmichael mentioned this pull request Nov 2, 2017

Memory leak #3420

Closed

chelseakomlo pushed a commit that referenced this pull request Nov 3, 2017

Add #3445 to changelog

eb67ab0

schmichael mentioned this pull request Dec 13, 2017

Nodes won't garbage collect #3658

Closed

github-actions bot locked as resolved and limited conversation to collaborators Mar 19, 2023
