Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: fix transaction handling in state store #9438

Merged
merged 8 commits into from
Nov 25, 2020
8 changes: 5 additions & 3 deletions nomad/state/state_store.go
Original file line number Diff line number Diff line change
Expand Up @@ -2122,7 +2122,9 @@ func (s *StateStore) CSIVolumeByID(ws memdb.WatchSet, namespace, id string) (*st
if err != nil {
return nil, fmt.Errorf("volume lookup failed: %s %v", id, err)
}
ws.Add(watchCh)
if ws != nil {
ws.Add(watchCh)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that this is explicit, I always get nervous when handling watchset. Want to note that this is not required - many other cases don't do a nil check, and apparently it's handled in Add: https://github.com/hashicorp/go-memdb/blob/v1.3.0/watch.go#L16-L20

Interestingly, this commit adds a new ws.Add without doing a nil check ;-).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, and it's a documented part of the ws.Add method. Maybe we should just remove these extra checks then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 88a7e71

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that's interesting... removing those caused NPE errors at those lines in the test runs for jobVersionByID!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a928e1]

goroutine 54 [running]:
github.com/hashicorp/nomad/nomad/state.(*StateStore).jobVersionByID(0xc00062b470, 0xc00057e180, 0x0, 0xc00054b630, 0x7, 0xc000551380, 0x28, 0x0, 0x0, 0x0, ...)
	github.com/hashicorp/nomad/nomad/state/state_store.go:1842 +0x161
github.com/hashicorp/nomad/nomad/state.(*StateStore).upsertJobVersion(0xc00062b470, 0x7, 0xc00064e900, 0xc00057e180, 0x0, 0x0)
	github.com/hashicorp/nomad/nomad/state/state_store.go:1753 +0x249
github.com/hashicorp/nomad/nomad/state.(*StateStore).upsertJobImpl(0xc00062b470, 0x7, 0xc00064e6c0, 0x35de500, 0xc00057e180, 0x0, 0x1)
	github.com/hashicorp/nomad/nomad/state/state_store.go:1523 +0x42c
github.com/hashicorp/nomad/nomad/state.(*StateStore).UpsertJob(0xc00062b470, 0x4, 0x7, 0xc00064e6c0, 0x0, 0x0)
	github.com/hashicorp/nomad/nomad/state/state_store.go:1449 +0xa8
github.com/hashicorp/nomad/nomad.(*nomadFSM).applyUpsertJob(0xc0005f5030, 0x4, 0xc000391801, 0x6b4, 0xbb3, 0x7, 0x0, 0x0)
	github.com/hashicorp/nomad/nomad/fsm.go:507 +0x205
github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0xc0005f5030, 0xc000241a90, 0x51dd880, 0xbfe7b879c156277c)
	github.com/hashicorp/nomad/nomad/fsm.go:219 +0x34e
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc0006a43e0)
	github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:90 +0x2c2
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc000692e00, 0x1, 0x40)
	github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc0005a8000)
	github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:219 +0x3c4
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc0005a8000, 0xc0004c4100)
	github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/state.go:146 +0x55
created by github.com/hashicorp/raft.(*raftState).goFunc
	github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/state.go:144 +0x66
    jobs_test.go:2176: err: Put "http://127.0.0.1:31070/v1/jobs": EOF

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I must be missing something here, because testing with the actual go-memdb library shows otherwise...

func TestNilWatch(t *testing.T) {

	test := func(ws WatchSet) {
		watchCh := make(chan struct{})
		ws.Add(watchCh)
	}
	test(nil)
	test(NewWatchSet())
}
$ go test . -v -run TestNilWatch -count=1
=== RUN   TestNilWatch
--- PASS: TestNilWatch (0.00s)
PASS
ok      github.com/hashicorp/go-memdb   0.025s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it's because jobVersionByID's signature is:

(s *StateStore) jobVersionByID(txn *txn, ws *memdb.WatchSet, namespace, id string) ([]*structs.Job, error)

I don't think we're supposed to be taking that pointer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting jobVersionByID takes a *memdb.WatchSet unlike the other methods in this file. It was introduced as such since the beginning in https://github.com/hashicorp/nomad/pull/2566/files#diff-53c7d524897a192ece819816a9dcf006dee46747fbb853e7f638601ea556a72cR561-R572 - but without explanation...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that it's the only place in the code base where we do that, including Consul and Vault, I suspect it's just a long-lived typo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops - I missed the last comment when I posted mine - I agree, it shouldn't be a pointer here.


if obj == nil {
return nil, nil
Expand Down Expand Up @@ -2208,6 +2210,7 @@ func (s *StateStore) CSIVolumesByNodeID(ws memdb.WatchSet, nodeID string) (memdb
}
iter.Add(raw)
}
ws.Add(iter.WatchCh())

return iter, nil
}
Expand All @@ -2229,7 +2232,6 @@ func (s *StateStore) CSIVolumesByNamespace(ws memdb.WatchSet, namespace string)
func (s *StateStore) CSIVolumeClaim(index uint64, namespace, id string, claim *structs.CSIVolumeClaim) error {
txn := s.db.WriteTxn(index)
defer txn.Abort()
ws := memdb.NewWatchSet()

row, err := txn.First("csi_volumes", "id", namespace, id)
if err != nil {
Expand All @@ -2246,7 +2248,7 @@ func (s *StateStore) CSIVolumeClaim(index uint64, namespace, id string, claim *s

var alloc *structs.Allocation
if claim.State == structs.CSIVolumeClaimStateTaken {
alloc, err = s.AllocByID(ws, claim.AllocationID)
alloc, err = s.allocByIDImpl(txn, nil, claim.AllocationID)
if err != nil {
s.logger.Error("AllocByID failed", "error", err)
return fmt.Errorf(structs.ErrUnknownAllocationPrefix)
Expand Down