Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race in StreamFramer and truncation in api/AllocFS.Logs #4234

Merged
merged 10 commits into from
May 4, 2018
Merged

Conversation

schmichael
Copy link
Member

@schmichael schmichael commented Apr 30, 2018

Fixes #4159

This was extremely difficult to debug since there is an incredible amount of indirection, proxying, and copying all in code that tries to avoid unnecessary copying because it's proxying a potentially significant amount of data! Luckily I added a test to api/fs_test.go that demonstrated the bug. Sadly it's slow (6-8s on my Vagrantbox).

The two bugs affecting #4159 are easiest to review by looking at their commits alone:

  • Fix race in api/ between sending frames and closing: 41f05dc
  • Fix race in StreamFramer due to unintentional sharing of a slice's backing array: 361db26
  • Sadly I introduced a new truncation error in StreamFramer which is fixed here: 6858c52

}
s.l.Unlock()
}

// send takes a StreamFrame, encodes and sends it
func (s *StreamFramer) send(f *StreamFrame) error {
sending := *f
f.Data = nil
Copy link
Member Author

@schmichael schmichael Apr 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't let this fool you as it fooled me and everyone who touched it before! 😅

Nil'ing the slice doesn't prevent a race. sending.Data will have the same backing array as f.Data used to. This means the next call to s.readData operates on the same backing array as sending.Data is reading in another goroutine!

At least that's my best explanation (and reverting this file immediately induces the bug).

api/fs.go Outdated
@@ -315,8 +320,11 @@ func (a *AllocFS) Logs(alloc *Allocation, follow bool, task, logType, origin str
// Decode the next frame
var frame StreamFrame
if err := dec.Decode(&frame); err != nil {
errCh <- err
close(frames)
if err == io.EOF {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put my review comment in the wrong place so repasting it here:

Shouldn't we also close the channel when the error is ErrClosedPipe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I think due to how we currently Close the pipe that can't happen, but there's no reason not to be defensive.


//TODO There should be a way to connect the client to the servers in
//makeClient above
require.NoError(c.Agent().SetServers([]string{fmt.Sprintf("127.0.0.1:%d", rpcPort)}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't makeClient already do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't! This makeClient helper is very strange and all of the tests in api/ that expect a client node seem to be commented out! For example: https://github.com/hashicorp/nomad/blob/master/api/allocations_test.go#L33-L59

I'm honestly not sure what the right approach is here. Either we need to improve this test client or we need to remove it and use one of the other test client helpers we have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That test client just serializes a config and starts the agent with it. By default it only creates the server. You may want to consider just using a dev agent.

@schmichael schmichael force-pushed the b-4159 branch 2 times, most recently from d0cc37d to 1b34922 Compare May 1, 2018 16:50
Closing the frames chan is the only race-free way to signal to receivers
that all frames have been sent and no errors have occurred.

If EOF is sent on error chan receivers may not receive the last frame
(or frames since the chan is buffered) before receiving the error.

Closing frames is the idiomatic way of signaling there is no more data
to be read from a chan.
According to go/codec's docs, Reset(...) should be called on
Decoders/Encoders before reuse:

https://godoc.org/github.com/ugorji/go/codec

I could find no evidence that *not* calling Reset() caused bugs, but
might as well do what the docs say?
In the old code `sending` in the `send()` method shared the Data slice's
underlying backing array with its caller. Clearing StreamFrame.Data
didn't break the reference from the sent frame to the StreamFramer's
data slice.
return
}
decoder.Reset(httpPipe)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look related to the race, was this all general cleanup?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and I was torn on whether to include it or not.

When testing adding or removing these lines had no impact on truncation or ordering issues.

However, they also didn't hurt and the docs say Reset should be called when "reusing" an encoder/decoder? https://godoc.org/github.com/ugorji/go/codec (search for Reset)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted with the team, and we're leaning toward keeping it Just In Case.

@schmichael schmichael merged commit f2f6dab into master May 4, 2018
@schmichael schmichael deleted the b-4159 branch May 4, 2018 21:24

//TODO There should be a way to connect the client to the servers in
//makeClient above
require.NoError(c.Agent().SetServers([]string{fmt.Sprintf("127.0.0.1:%d", rpcPort)}))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That test client just serializes a config and starts the agent with it. By default it only creates the server. You may want to consider just using a dev agent.

n := new(StreamFrame)
*n = *s
n.Data = make([]byte, len(s.Data))
copy(n.Data, s.Data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the race on the backing slice I understand how this fixes the issue but it is rather unfortunate that we are copying the data! I believe another fix would be for the out channel of the framer to be a wrapper:

type FrameFuture struct {
  Frame *StreamFrame
  FrameSent chan struct{}
}

func (f *FrameFuture) Sent() {
  close(f.FrameSent)
}

And then the caller can call Sent on the future after it has been flushed: https://github.com/hashicorp/nomad/pull/4234/files#diff-03a99c783d6606b25fccfc5f3d26092bR470

Then send() can select on that and the exit channel. This way we can avoid copying.

@@ -423,20 +422,23 @@ func (f *FileSystem) logs(conn io.ReadWriteCloser) {
go func() {
for {
if _, err := conn.Read(nil); err != nil {
if err == io.EOF {
if err == io.EOF || err == io.ErrClosedPipe {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@dadgar dadgar May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here 51440e2

@github-actions
Copy link

github-actions bot commented Mar 5, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nomad logs: output is incomplete and disordered
3 participants