
pool: track usage of incoming streams #10710

Merged
notnoop merged 2 commits into main from b-pool-streaming-used on Jun 7, 2021
Conversation

@notnoop notnoop (Contributor) commented Jun 7, 2021

Track usage of incoming streams on a connection. Connections whose reference
count drops to zero get marked as unused and reaped by a periodic job.

This fixes a bug where `alloc exec` and `alloc fs` sessions get terminated
unexpectedly. Previously, when a client's heartbeat switched between servers,
the pool connection reaper would eventually identify the connection as unused
and close it, even if it had active exec/fs sessions.

I've tested this code in a live cluster and kept an `alloc exec` session open for almost 3 hours before I killed it. Without the change, an `alloc exec` session typically lasted around 7-20 minutes.

Fixes #10579
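
To make the mechanism concrete, here is a minimal self-contained sketch of the reference-counting idea (the type, field, and method names below are illustrative, not the actual Nomad pool code): each incoming stream bumps a per-connection reference count, and the periodic reaper leaves connections with live references alone.

package main

import (
	"fmt"
	"sync/atomic"
)

// conn stands in for a pooled multiplexed connection (illustrative only).
type conn struct {
	refCount    int32 // in-flight streams/RPCs currently using the connection
	shouldClose int32 // set once the connection should go away when unused
	closed      bool
}

// markForUse is called when a stream (e.g. an exec/fs session) starts using
// the connection.
func (c *conn) markForUse() { atomic.AddInt32(&c.refCount, 1) }

// releaseUse is the complement of markForUse; the final release performs a
// deferred close if the connection was already flagged.
func (c *conn) releaseUse() {
	if atomic.AddInt32(&c.refCount, -1) == 0 && atomic.LoadInt32(&c.shouldClose) == 1 {
		c.closed = true
	}
}

// reap models the periodic job: a connection with live references is no
// longer considered unused, so active exec/fs sessions survive it.
func (c *conn) reap() {
	if atomic.LoadInt32(&c.refCount) > 0 {
		return
	}
	c.closed = true
}

func main() {
	c := &conn{}
	c.markForUse()                                 // a long-lived alloc exec stream starts
	c.reap()                                       // reaper runs while the stream is active
	fmt.Println("closed while in use?", c.closed)  // false: the session keeps working
	atomic.StoreInt32(&c.shouldClose, 1)           // later, the connection is flagged for closing
	c.releaseUse()                                 // stream ends; the deferred close happens now
	fmt.Println("closed after release?", c.closed) // true
}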

@notnoop notnoop added this to the 1.1.1 milestone Jun 7, 2021
@notnoop notnoop self-assigned this Jun 7, 2021
@notnoop notnoop added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 7, 2021
@tgross tgross (Member) left a comment

Nice work, LGTM. I've left a couple of comments where I think we could make the existing code harder for developers to break, but I'll leave that to your judgement.

client/rpc.go (Outdated)
@@ -301,7 +300,7 @@ func (c *Client) rpcConnListener() {
 
 // listenConn is used to listen for connections being made from the server on
 // pre-existing connection. This should be called in a goroutine.
-func (c *Client) listenConn(s *yamux.Session) {
+func (c *Client) listenConn(s *pool.Conn) {
@tgross (Member) commented:

Tiniest of nitpicks, but with the type being changed to *pool.Conn, naming this argument s doesn't make sense anymore. Maybe rename the arg to p?

(pool.Conn is kind of a weirdly named type in general because it's more like "connection factory" or "connect proxy" given that Accept() returns net.Conn, but it isn't a "connection pool" either as it contains the connection pool. But probably best not to rework the whole thing. 😀 )

@notnoop (Contributor, Author) replied:

I'll change it to c. FWIW, pool.Conn is the underlying physical connection; pool.Accept() returns a wrapped yamux.Session, which also implements net.Conn (which is what clients expect).
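
A rough sketch of that relationship, assuming hashicorp/yamux (type and field names are illustrative, not the merged implementation): the pool-level connection owns the yamux session, and each accepted stream holds a reference on it that is given back when the stream closes.

package pool

import (
	"net"
	"sync/atomic"

	"github.com/hashicorp/yamux"
)

// Conn sketches the pooled physical connection that owns the multiplexed session.
type Conn struct {
	session  *yamux.Session
	refCount int32
}

func (c *Conn) markForUse() { atomic.AddInt32(&c.refCount, 1) }
func (c *Conn) releaseUse() { atomic.AddInt32(&c.refCount, -1) }

// Stream wraps an accepted yamux stream so that closing it releases the
// reference it holds on the parent connection.
type Stream struct {
	*yamux.Stream
	conn *Conn
}

func (s *Stream) Close() error {
	s.conn.releaseUse()
	return s.Stream.Close()
}

// Accept hands out the next incoming stream as a net.Conn, which is what
// callers expect; yamux streams already satisfy that interface.
func (c *Conn) Accept() (net.Conn, error) {
	stream, err := c.session.AcceptStream()
	if err != nil {
		return nil, err
	}
	c.markForUse()
	return &Stream{Stream: stream, conn: c}, nil
}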

@@ -461,7 +496,7 @@ func (p *ConnPool) RPC(region string, addr net.Addr, version int, method string,
 		p.clearConn(conn)
 	}
 
-	p.releaseConn(conn)
+	conn.releaseUse()
@tgross (Member) commented:

While reviewing this PR I'm noticing we have an existing hidden temporal coupling of conn.releaseUse() to getRPCClient. If someone were to call conn.releaseUse outside of this method without first calling p.clearConn(conn), we could get a value of -1 for the reference count and then never close the connection. The current code is correct but easy for someone to break.

Two suggestions:

  • Move this into a defer conn.releaseUse() right after we check the error from getRPCClient to make the temporal coupling more explicit (see the sketch after this list).
  • Maybe make releaseUse tolerant of misuse by having it close on refCount < 1:
// releaseUse is the complement of `markForUse`, to free up the reference count
func (c *Conn) releaseUse() {
	refCount := atomic.AddInt32(&c.refCount, -1)
	if refCount < 1 && atomic.LoadInt32(&c.shouldClose) == 1 {
		c.Close()
	}
}
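
And a minimal self-contained sketch of the first suggestion (hypothetical helper names; acquire stands in for getRPCClient and is assumed to have already bumped the reference count): pairing the release with the successful acquisition makes it hard for a new return path to skip it.

package main

import (
	"errors"
	"fmt"
)

type pooledConn struct{ refCount int }

func (c *pooledConn) markForUse() { c.refCount++ }
func (c *pooledConn) releaseUse() { c.refCount-- }

// acquire stands in for getRPCClient: on success, the returned connection
// already holds one reference.
func acquire() (*pooledConn, error) {
	c := &pooledConn{}
	c.markForUse()
	return c, nil
}

// doRPC shows the suggested shape: the defer sits immediately after the
// acquisition error check, so every return path releases exactly once.
func doRPC(fail bool) error {
	conn, err := acquire()
	if err != nil {
		return err
	}
	defer conn.releaseUse()

	if fail {
		return errors.New("rpc error") // an early return still releases via the defer
	}
	return nil
}

func main() {
	fmt.Println(doRPC(false), doRPC(true))
}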

@notnoop (Contributor, Author) replied:

I agree it's brittle, I'll move to defer.

I'm unsure about handling negative refCounts. It's unclear how refCount could become negative through re-ordering calls. Only conn.releaseUse decrements the refCount. So negative values indicate double-release bugs, and defensive handling will probably lead to yet more subtle cases of unexpected connection closing like this one. I wish we could simply panic on negative values so we can find the source of the double free.

@tgross (Member) replied:

> So negative values indicate double-release bugs, and defensive handling will probably lead to yet more subtle cases of unexpected connection closing like this one.

That's a good point. Totally agreed on that.

@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jun 7, 2021
defer releaseUse to ensure it gets run

Also, clarify conn vs stream naming a little bit
@notnoop notnoop merged commit 3f7a5c1 into main Jun 7, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 7, 2021
@notnoop notnoop deleted the b-pool-streaming-used branch June 7, 2021 14:22
notnoop pushed a commit that referenced this pull request Jun 7, 2021
Co-authored-by: Tim Gross <tgross@hashicorp.com>
notnoop pushed a commit that referenced this pull request Jun 9, 2021
notnoop pushed a commit that referenced this pull request Jun 9, 2021
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 20, 2022
Successfully merging this pull request may close these issues.

Alloc Exec spontaneously disconnects after some time
2 participants