Client/Server TLS dynamic reload #3492

Merged
merged 24 commits into from
Jan 23, 2018

Contributor

@chelseakomlo chelseakomlo commented Nov 2, 2017

This PR includes:

  • Upgrading/downgrading from non-TLS connections to TLS via SIGHUP
  • Closing long-lived connections on reload (HTTP, RPC, Raft)

@chelseakomlo chelseakomlo force-pushed the f-client-tls-reload branch 5 times, most recently from 070e632 to e27c2db Compare November 3, 2017 20:48
@chelseakomlo chelseakomlo changed the title WIP: Client TLS dynamic reload Client TLS dynamic reload Nov 3, 2017
@chelseakomlo chelseakomlo changed the title Client TLS dynamic reload Client/Server TLS dynamic reload Nov 4, 2017
@chelseakomlo chelseakomlo force-pushed the f-client-tls-reload branch 2 times, most recently from be7abb1 to ed07520 Compare November 6, 2017 14:53
@chelseakomlo chelseakomlo changed the title Client/Server TLS dynamic reload WIP: Client/Server TLS dynamic reload Nov 6, 2017
@chelseakomlo chelseakomlo force-pushed the f-client-tls-reload branch 6 times, most recently from 392b066 to 576b595 Compare November 13, 2017 18:19
a.logger.Printf("[WARN] agent: Issue reloading the server's TLS Configuration, consider a full system restart: %v", err.Error())
return err, false
}
} else if c := a.Client(); c != nil {
Contributor

The agent could be running as both a server and a client, so don't use else if here; use a separate if.
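A minimal sketch of the suggested structure, assuming hypothetical `agent`, `server`, and `client` types (these are illustrative stand-ins, not the actual Nomad types). Two independent checks let an agent running in both modes reload both roles:

```go
package main

import "fmt"

// server and client are hypothetical stand-ins for the agent's two roles.
type server struct{ reloaded bool }
type client struct{ reloaded bool }

func (s *server) reloadTLS() error { s.reloaded = true; return nil }
func (c *client) reloadTLS() error { c.reloaded = true; return nil }

// agent may hold a server, a client, or both.
type agent struct {
	server *server
	client *client
}

// reloadTLS uses two independent if statements so that an agent running
// in both modes reloads both roles. With else-if, the client branch
// would be skipped whenever a server is present.
func (a *agent) reloadTLS() error {
	if a.server != nil {
		if err := a.server.reloadTLS(); err != nil {
			return fmt.Errorf("reloading server TLS: %w", err)
		}
	}
	if a.client != nil {
		if err := a.client.reloadTLS(); err != nil {
			return fmt.Errorf("reloading client TLS: %w", err)
		}
	}
	return nil
}

func main() {
	a := &agent{server: &server{}, client: &client{}}
	if err := a.reloadTLS(); err != nil {
		fmt.Println("reload failed:", err)
		return
	}
	fmt.Println("server reloaded:", a.server.reloaded, "client reloaded:", a.client.reloaded)
}
```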

@chelseakomlo chelseakomlo changed the title WIP: Client/Server TLS dynamic reload Client/Server TLS dynamic reload Nov 21, 2017
Member

@preetapan preetapan left a comment

Made one pass through

client/client.go Outdated
@@ -363,6 +364,25 @@ func (c *Client) init() error {
return nil
}

// ReloadTLSConnectoins allows a client to reload RPC connections if the
Member

typo

client/client.go Outdated

return nil
}

Member

Is the only place that can throw any errors here the OutgoingTLSWrapper method?

QueryOptions: structs.QueryOptions{Region: "dc1"},
}
var out structs.SingleNodeResponse
testutil.AssertUntil(100*time.Millisecond,
Member

What is this trying to establish? It looks like it always expects Node.GetNode to fail.

Contributor Author

This makes several checks until either the request succeeds or it hits the predetermined timeout.

if err != nil {
return err
}
c.httpServer = http
Member

@preetapan preetapan Nov 21, 2017

This assignment seems unsafe without a mutex around it. What if two calls to this reload method happen in quick succession?

Contributor Author

Great catch, fixing.
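One way to guard the assignment, sketched with hypothetical `command` and `httpServer` types (the field and method names are assumptions, not the code that landed in this PR):

```go
package main

import (
	"fmt"
	"sync"
)

// httpServer is a hypothetical stand-in for the agent's HTTP server.
type httpServer struct{ addr string }

// command holds the current server behind a mutex so that concurrent
// reloads cannot interleave their reads and writes of the field.
type command struct {
	mu         sync.Mutex
	httpServer *httpServer
}

// swapHTTPServer installs next and returns the server it replaced, so the
// caller can shut the old one down outside the lock.
func (c *command) swapHTTPServer(next *httpServer) *httpServer {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := c.httpServer
	c.httpServer = next
	return old
}

func main() {
	c := &command{httpServer: &httpServer{addr: "127.0.0.1:4646"}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each concurrent reload installs its own replacement server.
			c.swapHTTPServer(&httpServer{addr: fmt.Sprintf("127.0.0.1:%d", 5000+i)})
		}(i)
	}
	wg.Wait()
	fmt.Println("server installed:", c.httpServer != nil)
}
```

Returning the old server keeps any slow shutdown work out of the critical section.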

client/client.go Outdated
c.config.TLSConfig = newConfig
c.configLock.Unlock()

if c.config.TLSConfig.EnableRPC {
Member

Why is this only checking EnableRPC and not EnableHTTP?

Contributor Author

HTTP connections are handled at the HTTP server level

c.agent.logger.Printf("[ERR] agent: failed to reload http server: %v", err)
return
}
c.agent.logger.Println("[INFO] agent: successfully restarted the HTTP server")
Member

This logic looks identical to the reloadHTTPServerOnConfigChange method, and that method is unused. Did you forget to make the replacement?

Contributor Author

Thanks! I did, fixed that up.


oldPool := p.pool
for _, conn := range oldPool {
conn.Close()
Member

This ignores the error from conn.Close(), but so does the rest of this file. We need some context from @dadgar or @schmichael as to why Close errors are ignored.

Member

As far as I know there's nothing useful to do with Close errors. In some circumstances logging them might be useful, but I can't imagine what actionable information errors from this Close could contain? I don't think it's even useful for catching programming errors like closing closed connections as there's always the possibility the connection may close concurrently due to peer disconnects or other issues.

If somebody has an example of a useful Close error I'd love to know it!

close(n.shutdownCh)
n.shutdown = true
}

Member

Create an issue in the raft library and open a PR for this there and vendor the changes in. Otherwise, this will drift when we update raft.

@@ -472,6 +472,17 @@ NOTE: Dynamically reloading certificates will _not_ close existing connections.
If you need to rotate certificates due to a security incident, you will still
need to completely shutdown and restart the Nomad agent.

## Migrating a cluster to TLS

Nomad supports dynamically reloading a server to move from plaintext to TLS.
Contributor

Nomad supports dynamically reloading its TLS configuration. To reload Nomad's configuration, first update the configuration file and then send the Nomad agent a SIGHUP signal. Note that this will only reload a subset of the configuration file, including the TLS configuration.

When reloading the configuration, if the TLS configuration has changed, the agent will reload all network connections and use the new configuration when establishing new connections. This process works for both upgrading and downgrading TLS (but we recommend upgrading).
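To make the flow concrete, here is a rough sketch of a SIGHUP-driven reload. The `tlsConfig` struct, `reloadIfChanged`, `loadConfig`, and the file names are all hypothetical; the sketch also sends itself a synthetic signal so that it terminates, whereas an operator would normally run `kill -HUP <pid>`:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// tlsConfig is a hypothetical, trimmed-down TLS configuration.
type tlsConfig struct {
	EnableRPC  bool
	EnableHTTP bool
	CertFile   string
	KeyFile    string
}

// reloadIfChanged reports whether the TLS settings differ and, if so,
// replaces *current with next. A real agent would also re-establish its
// network connections at this point.
func reloadIfChanged(current *tlsConfig, next tlsConfig) bool {
	if *current == next {
		return false
	}
	*current = next
	return true
}

func main() {
	current := tlsConfig{} // start in plaintext

	// loadConfig is a placeholder for re-reading the configuration file.
	loadConfig := func() tlsConfig {
		return tlsConfig{EnableRPC: true, EnableHTTP: true, CertFile: "agent.pem", KeyFile: "agent-key.pem"}
	}

	sighup := make(chan os.Signal, 1)
	signal.Notify(sighup, syscall.SIGHUP)
	defer signal.Stop(sighup)

	// For demonstration only: queue a synthetic signal so the example exits.
	sighup <- syscall.SIGHUP

	<-sighup
	if reloadIfChanged(&current, loadConfig()) {
		fmt.Println("TLS configuration changed; reloading connections")
	} else {
		fmt.Println("TLS configuration unchanged")
	}
}
```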

client/client.go Outdated
var tlsWrap tlsutil.RegionWrapper
if newConfig != nil && newConfig.EnableRPC {
tw, err := c.config.NewTLSConfiguration(newConfig).OutgoingTLSWrapper()

Contributor

Remove blank line here

assert := assert.New(t)

s1, addr := testServer(t, func(c *nomad.Config) {
c.Region = "foo"
Contributor

Further, I don't think it is necessary to set it for this test.

CertFile: foocert,
KeyFile: fookey,
}

Contributor

Don't you want to assert the node registered using plaintext?

defer c1.Shutdown()

newConfig := &nconfig.TLSConfig{}

Contributor

Assert it isn't registered

// NewTLSConfiguration returns a TLSUtil Config for a new TLS config object
// This allows a TLSConfig object to be created without first explicitly
// setting it
func (c *Config) NewTLSConfiguration(tlsConfig *config.TLSConfig) *tlsutil.Config {
Contributor

Can you remove this and put it back to how it was? Further, if you want to keep these separate, this one shouldn't be a method on the client Config. It should be func NewTLSConfiguration.

err := c.agent.Reload(newConf)
if err != nil {
c.agent.logger.Printf("[ERR] agent: failed to reload the config: %v", err)
shouldReload := c.agent.ShouldReload(newConf)
Contributor

Should this maybe return a separate boolean for whether the HTTP server should be reloaded? As we add stuff, you could potentially reload the agent without needing to reload the HTTP server.
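A sketch of what separate return values could look like. The `config` and `tlsSettings` types and the rule for when the HTTP server needs a restart are assumptions made for illustration, not the logic in this PR:

```go
package main

import "fmt"

// tlsSettings and config are hypothetical subsets of the agent configuration.
type tlsSettings struct {
	EnableHTTP bool
	EnableRPC  bool
	CertFile   string
}

type config struct {
	LogLevel string
	TLS      tlsSettings
}

// shouldReload reports separately whether the agent and the HTTP server
// need reloading, so callers can skip the HTTP restart when only other
// reloadable settings changed.
func shouldReload(old, next config) (agent, http bool) {
	tlsChanged := old.TLS != next.TLS
	agent = tlsChanged || old.LogLevel != next.LogLevel
	// Assumed rule: restart HTTP only when TLS changed and HTTP TLS is,
	// or was, enabled.
	http = tlsChanged && (old.TLS.EnableHTTP || next.TLS.EnableHTTP)
	return agent, http
}

func main() {
	old := config{LogLevel: "INFO"}
	next := config{LogLevel: "DEBUG"}
	agent, http := shouldReload(old, next)
	fmt.Println("reload agent:", agent, "reload HTTP server:", http)
}
```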


func TestServer_Reload_TLS_WithNilConfiguration(t *testing.T) {
t.Parallel()
assert := assert.New(t)
Contributor

Moving forward, can you use require (https://godoc.org/github.com/stretchr/testify/require)? It will hard fail on the first failed assertion.

}

s := makeHTTPServer(t, func(c *Config) {
c.TLSConfig = newConfig.TLSConfig
Contributor

Don't you have to set the region to regionFoo? I don't understand the test setup or how it is passing. The region is wrong, the server starts with TLS, and it is being reloaded with the same config it was originally started with.

nomad/config.go Outdated
// newTLSConfig returns a TLSUtil Config based on the server configuration
// This is useful for creating a TLSConfig object without explictely setting it
// on the config.
func (c *Config) newTLSConfig(newConf *config.TLSConfig) *tlsutil.Config {
Contributor

Bump, let's put this back to how it was.

nomad/server.go Outdated
s.listenerCh = make(chan struct{})
list, err := net.ListenTCP("tcp", s.config.RPCAddr)
if err != nil || list == nil {
s.logger.Printf("[ERR] nomad: No TLS listener to reload")
Contributor

Is the list == nil a valid case? Why not just check and log the error?

nomad/server.go Outdated

if !newConfig.TLSConfig.Equals(s.config.TLSConfig) {
if err := s.reloadTLSConnections(newConfig.TLSConfig); err != nil {
s.logger.Printf("[DEBUG] nomad: reloading server TLS configuration")
Contributor

I think you wanted to log the error?

nomad/server.go Outdated
@@ -776,6 +884,7 @@ func (s *Server) setupRPC(tlsWrap tlsutil.RegionWrapper) error {
return err
}
s.rpcListener = list
s.listenerCh = make(chan struct{})
Contributor

Use createRPCListener? There is duplicated code

nomad/server.go Outdated
tlsConf := s.config.newTLSConfig(newTLSConfig)
incomingTLS, tlsWrap, err := getTLSConf(newTLSConfig.EnableRPC, tlsConf)
if err != nil {
s.logger.Printf("[ERR] nomad: unable to reset TLS context")
Contributor

Log the errors. Can you search through the PR and make sure every log message includes the error?
"unable to reset TLS context" is slightly vague. Can you make it more specific, saying that it was creating the TLS config that failed?

nomad/server.go Outdated
s.configLock.Unlock()

if s.rpcCancel == nil {
s.logger.Printf("[ERR] nomad: No TLS Context to reset")
Contributor

This isn't about the context; the problem is that there are no RPC servers. Structure this so that the log reuses the returned error.

nomad/server.go Outdated
// CLose existing streams
wrapper := tlsutil.RegionSpecificWrapper(s.config.Region, tlsWrap)
s.raftLayer = NewRaftLayer(s.rpcAdvertise, wrapper)

Contributor

s.raftTransport.Reload(s.raftLayer)?

This calls the tests into question, if they are passing.

nomad/server.go Outdated
s.startRPCListener()

time.Sleep(3 * time.Second)
if err != nil {
Contributor

Why is this not attached to the call site that returns the error?

s.connPool.ReloadTLS(tlsWrap)

// reinitialize our rpc listener
s.rpcListener.Close()
Contributor

Can you do all the RPC stuff first and then (with a separate comment) start the Raft stuff?

return trans
}

// Pause closes the current stream for a NetworkTransport instance
func (n *NetworkTransport) Pause() {
Contributor

I would prefer these to be called CloseStream and UseStream, with documentation that CloseStream must be called before providing a new stream layer.

// Accept incoming connections
conn, err := n.stream.Accept()
if err != nil {
if n.IsShutdown() {
return
}
// TODO Getting an error here
Contributor

You probably need to check the ctx again as Accept is blocking

@chelseakomlo chelseakomlo force-pushed the f-client-tls-reload branch 4 times, most recently from 098f300 to 1dab7b5 Compare January 17, 2018 00:02
close second goroutine in raft-net
@chelseakomlo chelseakomlo force-pushed the f-client-tls-reload branch 2 times, most recently from a08ab45 to 231e9dd Compare January 18, 2018 13:01
nomad/server.go Outdated

time.Sleep(3 * time.Second)
time.Sleep(500 * time.Millisecond)
Contributor

Remove

assert.NotNil(out)
assert.Equal(out.CreateIndex, resp.JobModifyIndex)
for _, serv := range servers {
testutil.WaitForResult(func() (bool, error) {
Contributor

Wait for a leader in the positive cases.

{"path":"gopkg.in/tomb.v1","checksumSHA1":"TO8baX+t1Qs7EmOYth80MkbKzFo=","revision":"dd632973f1e7218eb1089048e0798ec9ae7dceb8","revisionTime":"2014-10-24T13:56:13Z"},
{"path":"gopkg.in/tomb.v2","checksumSHA1":"WiyCOMvfzRdymImAJ3ME6aoYUdM=","revision":"14b3d72120e8d10ea6e6b7f87f7175734b1faab8","revisionTime":"2014-06-26T14:46:23Z"},
{"path":"gopkg.in/yaml.v2","checksumSHA1":"12GqsW8PiRPnezDDy0v4brZrndM=","revision":"a5b47d31c556af34a302ce5d659e6fea44d90de0","revisionTime":"2016-09-28T15:37:09Z"}
{
Contributor

make vendorfmt

@@ -162,7 +162,7 @@
{"path":"github.com/hashicorp/logutils","revision":"0dc08b1671f34c4250ce212759ebd880f743d883"},
{"path":"github.com/hashicorp/memberlist","checksumSHA1":"1zk7IeGClUqBo+Phsx89p7fQ/rQ=","revision":"23ad4b7d7b38496cd64c241dfd4c60b7794c254a","revisionTime":"2017-02-08T21:15:06Z"},
{"path":"github.com/hashicorp/net-rpc-msgpackrpc","revision":"a14192a58a694c123d8fe5481d4a4727d6ae82f3"},
{"path":"github.com/hashicorp/raft","checksumSHA1":"ecpaHOImbL/NaivWrUDUUe5461E=","revision":"3a6f3bdfe4fc69e300c6d122b1a92051af6f0b95","revisionTime":"2017-08-07T22:22:24Z"},
{"path":"github.com/hashicorp/raft","checksumSHA1":"zkA9uvbj1BdlveyqXpVTh1N6ers=","revision":"077966dbc90f342107eb723ec52fdb0463ec789b","revisionTime":"2018-01-17T20:29:25Z","version":"=master","versionExact":"master"},
Member

Should the version be just "master", not "=master"?

Contributor Author

Good point. This was what I thought the govendor syntax needed to be. Fixed.

Contributor

@dadgar dadgar left a comment

Add to changelog please

@chelseakomlo chelseakomlo merged commit 9d006ec into master Jan 23, 2018
@chelseakomlo chelseakomlo deleted the f-client-tls-reload branch January 23, 2018 10:51
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 13, 2023
4 participants