Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor HTTP Server and add ServerGroup to handle graceful shutdown of multiple servers #1047

Merged
merged 7 commits into from
Mar 7, 2021

Conversation

JoelSpeed
Copy link
Member

@JoelSpeed JoelSpeed commented Feb 15, 2021

Description

This moves the http server implementation to its own package under pkg/http (not sure about the name).
It also moves the configuration to a structured config type which means we can reuse the config for the main app server and the metrics server (which will give the metrics server the ability to serve over TLS).

One change from before is that the listeners are established before Start is called, so errors are returned sooner in case of listener conflicts.

I've also introduced a ServerGroup concept to start and run/manage the lifecycle of multiple servers at once, so we can make sure everything starts and stops gracefully.

Motivation and Context

Wanted to simplify the usage of http server implementations and make sure we use a common implementation for both metrics and the main application.

How Has This Been Tested?

Local testing environment and unit tests

Checklist:

  • My change requires a change to the documentation or CHANGELOG.
  • I have updated the documentation/CHANGELOG accordingly.
  • I have created a feature (non-master) branch for my PR.
  • I have added appropriate comments throughout my code changes.
  • I have added appropriate tests for my code changes.

@JoelSpeed JoelSpeed added this to In Progress (Pull Requests) in Release v7.1.0 via automation Feb 15, 2021
// Start starts the HTTP and HTTPS server if applicable.
// It will block until the context is cancelled.
// If any errors occur, only the first error will be returned.
func (s *server) Start(ctx context.Context) error {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method server.Start has 5 return statements (exceeds 4 allowed).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nonsense, the function has 1 return statement, the anonymous functions within have 4, false positive

// startServer creates and starts a new server with the given listener.
// When the given context is cancelled the server will be shutdown.
// If any errors occur, only the first error will be returned.
func (s *server) startServer(ctx context.Context, listener net.Listener) error {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method server.startServer has 5 return statements (exceeds 4 allowed).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nonsense, the function has 1 return statement, the anonymous functions within have 4, false positive

// Server represents an HTTP or HTTPS server.
type Server interface {
// Start blocks and runs the server.
Start(ctx context.Context) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to expose Shutdown (or Close) on this interface?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea I'm running with on this one is that the context being cancelled causes the shutdown, so to use it you do

ctx, cancel := context.WithCancel(context.Background())

go srv.Start(ctx)

// wait for signal
cancel()

If you think having a Close would help then I can look into adding that as well, would probably mean just moving the context inside the server itself implementation wise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gotcha -- so this ctx in the Server Start is more for controlling the concurrency with the underlying uses of the errgroup. (rather than in something like http.Server which doesn't take it).

I think I had Serve + Shutdown/Close on my mind when I saw an interface named Server (and mistook Start for Serve).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that's it, this is more of a Start with a stop channel style thing, except the stop channel is the context which can be cancelled

switch scheme {
case "", "http":
return "tcp"
default:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like a common misconfiguration would be https somehow here -- Is that handled somewhere else since that wouldn't align to the net.Listener legal networks?

The network must be "tcp", "tcp4", "tcp6", "unix" or "unixpacket".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is coming straight over from our existing codebase - carry on 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You raise an interesting point, we should probably investigate, but yeah, this is lift and shift so not sure 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use this on the https listener, that is forced to tcp as that's the only one that makes sense, so we don't need to handle this case separately I don't think

})
}

return g.Wait()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens if only 1 set of servers in the errgroup fails? This waits for all Goroutines to finish right?

I would think if either main or prometheus (or there HTTP vs HTTPS variant if both are stood up) fail, we want the whole server to fail so we aren't in a partially failed state? (then let whatever platform hosting restart mechanisms are in place to start the whole suite up again with everything?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As soon as the first non-nil error is returned, the groupCtx is cancelled which causes the shut down of all of the servers we started. Checking the godoc for errgroup.WithContext as this is what I think it is supposed to do at least:

WithContext returns a new Group and an associated Context derived from ctx.

The derived Context is canceled the first time a function passed to Go returns a non-nil error or the first time Wait returns, whichever occurs first.


// NewServerGroup creates a new Server to start and gracefully stop a collection
// of Servers.
func NewServerGroup(servers ...Server) Server {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this concept 👍

On HTTP/2, gRPC - we've had a few questions on that front. Would this design open up us potentially supporting a /oauth2/auth handler variant for gRPC based subrequest architecture?

(I'll punt on the complexity of HTTP/2 to an upstream for now 😅 )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think so, as long as we can create a gRPC server that we can start/stop using the Server interface defined in server.go, I wanted to make sure the start/stop was expandable in case we need further servers in the future

@JoelSpeed JoelSpeed force-pushed the http-server branch 2 times, most recently from bedde6c to 1815523 Compare February 16, 2021 22:52
@JoelSpeed JoelSpeed marked this pull request as ready for review February 16, 2021 22:53
@JoelSpeed JoelSpeed requested a review from a team as a code owner February 16, 2021 22:53
@JoelSpeed JoelSpeed changed the title [Very WIP] Refactor HTTP Server and add ServerGroup to handle graceful shutdown of multiple servers Refactor HTTP Server and add ServerGroup to handle graceful shutdown of multiple servers Feb 16, 2021
// setupTLSListener sets the server TLS listener if the HTTPS server is enabled.
// The HTTPS server can be disabled by setting the SecureBindAddress to "-" or by
// leaving it empty.
func (s *server) setupTLSListener(opts Opts) error {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method server.setupTLSListener has 5 return statements (exceeds 4 allowed).

// To use the secure server you must configure a TLS certificate and key.
Server Server `json:"server,omitempty"`

// Server is used to configure the HTTP(S) server for metrics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Server is used to configure the HTTP(S) server for metrics.
// MetricsServer is used to configure the HTTP(S) server for metrics.

func legacyServerFlagset() *pflag.FlagSet {
flagSet := pflag.NewFlagSet("server", pflag.ExitOnError)

flagSet.String("metrics-address", "", "the address /metrics will be served on (e.g. \":9100\")")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add support for HTTPS metrics with the legacy flags?

I think both this PR & the initial metrics PR will go out together in the same release for the first time (v7.1.0 I imagine). So there's no backwards compatibility we need to preserve with the existing metrics flags.

TLS support seems handy if it isn't too burdensome.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess distinct metrics tls file flags for solely the metrics endpoint won't help our config flag explosion problem 😨

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally happy to add them, should be relatively easy to add, what shall we call them 🤔
metrics-address, metrics-secure-address, metrics-tls-cert-file, metrics-tls-key-file?

It's not too bad in terms of code management now these are in the legacy sets (as they are grouped nicely)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for adding these!

}

var _ = BeforeSuite(func() {
By("Generating a self-signed cert for TLS tests", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This very thorough TLS setup is very nice!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we will need this in other places in the code as well, I think we should probably make a test utility package that we can store helpers to do this kind of set up, that can be a follow up though as we need it

if s.listener != nil {
g.Go(func() error {
if err := s.startServer(groupCtx, s.listener); err != nil {
return fmt.Errorf("error starting insecure server: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this reminds me -- I'm tempted to remove support for our force-https HTTP->HTTPS redirect middleware for security reasons (since we don't have any HSTS headers set from our end to give that process the appropriate protection to prevent future MITM attacks).

But from a security profile in our current state, where do we end up if a user sets both these servers, but doesn't set the force-https flag? Will the whole server be served on HTTP -- and what negative consequences does that have?

Maybe I need to understand the userbase more -- but I don't see OAuth2 Proxy being the first proxy facing the internet. In my mind that should be something like nginx, which should be dealing with the HTTP->HTTPS redirects, HSTS settings (or the upstream apps do this...?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But from a security profile in our current state, where do we end up if a user sets both these servers, but doesn't set the force-https flag? Will the whole server be served on HTTP -- and what negative consequences does that have?

To do this, they would have to use the alpha config for the moment. I think it's still useful to have the redirect even though we don't necessarily have the most secure HTTPS settings in that case, it's an easy way for users to enforce HTTPS without having to worry about how that's configured in another proxy.

Will the whole server be served on HTTP -- and what negative consequences does that have?

I don't think there's any more danger than someone just running it on HTTP anyway, I think it's up to the users to make sure it's secure

// tcpKeepAliveListener sets TCP keep-alive timeouts on accepted
// connections. It's used by so that dead TCP connections (e.g. closing laptop
// mid-download) eventually go away.
type tcpKeepAliveListener struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any historical context as to why we only wrap the TLS listener in this tcpKeepAliveListener implementation of net.Listener and not the standard listener?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I had a dig and there's no conversation about this when it was implemented. I have no idea why it is only used for one, perhaps keep alive is more common for TLS to prevent handshake renegotiation or something?

}
err = tc.SetKeepAlive(true)
if err != nil {
logger.Printf("Error setting Keep-Alive: %v", err)
Copy link
Member

@NickMeves NickMeves Feb 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errorf? What do you think for this type of error where we don't stop processing flow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will fix.

Erm, I think this error is extremely rare, so we probably never really hit this case, I think this ok for now.
I have just copied this from the original location though so could be nonsense.

@JoelSpeed
Copy link
Member Author

Updated the broken bits and added some flags for the metrics server for TLS. Also realised I don't have tests for the legacy servers conversion, so will need to add those

@JoelSpeed JoelSpeed force-pushed the http-server branch 2 times, most recently from 1243547 to 7f6b698 Compare February 19, 2021 12:08
expectHTTPListener: false,
expectTLSListener: true,
}),
Entry("with an invalid bind address port", &newServerTableInput{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test cases are very comprehensive - Nice!

The only addition I could think of adding is the case where a bind port is already taken.

NickMeves
NickMeves previously approved these changes Feb 21, 2021
Copy link
Member

@NickMeves NickMeves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR! Really good stuff - I feel like I get a little better at Go every time I review your code 😄

@JoelSpeed
Copy link
Member Author

@NickMeves I think this is ready to go now. I've rebased on latest master (no conflicts) and added tests for the legacy server conversion, that should be the only new part since your last review, PTAL

Copy link
Member

@NickMeves NickMeves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT! I'm excited for these changes + new metrics support in v7.1. I will definitely be rolling it out everywhere throughout Greenhouse and plugging in Prometheus ASAP 👍

@JoelSpeed JoelSpeed merged commit 6894738 into master Mar 7, 2021
Release v7.1.0 automation moved this from In Progress (Pull Requests) to Done Mar 7, 2021
@JoelSpeed JoelSpeed deleted the http-server branch April 1, 2021 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

2 participants