Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicit support for auth change during reconnect #1694

Closed
tommyjcarpenter opened this issue Aug 7, 2024 · 14 comments · Fixed by #1713
Closed

Explicit support for auth change during reconnect #1694

tommyjcarpenter opened this issue Aug 7, 2024 · 14 comments · Fixed by #1713
Assignees
Labels
proposal Enhancement idea or proposal

Comments

@tommyjcarpenter
Copy link

tommyjcarpenter commented Aug 7, 2024

Proposed change

I am experiencing "fuzzy" behavior regarding the following. Fuzzy as in, the code works when the NATS client is connected to a single URL, but it doesnt work when connected to multiple URLs.

We are using the NATS Auth Callout, but client side, the Password field needs to change each time the disconnect callback is called, which happens when the auth callout JWTs expire (eg hourly). Specifically, our autth callout backend is interpreting a "password" specific to our in house IAM stack, and those also expire.

Because our auth callout tokens expire every hour, our disconnectedCB is called every hour. And in that, we want to change conn.Opts.Password so that it uses the new password (the client has to regenerate one at this time) on the reconnection attempt (see below, the actual auth call is made after the reconnected callback, so "hook wise", we already have a hook at the right spot.)

My functions currently looks like this:

func (gb *Base) tokenAndConnect(ctx context.Context) (*nats.Conn, error) {
	iamCreds, err := gb.iamRuntime.GetAccessToken(ctx, &identity.GetAccessTokenRequest{})
	if err != nil {
		gb.Logger.Fatalf("failed to get access token: %v", err)
	}

	// For the auth callout, we have a sentinel and we also need an Identity token as our password
	return nats.Connect(
		gb.config.Nats.URL,
		nats.UserCredentials(gb.config.Nats.SentinelCredsFile),
		nats.UserInfo(gb.config.MsName, iamCreds.Token),
	)
}


func (gb *Base) AddNATSConn(ctx context.Context) *Base {
	nc, err := gb.tokenAndConnect(ctx)
	if err != nil {
		gb.Logger.Fatalf("failed to create Initial NATS connection: %v", err)
	}

	/*
		    Disconnect Callbacks are called when the function reconnects....
			https://github.com/nats-io/nats.go/blob/370bc4fc4cac2de8d5dbbddbb98739f9cb2d7f8b/nats.go#L2779-L2787
			The purpose of this is to get a new identityJWT before the reconnect happens

			We cant use the reconnectCB (misleading name here)
			ReconnectCB is called *after* the reconnection happens. In order to refresh the token before the reconnect,
			Doc strings for variouis callbacks:
			ReconnectedCB sets the reconnected handler called whenever the connection is successfully reconnected.
			ConnectedCB sets the connected handler called when the initial connection is established. It is not invoked on successful reconnects - for reconnections, use ReconnectedCB
			DisconnectedErrCB sets the disconnected error handler that is called whenever the connection is disconnected.
			ClosedCB sets the closed handler that is called when a client will no longer be connected.
	*/

	nc.SetDisconnectErrHandler(func(conn *nats.Conn, err error) {
		gb.Logger.Infof("NATS connection disconnected: %v. Generating new Identity JWT", err)
		iamCreds, err := gb.iamRuntime.GetAccessToken(ctx, &identity.GetAccessTokenRequest{})
		if err != nil {
			gb.Logger.Fatalf("Failed to get access token: %w", err)
		}
		
                 // *********************************!!!
                 // doesnt work in multi URL case but works fine when you seed the client with a single NATS server
		conn.Opts.Password = iamCreds.Token
	})

	nc.SetReconnectHandler(func(conn *nats.Conn) {
		gb.Logger.Infof("NATS connection reconnected. Reconnect count %d", conn.Reconnects)
	})

	gb.NatsConn = nc
	return gb
}

I think it has something to do with this warning which I just saw..

nats.go/nats.go

Lines 537 to 538 in e3df53d

// Modifying the configuration of a running Conn is a race.
Opts Options

It works in the case where the client only was given one cluster URL but not multiple.

I eventually tracked down this call chain on reconnects:

  1. the disconnect handler is called first:

    nats.go/nats.go

    Lines 2781 to 2787 in e3df53d

    if !nc.initc {
    if nc.Opts.DisconnectedErrCB != nil {
    nc.ach.push(func() { nc.Opts.DisconnectedErrCB(nc, err) })
    } else if nc.Opts.DisconnectedCB != nil {
    nc.ach.push(func() { nc.Opts.DisconnectedCB(nc) })
    }
    }
  2. later in that function, createConn is called though I dont think this is the problematic one:

    nats.go/nats.go

    Line 2873 in e3df53d

    err = nc.createConn()
  3. then later down in that function, nc.processConnectInit() is called:

    nats.go/nats.go

    Line 2886 in e3df53d

    if nc.err = nc.processConnectInit(); nc.err != nil {
  4. that calls sendConnect():

    nats.go/nats.go

    Line 2380 in e3df53d

    err = nc.sendConnect()
  5. that calls connectProto

    nats.go/nats.go

    Line 2644 in e3df53d

    cProto, err := nc.connectProto()
  6. that contains:

    nats.go/nats.go

    Lines 2551 to 2566 in e3df53d

    u := nc.current.url.User
    if u != nil {
    // if no password, assume username is authToken
    if _, ok := u.Password(); !ok {
    token = u.Username()
    } else {
    user = u.Username()
    pass, _ = u.Password()
    }
    } else {
    // Take from options (possibly all empty strings)
    user = o.User
    pass = o.Password
    token = o.Token
    nkey = o.Nkey
    }

that code makes it unclear when this Password change is going to be reflected.

My fear/suspicion is that, in the multiple URL case, that User is populated; triggering the if, which is why it sends the old token, which is why it breaks. In the single URL case, that is not populated, thus triggering the else

I am trying to see how I can explicitly force it to use a new password, either via setting some variable sure to be picked up, or via a callback. Maybe a flag that says "always override the auth from the connection options".

This block might be related, I was trying to find anywhere where nc.current.url is set:

nats.go/nats.go

Lines 1821 to 1827 in e3df53d

if implicit {
curl := nc.current.url
// Check to see if we do not have a url.User but current connected
// url does. If so copy over.
if u.User == nil && curl.User != nil {
u.User = curl.User
}

Use case

We are using the NATS Auth Callout, but client side, the Password field needs to change each time the disconnect callback is called, which happens when the auth callout JWTs expire (eg hourly). Specifically, our autth callout backend is interpreting a "password" specific to our in house IAM stack.

Contribution

No response

@tommyjcarpenter tommyjcarpenter added the proposal Enhancement idea or proposal label Aug 7, 2024
@tommyjcarpenter
Copy link
Author

tommyjcarpenter commented Sep 9, 2024

Good morning - I'd like to follow up here, as this issue is basically preventing several applications to adopt NATS in our organization.

In the meantime, our canary application is doing a complicated reset procedure that establishes a new connection, but its onerous and teams aren't going to adopt it. All we need is a way to update the password that the next connection attempt will use in the disconnectCB (which is again called on Auth Callout expiry).

@derekcollison
Copy link
Member

Looping in @Jarema and @piotrpio here. This sounds reasonable especially with auth callouts growing popularity and the desire for lower bounded authN timeframes.

@piotrpio piotrpio self-assigned this Sep 10, 2024
@tommyjcarpenter
Copy link
Author

tommyjcarpenter commented Sep 10, 2024

Awesome. A simple callback along the lines of DisconnectedCB (it can even be called from DisconnectedCB if there is no other signal that the AC expired) that allows resetting the password (our passwords are themselves expiring JWTs) would be perfect.

@piotrpio
Copy link
Collaborator

Hey @tommyjcarpenter. While I don't think manually changing Opts in DisconnectedCb is a good idea (it's racy as you pointed out and would set precedent for modifying other options dynamically, which we really suggest not to), there may be another straightforward solution.

We could expose a callback options similar to TokenHandler option (func TokenHandler(cb AuthTokenHandler) Option). This callback would be called each time the client attempts to reconnect and in this callback you could easily generate the new token. The signature would look something like this (names could probably be better):

func UserInfoHandler(cb AuthUserInfoHandler) Option {}

AuthUserInfoHandler would be a function returning (string, string) where you would generate the new password.

Would you be satisfied with such solution?

@tommyjcarpenter
Copy link
Author

@piotrpio I would be thrilled with that. As long as its called on each reconnect, which can happen either due to auth callout expiry, or just a connection drop. Our internal JWTs expire every few minutes, so this CB could be needed even on a normal drop.

Thank you so much, this will unblock a large part of our org from adopting this - given how our organization's IAM stack works, with expiring passwords, and given we'd like to set the auth callout JWTs to expire (IDK, Daily maybe), this is needed to bring it all together.

@tommyjcarpenter
Copy link
Author

tommyjcarpenter commented Sep 10, 2024

Just curious: implementation wise, where would this AuthUserInfoHandler "plug into" that isn't a race in the same way that changing connopts.password is?

Because I noticed we don't have a race when the client is initially seeded with only one NATS server, but only in cases where they are given a NATS url like nats://foo1:4222, nats://foo2:4222

this took me a lot of debugging before actually finding that race code comment 😅

@piotrpio
Copy link
Collaborator

There is a race because you are potentially changing Opts in DisconnectedCB while the reconnect is already triggered. Since DisconnectedCB is asynchronous and called in its own goroutine, when the client is reading the options during reconnect you have a read write contention. I'm not sure why it's only detected when using more than one server, but it should not be done either way.

With callback, the option is set once and executed in the same routine as the reconnect process (synchronously), so there is no risk of that. It'll be called somewhere here and the resulting user/password will be used in CONNECT protocol message directly.

@tommyjcarpenter
Copy link
Author

Awesome. Let me know if I can help with reviews or anything in any way.

@tommyjcarpenter
Copy link
Author

Great! When will a release be cut for this?

@piotrpio
Copy link
Collaborator

We have a few more things to add for the next release so I would say early net month is viable.

@tommyjcarpenter
Copy link
Author

@piotrpio im going to go get off this commit in the meantime, is this feature considered done/ready?

@tommyjcarpenter
Copy link
Author

tommyjcarpenter commented Sep 20, 2024

@piotrpio it worked beautifully 😭 thanks again

One thing I will note is that I didn't see a setter, I'm doing it like this:

// userInfoCB is a callback function for the NATS client to get the Identity JWT on auth callout JWT expiry
// see https://github.com/nats-io/nats.go/issues/1694
func (c *Canary) userInfoCB() (string, string) {
	c.logger.Infof("Getting Identity JWT from %s", socketpath)
	iamCreds, err := c.iamRuntime.GetAccessToken(c.iamCtxt, &identity.GetAccessTokenRequest{})
	if err != nil {
		c.logger.Fatalf("Failed to get access token: %v", err)
	}

	return "nats-canary", iamCreds.Token
}

// AddNATSConnAC adds a NATS connection using the IAM Runtime
func (c *Canary) AddNATSConnAC() *Canary {
	c.logger.Infof("Creating new NATS connection")

	newNC, err := nats.Connect(
		natsURL,
		nats.UserCredentials(sentinelFile),
		nats.UserInfoHandler(c.userInfoCB),
	)
	if err != nil {
		c.metrics.Collectors[NATSConnErrCounter].(*prometheus.CounterVec).With(prometheus.Labels{"kind": c.kind}).Inc()
		c.logger.Fatalf("Failed to create NATS connection: %v", err)
	}

	c.logger.Infof("New NATS connection established")
	c.metrics.Collectors[NATSConnEstablished].(*prometheus.CounterVec).With(prometheus.Labels{"kind": c.kind}).Inc()
	c.NatsConn = newNC
	c.addJetStream() // use the new connection to create the JetStream context
	return c
}

@tommyjcarpenter
Copy link
Author

Heyo - Is this going to be included in a release anytime soon? Thank you!

@piotrpio
Copy link
Collaborator

Hey! Yes, we'll be releasing the client this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Enhancement idea or proposal
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants