Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-833537 Refactor auth to make it ready for retries #836

Closed
wants to merge 1 commit into from

Conversation

sfc-gh-pfus
Copy link
Collaborator

@sfc-gh-pfus sfc-gh-pfus commented Jun 29, 2023

Description

Please explain the changes you made here.

Checklist

  • Code compiles correctly
  • Run make fmt to fix inconsistent formats
  • Run make lint to get lint errors and fix all of them
  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary

@sfc-gh-pfus sfc-gh-pfus requested a review from a team as a code owner June 29, 2023 07:17
@@ -440,6 +378,101 @@ func authenticate(
return &respd.Data, nil
}

func tryAuth(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function builds auth request data - for instance creates JWT token. We can retry this function invocation for specific error codes which will cause token recreation (with new expiration).

auth.go Outdated Show resolved Hide resolved
auth.go Show resolved Hide resolved
auth.go Outdated Show resolved Hide resolved
@ghost
Copy link

ghost commented Jun 29, 2023

Hi @sfc-gh-pfus,
I think we need much deeper refactoring to allow renew JWT token for each retry attempt.
For authentication request the call stack is:
postAuth()->postRestful()->newRetryHTTP(). The retry behavior is in retryHTTP.execute() in retry.go.
The root case of JWT expire issue https://github.com/snowflakedb/snowflake-sdks-drivers-issues-teamwork/issues/443 is that, the retry attempt is made in lower layer (in retry.go) that applies to all http requests, while for key pair authentication the JWT token already expired during the retry or doesn't have much time left for server side to finish the authentication.
To fix the issue we need to make the retry on key pair auth in a much shorter interval (10 seconds by default), so when server get the JWT token it has enough time to finish the authentication. At the mean time, we also need to keep all existing retry/timeout behavior (login timeout, retry on retry-able errors, etc., basically all the logic in retryHTTP.execute() in retry.go) unchanged.

To do that, we could bring the retry logic in retryHTTP.execute() to higher layer (like what you did in auth.go), and simply make retryHTTP.execute() to retry in the shorter interval (10 seconds by default). But in this way we would have the existing retry logic implemented in two places and have to change both of them in case we need further change on retry in the future.
I think what other drivers (like ODBC and JDBC) did was having a shorter timeout specifically for key pair auth, in lower layer (like retry.go) when reach that timeout keep the retry context and allow to renew (through callback or something) the request in higher layer and then continue with the existing retry context. In such way we don't have to duplicate the exiting retry logic.

@sfc-gh-pfus
Copy link
Collaborator Author

Hi @sfc-gh-ext-simba-hx , thank you very much for such deep explanation!
However, there is a catch. Current retry is based on status codes, not on the body content - body is decoded by caller, not by retry algorithm. And the catch is that when token is invalid, Snowflake returns 200, and the low-level retry does not work.

I'm not keen to change retry behaviour on this low level, as there are two drawbacks:

  1. To retry on JWT is invalid error, we would have to move body reading/decoding to retry algorithm, which IMHO is not a good idea - only client knows how to decode body and this may be too large.
  2. We would need to move creating a token somehow into retry algorithm which sounds terrible :)

I think, that having my solution - another level of retrying is a lesser of two evils. What do you think?

@ghost
Copy link

ghost commented Jun 30, 2023

Hi @sfc-gh-pfus
Jut in case you might not aware of that, this is an existing issue that had been brought up two years ago, SNOW-372033, related ODBC PR https://github.com/snowflakedb/snowflake-odbc/pull/328, and the solution had been discussed at that time already.
Although I don't have the access but I think you would find more details in SNOW-372033.
For my understanding the solution is not adding retry on JWT is invalid error, that would be too late, since it waste time on handling authentication with a not fresh enough JWT token that could expire during the authentication.
The solution was before getting JWT is invalid error, resend the auth request with a refreshed JWT token (by default per 10 seconds), so the server side would have enough time to finish the authentication before JWT token expired.
At the mean time, the existing timeout/retry settings/mechanism should keep unchanged, otherwise it would be a breaking change.

@sfc-gh-pfus
Copy link
Collaborator Author

Thanks @sfc-gh-ext-simba-hx for answering!
Let me summarise what I understand so far.

  1. There is a keypair auth, which is based on transport layer on JWT. This JWT has an exp claim, which is calculated as now + cfg.JWTExpireTimeout, and currently default value for JWTExpireTimeout is 60 seconds - which means, that auth request must be completed within one minute, otherwise the auth fails.
  2. JWTExpireTimeout is a bit confusing name, because it's not the real timeout (like the timeout after which the request is interrupted) - it's just the token validity duration limit.
  3. We have following timeouts taking part in the scenario:
    a) ClientTimeout (default 15min) - total time for handling request - like connecting, waiting for server to respond etc. This is the only timeout set on http.Client - which means this is the only value that can cause interruption.
    b) LoginTimeout (default 1min) - this is a bit strange for me. Based on debugging this is the timeout, within all retries should complete - but only counting sleeping between calls (see retry function and the totalTimeout variable). It's a time during which subsequent retries are performed. When it passes, retrying stops and the request fails. This timeout value is not passed to http.Client and not cause interruption in any way.
  4. When SF handles JWT auth after exp, it returns JWT token is invalid, but responds with 200. Retry algorithm decides that it's ok, because it returned 200. Retry mechanism does not read body, only reads HTTP status code.
  5. When retry on HTTP level occurs, token is not recreated, so if retry occurs expired token is sent.

So current issue covers the situation:

  1. Client wants to auth using JWT token.
  2. Token is created with on time t0, with expiration set to t0 + 60s.
  3. Code executes retry HTTP algorithm and sends the first attempt of auth.
  4. As ClientTimeout is set to 15min, Snowflake can handle our request after t0 + 60s and responds with JWT token is invalid.
  5. Snowflake returns this as 200, so we do not perform retry on retry level.

Can you confirm my understanding so far?

If I understand everything correctly, I can recommend the following points:

  1. We can add another timeout for JWT token auth, which is shorter than 15min - we can set it to 10s. Unfortunately, this timeout is http.Client wide setting, so we need to have another http.Client just for this case (but it's not terrible IMHO).
  2. We can refactor retry code a bit in a following manner. We can add similar functions to existing ones, but which not receive a body as a parameter, but receives function that creates body. So now auth layer would build a function that creates a fresh token every time it is invoked and retry could recreate token when needed - token building logic stays in auth layer, retry algorithm is clean, so far so good. Connected to the timeout from the previous point - we would recreate and resend a new token every 10s.
  3. We can refactor retry code similar to the previous point, but retry can also take function/predicate deciding if retry should happen - based on the HTTP response (of course current retry logic will be held). So in this attempt, auth would decide that if auth error is caused by invalid token, retry should occur. The only drawback is that in such situation we would need to read response body in retry algorithm - but I think I can write the code in a way, that it would work (yes, I'm aware that body should be read only once).

What do you think about my proposal?

@ghost
Copy link

ghost commented Jul 4, 2023

@sfc-gh-pfus
I think we have the same understanding of this issue.
For your recommendation 1, 2 make sense to me. I don't know any other driver does 3 though so I think that might be too much. As I mentioned this issue has been discussed 2 years ago crossing all drivers and all other Snowflake drivers have fixed this issue already. Somehow only go driver got left behind at that time so I think it's better to follow the decision made before for all other drivers and not to bring new approach.
BTW @sfc-gh-ext-simba-jl is also working on https://github.com/snowflakedb/snowflake-sdks-drivers-issues-teamwork/issues/443 please sync with her through slack or in the ticket to avoid duplicated efforts.

@sfc-gh-ext-simba-jl
Copy link
Collaborator

@sfc-gh-pfus

I was thinking to add a go routine timer inside retry.go execute() so the select case will notify when the JWT renew time is up (every 10s) and renew the token before the next POST request (of course this will be restricted to JWT auth request).

It will look like the following:

for {
    ch := make(chan string)
    if totalTimeout > 0 {
        if hasJwtExpired {
            // renew JWT token here and update the request body
        }
        
        go jwtTimer(JWTRenewTimeout, ch)
    } else {
        break
    }
    
    ... // existing logic for retryCounter and totalTimeout etc.
    
    select {
    case ... // existing cases
    case res := <- ch:
        hasJwtExpired = true
        // loop back to the beginning of the for loop to renew JWT token
}
func jwtTimer(JWTRenewTimeout time.Duration, ch chan string) {
	time.Sleep(JWTRenewTimeout)
	ch <- "Time to renew the JWT token"
	close(ch)
}

I was just playing the above code in a simple test app and haven't tested in the driver yet so not sure if this will work or not. But I'm not sure if a separate http.Client is needed. Did I simplify the problem too much? Also is it too risky to do this in retry.go?

Please let me know your thoughts.

Thanks,
Joyce

@sfc-gh-pfus
Copy link
Collaborator Author

Hi @sfc-gh-ext-simba-jl ! Good to have someone also working on this :)
I'd still choose may solution. There are several arguments:

  1. Your code introduces some JWT logic to retry algorithm, which is request independent.
  2. We have to play with another thread, which is always error-prone - even if it works now it's easier for someone to break it in the future.
  3. My solution is (in my opinion of course!) more readable and scalable - it would be easier to add more such corner cases in "clean way".

I started working on this yesterday afternoon and looks promising. Today I have busy time, but I hope I will post first version so you could take a look and we could decide if it's good enough. What do you think? Did I convince you? :)

@sfc-gh-pfus
Copy link
Collaborator Author

Fixed in other PRs.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 19, 2023
@sfc-gh-pfus sfc-gh-pfus deleted the SNOW-833537-keypair-retry branch October 17, 2023 08:41
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants