
oidcc_provider_configuration_worker:get_provider_configuration/1 randomly times out #359

Closed · asabil opened this issue Jul 4, 2024 · 13 comments · Fixed by #361 or #365

asabil commented Jul 4, 2024

oidcc version

3.2.0

Erlang version

26.2.5

Elixir version

Summary

Calls to oidcc_provider_configuration_worker:get_provider_configuration/1 randomly time out. It seems that this is somehow related to httpc timing out?

The following crash report is logged by cowboy:

=CRASH REPORT==== 4-Jul-2024::11:14:21.242933 ===
  crasher:
    initial call: cowboy_stream_h:request_process/3
    pid: <0.2178.0>
    registered_name: []
    exception exit: {{timeout,
                         {gen_server,call,
                             [<0.1461.0>,get_provider_configuration]}},
                     [{gen_server,call,2,[{file,"gen_server.erl"},{line,404}]},
                      {oidcc_client_context,from_configuration_worker,4,
                          [{file,
                               "/workspace/myapp-server/_build/default/lib/oidcc/src/oidcc_client_context.erl"},
                           {line,137}]},
                      {myapp_authn_user,authenticate_with_jwt,1,
                          [{file,
                               "/workspace/myapp-server/apps/myapp_authn/src/myapp_authn_user.erl"},
                           {line,17}]},
                      {myapp_http_token_h,from,3,
                          [{file,
                               "/workspace/myapp-server/apps/myapp_http/src/handlers/myapp_http_token_h.erl"},
                           {line,100}]},
                      {cowboy_rest,call,3,
                          [{file,
                               "/workspace/myapp-server/_build/default/lib/cowboy/src/cowboy_rest.erl"},
                           {line,1590}]},
                      {cowboy_rest,process_content_type,3,
                          [{file,
                               "/workspace/myapp-server/_build/default/lib/cowboy/src/cowboy_rest.erl"},
                           {line,1096}]},
                      {cowboy_rest,upgrade,4,
                          [{file,
                               "/workspace/myapp-server/_build/default/lib/cowboy/src/cowboy_rest.erl"},
                           {line,284}]},
                      {cowboy_stream_h,execute,3,
                          [{file,
                               "/workspace/myapp-server/_build/default/lib/cowboy/src/cowboy_stream_h.erl"},
                           {line,306}]}]}
      in function  gen_server:call/2 (gen_server.erl, line 404)
      in call from oidcc_client_context:from_configuration_worker/4 (/workspace/myapp-server/_build/default/lib/oidcc/src/oidcc_client_context.erl, line 137)
      in call from myapp_authn_user:authenticate_with_jwt/1 (/workspace/myapp-server/apps/myapp_authn/src/myapp_authn_user.erl, line 17)
      in call from myapp_http_token_h:from/3 (/workspace/myapp-server/apps/myapp_http/src/handlers/myapp_http_token_h.erl, line 100)
      in call from cowboy_rest:call/3 (/workspace/myapp-server/_build/default/lib/cowboy/src/cowboy_rest.erl, line 1590)
      in call from cowboy_rest:process_content_type/3 (/workspace/myapp-server/_build/default/lib/cowboy/src/cowboy_rest.erl, line 1096)
      in call from cowboy_rest:upgrade/4 (/workspace/myapp-server/_build/default/lib/cowboy/src/cowboy_rest.erl, line 284)
      in call from cowboy_stream_h:execute/3 (/workspace/myapp-server/_build/default/lib/cowboy/src/cowboy_stream_h.erl, line 306)
    ancestors: [<0.2177.0>,<0.1475.0>,<0.1468.0>,<0.1467.0>,ranch_sup,
                  <0.1424.0>]
    message_queue_len: 0
    messages: []
    links: [<0.2177.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1598
    stack_size: 28
    reductions: 2911
  neighbours:

Current behavior

Inspecting the worker state shows that it is hanging in httpc:handle_answer/3:

1> erlang:process_info(pid(0,1461,0)).
[{registered_name,myapp_authn_oidcc_auth0},
 {current_function,{httpc,handle_answer,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,139},
 {links,[<0.1460.0>]},
 {dictionary,[{rand_seed,{#{type => exsss,next => #Fun<rand.0.65977474>,bits => 58,
                            uniform => #Fun<rand.1.65977474>,
                            uniform_n => #Fun<rand.2.65977474>,
                            jump => #Fun<rand.3.65977474>},
                          [264808474144978743|268960576097479628]}},
              {'$initial_call',{oidcc_provider_configuration_worker,init,
                                                                    1}},
              {'$ancestors',[myapp_authn_sup,<0.1459.0>]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.1458.0>},
 {total_heap_size,24620},
 {heap_size,6772},
 {stack_size,65},
 {reductions,562812697},
 {garbage_collection,[{max_heap_size,#{error_logger => true,include_shared_binaries => false,
                                       kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,21}]},
 {suspending,[]}]

Also, it seems like the worker process has accumulated a backlog of timer events:

2> erlang:process_info(pid(0,1461,0), messages).
{messages,[jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired|...]}
3> io:format("~p~n", [erlang:process_info(pid(0,1461,0), messages)]).
{messages,[jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired,jwks_expired,jwks_expired,
           jwks_expired,jwks_expired,jwks_expired]}
ok

Checking the stacktrace:

4> erlang:process_info(pid(0,1461,0), current_stacktrace).
{current_stacktrace,[{httpc,handle_answer,3,
                            [{file,"httpc.erl"},{line,865}]},
                     {httpc,handle_request,9,[{file,"httpc.erl"},{line,809}]},
                     {oidcc_http_util,'-request/4-fun-0-',5,
                                      [{file,"/workspace/myapp-server/_build/default/lib/oidcc/src/oidcc_http_util.erl"},
                                       {line,109}]},
                     {telemetry,span,3,
                                [{file,"/workspace/myapp-server/_build/default/lib/telemetry/src/telemetry.erl"},
                                 {line,321}]},
                     {oidcc_provider_configuration,load_jwks,2,
                                                   [{file,"/workspace/myapp-server/_build/default/lib/oidcc/src/oidcc_provider_configuration.erl"},
                                                    {line,264}]},
                     {oidcc_provider_configuration_worker,handle_continue,2,
                                                          [{file,"/workspace/myapp-server/_build/default/lib/oidcc/src/oidcc_provider_configuration_worker.erl"},
                                                           {line,212}]},
                     {gen_server,try_handle_continue,3,
                                 [{file,"gen_server.erl"},{line,1085}]},
                     {gen_server,loop,7,[{file,"gen_server.erl"},{line,995}]}]}

How to reproduce

I haven't been able to pinpoint the exact scenario leading to the hang/timeout yet.

Expected behavior

It shouldn't hang :)

asabil added the bug label Jul 4, 2024
maennchen (Member) commented:

@asabil Hm, I possibly haven't set any timeouts, which would let it hang forever. I'm open to a PR, or I'll investigate this next week.

asabil commented Jul 5, 2024

I added some printouts in oidcc_http_util:request/4, and it actually looks like there is a "timer" problem:

[1720176242986], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration",[]} [{timeout,60000}] default
[1720176243470], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176258475], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration", []} [{timeout,60000}] default
[1720176258526], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176258647], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176273530], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration", []} [{timeout,60000}] default
[1720176273658], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176273806], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176273852], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288659], HTTP: get {"https://myapp-dev.eu.auth0.com/.well-known/openid-configuration", []} [{timeout,60000}] default
[1720176288705], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288811], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288854], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default
[1720176288905], HTTP: get {<<"https://myapp-dev.eu.auth0.com/.well-known/jwks.json">>, []} [{timeout,60000}] default

So it looks like we somehow end up being blocked by auth0, because the number of jwks requests keeps growing (1, 2, 3, 4, ...).

For reference, this is the printout code:

io:format("[~B], HTTP: ~p ~p ~p ~p~n", [
  os:system_time(millisecond),
  Method,
  Request,
  HttpOpts,
  HttpProfile
]),

maennchen (Member) commented:

Hm, are they setting a strange cache header that leads to continuous refreshing?

asabil commented Jul 5, 2024

Looks like this helps, at least with the accumulation of timers:

diff --git a/src/oidcc_provider_configuration_worker.erl b/src/oidcc_provider_configuration_worker.erl
index d4539ec..a9cefa1 100644
--- a/src/oidcc_provider_configuration_worker.erl
+++ b/src/oidcc_provider_configuration_worker.erl
@@ -224,7 +224,8 @@ handle_continue(
 %% @private
 handle_info(backoff_retry, State) ->
     {noreply, State, {continue, load_configuration}};
-handle_info(configuration_expired, State) ->
+handle_info(configuration_expired, #state{jwks_refresh_timer = JwksRefreshTimer} = State) ->
+    maybe_cancel_timer(JwksRefreshTimer),
     {noreply, State#state{configuration_refresh_timer = undefined, jwks_refresh_timer = undefined},
         {continue, load_configuration}};
 handle_info(jwks_expired, State) ->

That being said, from my own experience, I think it is better to use erlang:start_timer instead of timer:send_after, as the former lets you match the timeout message against the currently active timer reference and avoid creating "parallel" timers such as in this case.

For example, with the current code I think it is still possible to hit a race condition where the timer being cancelled has already fired and the corresponding timeout message is sitting in the mailbox, not yet processed.
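
To illustrate (a minimal, simplified sketch with a made-up state record, not the actual oidcc worker code): storing the reference returned by erlang:start_timer and matching it in handle_info makes a stale timeout message easy to recognize and drop:

-record(state, {jwks_refresh_timer = undefined :: undefined | reference()}).

%% Cancel any previous timer before arming a new one, and remember the reference.
schedule_jwks_refresh(Millis, #state{jwks_refresh_timer = OldRef} = State) ->
    ok = maybe_cancel(OldRef),
    Ref = erlang:start_timer(Millis, self(), jwks_expired),
    State#state{jwks_refresh_timer = Ref}.

maybe_cancel(undefined) -> ok;
maybe_cancel(Ref) -> _ = erlang:cancel_timer(Ref), ok.

%% erlang:start_timer/3 delivers {timeout, Ref, Msg}, so the active timer can be
%% matched by reference and a message from a superseded timer can be ignored.
handle_info({timeout, Ref, jwks_expired}, #state{jwks_refresh_timer = Ref} = State) ->
    {noreply, State#state{jwks_refresh_timer = undefined}, {continue, load_jwks}};
handle_info({timeout, _StaleRef, jwks_expired}, State) ->
    {noreply, State}.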

asabil commented Jul 5, 2024

Hm, are they setting a strange cache header that leads to continuous refreshing?

Yes, 15 seconds:

{"cache-control",
  "public, max-age=15, stale-while-revalidate=15, stale-if-error=86400"},

maennchen (Member) commented:

@asabil Sorry for the long wait.

I can't reproduce the issue. There should only be a single message triggered after the JWKS reload is successful. You are seeing hundreds of messages in the queue.

Can you try to log each time handle_info or handle_continue is called, together with the message_queue_len, so that we can figure out what path those messages take and how they manage to accumulate?
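
Something along these lines should do (a hypothetical sketch, not prescriptive; exact placement inside the worker callbacks is up to you):

%% Hypothetical logging helper, called at the top of handle_info/2 and
%% handle_continue/2 in oidcc_provider_configuration_worker.
log_callback(Callback, Msg) ->
    {message_queue_len, QLen} = erlang:process_info(self(), message_queue_len),
    io:format("[ts: ~B, qlen: ~B] ~p(~p)~n",
              [os:system_time(millisecond), QLen, Callback, Msg]).

e.g. log_callback(handle_info, jwks_expired) at the top of the jwks_expired clause.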

asabil commented Aug 5, 2024

Sorry for the delay, I finally got around to adding some logs:

[ts: 1722866476154, qlen: 0] handle_continue(load_configuration)
[ts: 1722866476629, qlen: 0] handle_continue(load_jwks)

[ts: 1722866491632, qlen: 0] handle_info(configuration_expired)
[ts: 1722866491633, qlen: 0] handle_continue(load_configuration)
[ts: 1722866491662, qlen: 0] handle_continue(load_jwks)
[ts: 1722866491811, qlen: 0] handle_info(jwks_expired)
[ts: 1722866491811, qlen: 0] handle_continue(load_jwks)

[ts: 1722866506666, qlen: 0] handle_info(configuration_expired)
[ts: 1722866506667, qlen: 0] handle_continue(load_configuration)
[ts: 1722866506793, qlen: 1] handle_continue(load_jwks)
[ts: 1722866506918, qlen: 1] handle_info(jwks_expired)
[ts: 1722866506918, qlen: 1] handle_continue(load_jwks)
[ts: 1722866506945, qlen: 0] handle_info(jwks_expired)
[ts: 1722866506945, qlen: 0] handle_continue(load_jwks)

[ts: 1722866521798, qlen: 0] handle_info(configuration_expired)
[ts: 1722866521799, qlen: 0] handle_continue(load_configuration)
[ts: 1722866521888, qlen: 0] handle_continue(load_jwks)
[ts: 1722866521918, qlen: 0] handle_info(jwks_expired)
[ts: 1722866521919, qlen: 0] handle_continue(load_jwks)
[ts: 1722866521946, qlen: 0] handle_info(jwks_expired)
[ts: 1722866521946, qlen: 0] handle_continue(load_jwks)
[ts: 1722866522394, qlen: 0] handle_info(jwks_expired)
[ts: 1722866522394, qlen: 0] handle_continue(load_jwks)

[ts: 1722866536889, qlen: 0] handle_info(configuration_expired)
[ts: 1722866536890, qlen: 0] handle_continue(load_configuration)
[ts: 1722866537015, qlen: 2] handle_continue(load_jwks)
[ts: 1722866537050, qlen: 1] handle_info(jwks_expired)
[ts: 1722866537051, qlen: 1] handle_continue(load_jwks)
[ts: 1722866537080, qlen: 0] handle_info(jwks_expired)
[ts: 1722866537080, qlen: 0] handle_continue(load_jwks)
[ts: 1722866537398, qlen: 0] handle_info(jwks_expired)
[ts: 1722866537398, qlen: 0] handle_continue(load_jwks)
[ts: 1722866537510, qlen: 0] handle_info(jwks_expired)
[ts: 1722866537510, qlen: 0] handle_continue(load_jwks)

I added empty lines to group the printouts. As you can see, we somehow end up with accumulating timers that lead to excessive calls.

The qlen stays at 0 here because it only starts to grow once auth0 (in my case) starts applying rate limiting due to the excessive calls.

I still need to test the patch I posted above extensively, but it should solve the problem.

asabil commented Aug 5, 2024

With the patch above applied, the logs look like this:

[ts: 1722867041017, qlen: 0] handle_continue(load_configuration)
[ts: 1722867041670, qlen: 0] handle_continue(load_jwks)

[ts: 1722867056676, qlen: 0] handle_info(configuration_expired)
[ts: 1722867056676, qlen: 0] handle_continue(load_configuration)
[ts: 1722867056704, qlen: 0] handle_continue(load_jwks)

[ts: 1722867071707, qlen: 0] handle_info(configuration_expired)
[ts: 1722867071708, qlen: 0] handle_continue(load_configuration)
[ts: 1722867071847, qlen: 0] handle_continue(load_jwks)

[ts: 1722867086849, qlen: 0] handle_info(configuration_expired)
[ts: 1722867086849, qlen: 0] handle_continue(load_configuration)
[ts: 1722867086887, qlen: 0] handle_continue(load_jwks)

maennchen (Member) commented:

@asabil That looks promising. I’ll go ahead and merge the PR then. Thanks for debugging.

asabil commented Aug 5, 2024

Which PR?

asabil commented Aug 5, 2024

I just looked at the PR; that will not work. The fix is the one I posted in the comments above.

jwks_refresh_timer is set to undefined in handle_info, which causes the maybe_cancel_timer call in handle_continue to be a no-op.
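
Roughly, the ordering problem is (a simplified, hypothetical sketch, not the actual oidcc or PR code):

-record(state, {configuration_refresh_timer = undefined,
                jwks_refresh_timer = undefined}).

handle_info(configuration_expired, State) ->
    %% The reference to the still-running JWKS timer is dropped here...
    {noreply,
     State#state{configuration_refresh_timer = undefined, jwks_refresh_timer = undefined},
     {continue, load_configuration}}.

%% (load_configuration eventually continues into load_jwks; omitted here.)
handle_continue(load_jwks, #state{jwks_refresh_timer = Timer} = State) ->
    %% ...so by the time this runs, Timer is already 'undefined', the cancellation
    %% below is a no-op, and the superseded timer keeps running and fires later.
    maybe_cancel_timer(Timer),
    %% (JWKS loading and re-arming of the refresh timer would happen here.)
    {noreply, State}.

%% Simplified stand-in for the real maybe_cancel_timer/1.
maybe_cancel_timer(undefined) -> ok;
maybe_cancel_timer(TimerRef) -> timer:cancel(TimerRef).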

maennchen (Member) commented:

@asabil Oh, I misinterpreted your message. I'll revert the PR.

I believe you've pinpointed the exact issue though: the timer was cleared in handle_info, so the timer cancellation can never work.

I'll revert and open a new PR to test.

maennchen (Member) commented:

@asabil Does the new PR work for you?
