
Unhandled :erpc failures #140

Closed · mjquinlan2000 opened this issue Oct 26, 2021 · 6 comments
@mjquinlan2000 commented Oct 26, 2021

I am getting a lot of errors while using the Nebulex.Caching decorators, caused by an unhandled case clause in Nebulex.RPC.rpc_call/6 at rpc.ex:141.

    defp rpc_call(_supervisor, node, mod, fun, args, timeout) do
      :erpc.call(node, mod, fun, args, timeout)
    rescue
      e in ErlangError ->
        case e.original do # Line:141 -> fails here
          {:exception, original, _} when is_struct(original) ->
            reraise original, __STACKTRACE__

          {:exception, original, _} ->
            :erlang.raise(:error, original, __STACKTRACE__)
        end
    end

The clause that does not match is {:erpc, :timeout} and some other {:erpc, _error} exceptions. Here is some background about how the app is set up:

erlang 24.1
elixir 1.12.3
running on k8s cluster with a horizontal pod autoscaler (pods created or destroyed dynamically depending on resource usage)
Nebulex set up running Nebulex.Adapters.Partitioned
Node clustering done with libcluster over kubernetes metadata API. Polling interval is set at 10s (I might lower this to see if that helps).

I think the main fix is to handle the {:erpc, _error} clause properly, but I'm having a hard time understanding the :timeout issue from the Erlang documentation. I suspect it has something to do with the system bringing k8s pods down while Nebulex is still referencing a node that no longer exists.
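Roughly, what I have in mind is adding a clause for {:erpc, reason} so the case expression can't crash. A minimal sketch only, not a patch (I'm assuming Nebulex.RPCError accepts :reason and :node fields, which may not match the real exception struct):

    defp rpc_call(_supervisor, node, mod, fun, args, timeout) do
      :erpc.call(node, mod, fun, args, timeout)
    rescue
      e in ErlangError ->
        case e.original do
          {:exception, original, _} when is_struct(original) ->
            reraise original, __STACKTRACE__

          {:exception, original, _} ->
            :erlang.raise(:error, original, __STACKTRACE__)

          # new clause: wrap :timeout, :noconnection, etc. in a proper error
          {:erpc, reason} ->
            reraise Nebulex.RPCError, [reason: reason, node: node], __STACKTRACE__
        end
    end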

A secondary concern is the number of errors that can arise when I use the caching decorators and other nodes in the cluster can't be contacted, and how those errors are handled. Right now an error is raised, and that can (and has) caused problems for some API requests. Is there a way to configure the cache to treat these errors as a "cache miss", so that on an exception I can fetch the data directly from the data store without halting execution? (A rough sketch of what I mean is after the config below.) Is there a way to set the :erpc timeout explicitly? And might my cache configuration be contributing to this as well?:

# cache module
defmodule MyApp.Cache do
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Partitioned
end

# child spec in application.ex
children = [
  {MyApp.Cache, []}
  # ... other children
]

# config.exs
config :my_app, MyApp.Cache,
  primary: [
    gc_interval: :timer.hours(1),
    backend: :shards
  ]
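
Here is the rough sketch I mentioned of treating a failed RPC as a cache miss; the function names (get_user/1, fetch_user_from_db/1) and the key shape are made up for illustration, they are not Nebulex options:

def get_user(id) do
  case safe_cache_get({:user, id}) do
    nil -> fetch_user_from_db(id)
    user -> user
  end
end

defp safe_cache_get(key) do
  MyApp.Cache.get(key)
rescue
  # e.g. the unhandled :erpc failures above -- treat them as a cache miss
  _error -> nil
end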

I'm not sure if a multi-level cache would help with this problem (and I probably haven't given this part of the app enough love or done nearly enough research on caching strategies). Any help here would be greatly appreciated.

Thank you

@cabol (Owner) commented Oct 27, 2021

Hey!

The clause that does not match is {:erpc, :timeout} and some other {:erpc, _error} exceptions
I think the main solution is to handle the {:erpc, _error} clause in a better way

Yeah, very good catch. I already pushed a fix for it; the {:erpc, reason} case is now handled.

but I'm having a hard time understanding the :timeout issue when reading through the erlang documentation. I suspect it has something to do with when the system brings k8s pods down and nebulex is referencing a node that no longer exists.

Well, I'm not sure, because according to the documentation, if the node were down the error should be {:erpc, :noconnection} instead of {:erpc, :timeout}. Does this happen exactly when the system brings k8s pods down?

A secondary concern that I have is when I use the caching decorators there are a number of errors that can arise from not being able to contact other nodes in the cluster and how those errors are handled. It seems to me that right now an error is raised, but this can and has been causing issues with some api requests. Is there a way that I could configure the cache to treat these errors as a "cache miss" so that I can fetch data directly from the data store on an exception without halting the execution? Is there a way for me to set the :erpc timeout explicitly? Also, might my cache configuration be causing issues with this as well?:

Right. Actually, I've been working on a feature to allow ignoring the errors when using the annotations, something like on_error: :raise (the default) or on_error: :nothing (in that case the function logic is executed and the cache errors are ignored). It will be available very soon, most likely this weekend. But yes, I agree, this is a needed and requested feature.
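For example, something roughly like this (illustrative only; the exact option values may change when the feature lands, and Accounts.get_account!/1 is just a placeholder for your data access):

# assumes `use Nebulex.Caching` in the module
@decorate cacheable(cache: MyApp.Cache, key: id, on_error: :nothing)
def get_account(id) do
  # runs as usual; any cache error is ignored instead of being raised
  Accounts.get_account!(id)
end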

I'm not sure if a multi-level cache would help with this problem (and I probably haven't given this part of the app enough love or done nearly enough research on caching strategies). Any help here would be greatly appreciated.

Well, if you need a distributed/partitioned cache and you don't want to deal with an Elixir/Erlang cluster, you can use the Redis adapter. And yes, you could set up a multi-level topology with the local adapter as L1 and the Redis adapter as L2 (backed by a Redis cluster, etc.). See the Redis example for more information.
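Something along these lines, simplified (double-check the exact options against the multilevel and Redis adapters' docs for your versions):

defmodule MyApp.Cache do
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Multilevel

  defmodule L1 do
    use Nebulex.Cache,
      otp_app: :my_app,
      adapter: Nebulex.Adapters.Local
  end

  defmodule L2 do
    use Nebulex.Cache,
      otp_app: :my_app,
      adapter: NebulexRedisAdapter
  end
end

# config.exs
config :my_app, MyApp.Cache,
  model: :inclusive,
  levels: [
    {MyApp.Cache.L1, gc_interval: :timer.hours(1), backend: :shards},
    {MyApp.Cache.L2, conn_opts: [host: "127.0.0.1", port: 6379]}
  ]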

Stay tuned!

@mjquinlan2000 (Author) commented Oct 27, 2021

@cabol Thank you for the quick response.

Upon further investigation, I am getting a lot of {:erpc, :noconnection} errors too. I can't really seem to figure out the scenarios that produce either outcome, but I will keep monitoring it. I pushed a patch for the servers yesterday to use a multilevel cache with L1 being Nebulex.Adapters.Local and L2 being Nebulex.Adapters.Partitioned. This seemed to cut down on the errors quite a bit, but I'm still seeing them. I did notice that some pods crashed too and I'm going to try and fix that to see if this is related.

An option to ignore exceptions when using the cache decorators would be immensely helpful.

I am looking at the Redis example and it looks like a possible solution (we already have a Redis cluster in our k8s env for other things). I was concerned about using Erlang terms as keys and/or values, but I see that the adapter handles that. I will take a look at it sometime soon.

Thanks again for the response and I'll keep you updated if I find anything. I look forward to the new features!

@cabol (Owner) commented Oct 29, 2021

@mjquinlan2000 just pushed the new feature for caching decorators (#141), please try it out, let me know your thoughts.

@cabol (Owner) commented Oct 31, 2021

Hey @mjquinlan2000, some thoughts about how you can handle these errors.

  1. If you are using the caching annotations, then with the fix I pushed you can at least ignore the errors. I'm not sure how ignoring them will affect your application; maybe it's not an issue for you, since if something isn't working with the partitioned cache because of a cluster issue, the function block is executed anyway, and once the issue is fixed the cache will keep working. Let me know if this works for you.

  2. Another option I've found very nice is using retry, a library that lets you handle scenarios like this one. For example, instead of ignoring the errors, you could retry for some period of time and, as a fallback, just return some default value. You can use retry in several ways; let me give you a couple of examples:

If you are using/calling the cache directly (without annotations/decorators), you can wrap the logic like:

# `use Retry` must be in scope so the retry macro and constant_backoff/1 are available
retry with: constant_backoff(1000) |> Stream.take(10), rescue_only: [Nebulex.RPCError] do
  MyPartitionedCache.get(key)
after
  result -> result
else
  _error -> nil
end

Or you can use the @retry annotation too:

@retry with: constant_backoff(1000) |> Stream.take(10), rescue_only: [Nebulex.RPCError]
def some_function(attrs) do
  values = MyPartitionedCache.get(attrs[:id]) # just an example
  # rest of the logic ...
end

And you can also use @retry and the caching annotation together, and have something like:

@retry with: constant_backoff(1000) |> Stream.take(10), rescue_only: [Nebulex.RPCError]
@decorate cacheable(cache: MyPartitionedCache, key: attrs.id)
def some_function(attrs) do
  # your logic ...
end

I think this way you can control the RPC errors in a better way, even without ignoring them. And the good thing about the retry approach is that you can also combine it with the caching annotations. Anyway, try it out and let me know if this works for you too.

Also, as a heads-up, Nebulex v3 is in progress, and one of its main features is the new API: it will provide an Ok/Error tuple API as well (aside from the ! functions), giving the user the chance to handle the errors. But this is more of a mid-term solution, because the first release candidate is scheduled for the end of January next year.
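Just to illustrate the idea, the shape would be something like this (hypothetical; the v3 function names and return values are not final):

# bang version keeps raising on error
value = MyApp.Cache.get!("key")

# tuple version lets you handle the error yourself
case MyApp.Cache.get("key") do
  {:ok, value} -> value
  {:error, _reason} -> nil
end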

Wrapping up: try the alternatives out and let me know your thoughts. I'll stay tuned!

Thanks!

@mjquinlan2000 (Author) commented:

@cabol I have seen the retry library before, but I have never used it. Most of the cache errors that concern me happen on user requests, and I'd rather just ignore the raised error and fall back to the DB so I don't keep users waiting. I think the retry library will actually help me in some other areas of the application, so I'm going to look into using it.

After more investigation, it's plain to see that something is causing the pods to crash in k8s, and this abrupt termination is what causes the :erpc errors. In all other cases, the Erlang nodes are shut down gracefully and caching is handled correctly. Since the multilevel cache with the local and partitioned adapters was put in, I am only getting a small number of errors, so I am going to try to fix the pod crashes first.

I have not had a chance to look into the new decorator implementation and I am slated to be doing other work for the entire month of November, but it is on my list and I'll try to sneak some testing in.

Thanks

@cabol (Owner) commented Nov 2, 2021

Sounds good. In the meantime, I will close the issue, but feel free to reopen it or even create a new one if you come across something not working properly. Stay tuned, and thanks!
