
Setting name for Replicated Cache returns no node #101

Closed
gaiabeatrice opened this issue Feb 3, 2021 · 17 comments

@gaiabeatrice

Hello, this is one of my favorite Elixir libraries!

I am having one issue (running 2.0.0-rc.2):

I have a cache that looks like this:

defmodule MyApp.Caches.MyCache do
  use Nebulex.Cache,
    default_dynamic_cache: :my_cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Replicated
end

I tried setting the name in the children list of the Application module:

children = [
    # ...
    {MyApp.Caches.MyCache, [name: :my_cache]}
    # ...
]

I also have a test helper for async testing that looks like this (I call this in the setup of the tests)

  defmacro setup_with_dynamic_cache(cache) do
    quote do
      setup do
        cache = unquote(cache)

        default_dynamic_cache = cache.get_dynamic_cache()

        {:ok, pid} = cache.start_link()
        _ = cache.put_dynamic_cache(pid)

        on_exit(fn ->
          cache.put_dynamic_cache(default_dynamic_cache)
        end)

        {:ok, cache: cache, cache_pid: pid}
      end
    end
  end
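
A test module calling this helper looks roughly like this (the helper module and test names here are just placeholders for illustration):

defmodule MyApp.SomeFeatureTest do
  use ExUnit.Case, async: true

  # hypothetical module where setup_with_dynamic_cache/1 is defined
  import MyApp.CacheTestHelpers

  setup_with_dynamic_cache(MyApp.Caches.MyCache)

  test "reads and writes go through the test-owned cache instance", %{cache: cache} do
    :ok = cache.put("key", "value")
    assert cache.get("key") == "value"
  end
end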

If I run this, the tests pass and the app runs. However, if I try to get the nodes, I get an empty list:

MyCache.nodes()
# returns []

This is causing errors in production when some cron jobs run, saying:

could not lookup Nebulex cache MyApp.Caches.MyCache.Primary because it was not started or it does not exist

I tried making some changes, for example:

defmodule MyApp.Caches.MyCache do
  use Nebulex.Cache,
    default_dynamic_cache: __MODULE__,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Replicated
end

and removing the name from the children list in Application

children = [
    # ...
    MyApp.Caches.MyCache,
    # ...
]

and I can now see the nodes, however my tests are now failing with:

** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<0.3318.0>}}

I also tried setting the name in my configuration, but I still get errors about the nodes not showing up.

What is the best way to navigate this?

@cabol
Owner

cabol commented Feb 3, 2021

Hi @gaiabeatrice, glad to know you use and like the project :)

About the error: when you give a name to a cache, internally the adapter uses that name to create and name the cluster and/or group, along with other things (the module cannot be used because the same cache module may be used for different cache instances; that's why the name is used internally instead of the cache module). So when you call MyCache.nodes(), it uses the default dynamic cache, which is the module name, and that causes the error because it is not the name you started the cache with. Instead, you should call:

MyCache.nodes(:my_cache) #=> you can pass the name or the pid

Or you can also set the default in the cache definition, like so:

defmodule MyApp.Caches.MyCache do
  use Nebulex.Cache,
    default_dynamic_cache: :my_cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Replicated
end

Since :my_cache is the name you used to start the cache, it is the name you have to use in the previous two options.
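
Just to make it concrete, assuming the cache was started with name: :my_cache, these two calls end up doing the same thing (the second one only once default_dynamic_cache: :my_cache is set):

MyCache.nodes(:my_cache)
MyCache.nodes()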

Another option is to not pass the name at all; you can just start the cache as:

children = [
    # ...
    {MyApp.Caches.MyCache, []}
    # ...
]

And keep using the dynamic cache for the tests as you are doing. That means in production you go with the default definition and setup, so you can call MyCache.nodes(), and in the tests, since you are using a dynamic cache, there shouldn't be an issue.

I'd rather go with the second option.

Please try it out and let me know if that works for you, stay tuned!

@gaiabeatrice
Author

Thank you so much for the prompt response.
I went with the second option. The errors when the cron jobs trigger seem to have gone away too. I am still monitoring to make sure it is working consistently.

I think my tests were failing because two caches were involved in the module. I added __MODULE__ to the macro and I don't get that {:error, {:already_started, #PID<xxx>}} error anymore. For some reason, having the name in the Application module was not causing it to happen.

@cabol
Owner

cabol commented Feb 3, 2021

Thank you so much for the prompt response.
I went with the second option. The errors when the cron jobs trigger seem to have gone away too. I am still monitoring to make sure it is working consistently.

Glad to hear that!

I think my tests were failing because two caches were involved in the module. I added __MODULE__ to the macro and I don't get that {:error, {:already_started, #PID<xxx>}} error anymore. For some reason, having the name in the Application module was not causing it to happen.

Yeah, maybe there is a conflict with the names. I would need more info about how you are doing the tests; if you can share something similar to what you have and how you are running the tests, perhaps I could help you more. Also, the Nebulex tests themselves are done that way, using dynamic caches, so maybe you can use them as an example.

Anyway, let me know if there is anything I can help you with, I'm glad to help 😄!

@gaiabeatrice
Author

I am still having production errors, and I think it was probably unrelated to the name after all.

Every day at the same time I get this error:

(RuntimeError could not lookup Nebulex cache MyApp.Caches.MyCache.Primary because it was not started or it does not exist)

for my caches.

Yesterday I started getting this error as well:

GenServer MyApp.Caches.MyCache.Bootstrap terminating
** (MatchError) no match of right hand side value: {[:ok, :ok, :ok], [:"<serveraddress>"]}
    (nebulex 2.0.0-rc.2) lib/nebulex/adapters/replicated.ex:587: Nebulex.Adapters.Replicated.Bootstrap.maybe_run_on_nodes/3
    (nebulex 2.0.0-rc.2) lib/nebulex/adapters/replicated.ex:516: Nebulex.Adapters.Replicated.Bootstrap.handle_info/2
    (nebulex 2.0.0-rc.2) lib/nebulex/adapters/replicated.ex:527: Nebulex.Adapters.Replicated.Bootstrap.handle_info/2
    (stdlib 3.14) gen_server.erl:689: :gen_server.try_dispatch/4
    (stdlib 3.14) gen_server.erl:765: :gen_server.handle_msg/6
    (stdlib 3.14) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

All of the caches have a similar configuration, so for the sake of clarity here is one:

config :myapp, MyApp.Caches.MyCache,
  primary: [
    gc_interval: :timer.hours(24)
  ],
  stats: true

and this is the cache

defmodule MyApp.Caches.MyCache do
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Replicated,
    primary_storage_adapter: Nebulex.Adapters.Local
end

If I ssh to the cluster I can see that all the caches seem to be running and have entries.

@cabol
Owner

cabol commented Feb 5, 2021

First, about this error: ** (MatchError) no match of right hand side value: {[:ok, :ok, :ok], [:"<serveraddress>"]}

When the replicated cache starts, it tries to synchronize with the other nodes, and as far as I can see, three nodes reply :ok but this one, :"<serveraddress>", fails, as if the cache were not running there. Also, the name looks very weird, like a short name, maybe local, not sure.

Regarding this, I assume you have 4 nodes, is that right? For example, what are the names of the other three nodes?

If I ssh to the cluster I can see that all the caches seem to be running and have entries.

Out of curiosity, if you do ssh to the cluster, what nodes do you get if you run MyCache.nodes()?

@cabol
Owner

cabol commented Feb 7, 2021

@gaiabeatrice I found the issue: during the sync-up process the replicated adapter wasn't using the cache name, always assuming the default, therefore dynamic caches would fail. I'll push a fix today, I'll keep you posted!

@cabol
Owner

cabol commented Feb 7, 2021

@gaiabeatrice I pushed a fix to the master branch, could you try it out and let me know if it actually solves the error you currently have? In case you still get the error, we should at least get a more specific error message and see where the problem is. Stay tuned!

@cabol
Owner

cabol commented Feb 11, 2021

@gaiabeatrice any update?

@gaiabeatrice
Author

Hi, I am sorry, but for some reason my GitHub notifications were turned off.

My errors stopped appearing a few days ago. I didn't use the latest master; I am still on 2.0.0-rc.2. I am not sure if something else was causing problems and, once that got fixed, it fixed this too.

I will absolutely try your fix if the error presents itself again.

I am really impressed by how responsive you are! This is one of my favorite libraries!

@cabol
Owner

cabol commented Feb 12, 2021

Glad to hear that! I will close the issue, but please feel free to re-open it if the error shows up again even with the fix. Thanks!!

@cabol cabol closed this as completed Feb 12, 2021
@dwmcc

dwmcc commented Feb 14, 2021

Hi @cabol - thanks for a great library!

I seem to also be running into a similar issue with the Replicated adapter, even when running from the latest master commit.
I've got two iex sessions running on my local machine:
iex --name node1@127.0.0.1 --cookie cache -S mix and iex --name node2@127.0.0.1 --cookie cache -S mix

I am using libcluster with the EPMD strategy, so all I specify in the config is the names of the two hosts, node1@127.0.0.1 and node2@127.0.0.1 and they'll form a cluster.

The module for my replicated cache is pretty barebones:

defmodule MyApp.Cache.Replicated do
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Replicated
end

And for config:

config :my_app, MyApp.Cache.Replicated,
  primary: [
    backend: :shards
  ]

After starting the iex sessions, I can confirm that both nodes see each other. "Putting" a value in the replicated cache on one node and then "getting" the value from the second node works as expected. Additionally, I've confirmed with:

iex(node1@127.0.0.1)> MyApp.Cache.Replicated.nodes()
[:"node1@127.0.0.1", :"node2@127.0.0.1"]

However, terminating node1 (I am running :init.stop()) while simultaneously running a "take" command on node2 for a value that was previously set on node1 results in an error:

iex(node2@127.0.0.1)> MyApp.Cache.Replicated.take("some-key")

** (Nebulex.RPCMultiCallError) RPC error executing action: take

Errors:

[
  {{:error,
    {:exception,
     %RuntimeError{
       message: "could not lookup Nebulex cache MyApp.Cache.Replicated.Primary because it was not started or it does not exist"
     },
     [
       {Nebulex.Cache.Registry, :lookup, 1,
        [file: 'lib/nebulex/cache/registry.ex', line: 23]},
       {Nebulex.Adapter, :with_meta, 2,
        [file: 'lib/nebulex/adapter.ex', line: 37]},
       {Nebulex.Adapters.Replicated, :with_dynamic_cache, 3, []}
     ]}}, :"node1@127.0.0.1"}
]

    (nebulex 2.0.0-rc.2) lib/nebulex/adapters/replicated.ex:468: Nebulex.Adapters.Replicated.handle_rpc_multi_call/2
    (nebulex 2.0.0-rc.2) lib/nebulex/adapters/replicated.ex:184: Nebulex.Adapters.Replicated.do_transaction/6
    (kernel 7.1) global.erl:425: :global.trans/4
    (myapp 0.1.1) [...]
    (stdlib 3.13.2) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13.2) gen_server.erl:431: :gen_server.loop/7
    (stdlib 3.13.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

As you can see, it's the same error @gaiabeatrice posted earlier in the thread.
Notice the error specifically cites node1, so I'm assuming the replicated adapter is trying to run the RPC call on node1 which is at this point inaccessible?

Running MyApp.Cache.Replicated.nodes() directly before the take on node2 seems to confirm this suspicion as it returns [:"node1@127.0.0.1", :"node2@127.0.0.1"].

To further corroborate this, the following code, run at the same time, doesn't error and returns the expected value:

MyApp.Cache.Replicated.transaction [nodes: [node()]], fn ->
  state = MyApp.Cache.Replicated.take("some-key")
  Logger.debug("cached value is #{inspect(state)}")
end

For context, I'm trying to actively hand off some state from a terminating node, so running :init.stop() both triggers a value to be saved on node1 and then causes node2 to "take" the value. Sometimes :init.stop() can take several seconds while it waits for all processes to shut down. Maybe this latency while terminating causes the issue, as node2 thinks node1 is still alive?
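
Roughly, the hand-off looks like this (a simplified sketch; the worker module, key, and default state are placeholders):

defmodule MyApp.StatefulWorker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def init(_opts) do
    # trap exits so terminate/2 runs while the node is shutting down
    Process.flag(:trap_exit, true)
    # the worker restarted on the surviving node picks the state back up
    state = MyApp.Cache.Replicated.take("some-key") || %{}
    {:ok, state}
  end

  # on the terminating node, stash the current state in the replicated cache
  def terminate(_reason, state) do
    MyApp.Cache.Replicated.put("some-key", state)
    :ok
  end
end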

I can continue simply wrapping it in the transaction per the snippet above, but I figured it wasn't the desired behavior from the caching side.

Thanks for any thoughts!

@cabol cabol reopened this Feb 14, 2021
@cabol
Owner

cabol commented Feb 14, 2021

Hey @dwmcc!! Certainly, it is the same error, but as far as I can see, the cause is different. Perhaps the real cause of the error is the one you are mentioning, which makes a lot of sense. The initial issue was about using the replicated adapter with dynamic caches, and that is what the fix addressed. But this is a different thing that has nothing to do with dynamic caches and names at all. So, let me elaborate more on what you have reported, and then maybe we can create a separate ticket for it, but let's discuss it here first.

I was able to reproduce the error you mention, and you're right, it is about timing (BTW, you don't even need the transaction; adding a short sleep will also prevent the error, but I totally agree that shouldn't be the solution). When a cache node is terminated, the cluster view (Cache.nodes()) is not updated immediately (especially when :pg is used underneath), so if another cache command takes place at the same time, a Nebulex.RPCMultiCallError may be raised. In this case node1 is still reachable, but the cache has already been stopped on that node, so you get the Nebulex.RPCMultiCallError, which also makes sense because the cache is not actually running there anymore.
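
Just to illustrate what I mean by the short sleep (only to show it is a timing issue, definitely not a fix):

# give the cluster view a moment to drop node1 before issuing the command
Process.sleep(1_000)
MyApp.Cache.Replicated.take("some-key")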

There are a couple of things that come to mind right now:

  1. The quickest workaround could be to catch the exception and ignore it, because eventually the failed node will be removed and, on the other hand, the command has already taken place on the other nodes, so there shouldn't be an issue (see the sketch after this list). The workaround looks a bit like "defensive programming", but this is also a specific scenario where you are terminating nodes often while other nodes keep executing commands normally, so maybe it makes sense?
  2. Another alternative would be to implement some sort of retry mechanism, so that when you get the error it retries until it eventually succeeds; you could use ElixirRetry. The only thing is you would have to wrap every cache call in this retry logic, which is a bit cumbersome.
  3. I was thinking of adding a new option, let's call it :on_rpc_error, with two values: :raise (default maybe?) and :nothing. With on_rpc_error: :nothing, a Nebulex.RPCMultiCallError would simply be ignored. There shouldn't be an issue because, again, if the node that returned an error was terminated, it will eventually be removed from the cluster view. I can make the option available per command and at initialization time, so you can set up an initial behavior but override it at any time by passing the option with a different value. The problem is it may be dangerous, because there may be other errors for which we do want the exception; I'll think more about it!
  4. Maybe, from the adapter, just skip the Nebulex.RPCMultiCallError when the cause is %RuntimeError{message: "could not lookup Nebulex cache MyApp.Cache.Replicated.Primary because it was not started or it does not exist"}, which means the cache doesn't exist on that node; perhaps making raising the exception optional. This could also be dangerous, because if you are using dynamic caches and for some reason there is an error, you may want to get the exception in order to fix it. So, overall, I think these specific scenarios become very tricky to handle from the adapter itself.
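
For example, option 1 could look something like this (just a sketch; the wrapper module is made up, and you would want to be careful about which errors you swallow):

defmodule MyApp.SafeCache do
  require Logger

  # wrap the calls that may hit a node that has just left the cluster
  def take(key) do
    MyApp.Cache.Replicated.take(key)
  rescue
    e in Nebulex.RPCMultiCallError ->
      # the failed node will be dropped from the cluster view eventually, and the
      # command already ran on the healthy nodes (returning nil here is a simplification)
      Logger.warn("ignoring RPC error on take: #{Exception.message(e)}")
      nil
  end
end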

In the meantime, I will continue trying different alternatives. Let me know your thoughts!

@cabol
Owner

cabol commented Feb 14, 2021

@dwmcc you mentioned you're calling :init.stop() to terminate the node; I'm wondering if it would be possible to try something out: instead of calling :init.stop() directly, create a function that wraps the stopping logic, like so:

def stop do
  :ok = MyApp.Cache.Replicated.leave_cluster()
  :init.stop()
end

I pushed some changes adding leave_cluster/0 and join_cluster functions, so you'll have to try it out with the latest master branch. Let's see if that solves the issue. Stay tuned!

@dwmcc

dwmcc commented Feb 14, 2021

Good morning @cabol -- I'm thoroughly impressed with the response time here!

TL;DR: I just pulled your changes from master and I've tested four times now; previously I was able to repro the error nearly every time, but now I'm not able to reproduce it at all! Seems to me that leaving the Nebulex cluster is the cleanest solution on my end.


Thanks for listing those potential solutions; I largely agree with you. The specific scenario where I'm experiencing this issue is handing off the state of a GenServer from the terminating node and restarting it on a running node.

Solution (2) is a good idea, albeit I was hoping to find a cleaner way rather than repeatedly retrying (also since the new GenServer may need to do some work right away after init!).

I also checked the docs for the options I could pass to take/2, hoping one of the options was basically what you proposed in solution (3) above (similar to the on_conflict option in previous Nebulex versions), which is when I realized I could run the command within a transaction targeting only the local node.

All of that is to say, I think these are all potentially valid solutions. I didn't get around to it last night but I was going to try to force the node to leave the libcluster-formed cluster before running :init.stop(), but the solution you pushed is a much more elegant version of that since it scopes that action to the Nebulex cluster directly.

If you're amenable to keeping these new functions in the replicated and partitioned adapters, I think that's an excellent solution for my use-case. Thanks for pushing these changes so quickly!

@dwmcc

dwmcc commented Feb 14, 2021

Tangentially, for my understanding, why was Nebulex attempting to run an RPC call on node1 for a take/2 operation?

Is it the after-get deletion action of the take/2 that needs to be replicated to other nodes, notifying them the cache no longer has the value that was "taken"?

Would a get/2 call not run an RPC call with the Replicated adapter, since the data should be local to the node in question?

@cabol
Owner

cabol commented Feb 14, 2021

TL;DR: Just pulled your changes from master and I've tested four times now, previously I was able to repro the error nearly every time but now I'm not able to reproduce it at all! Seems to me that leaving the Nebulex cluster is the cleanest solution on my end.

Totally agree, leaving the cluster is the cleanest way; the other alternatives may have important side effects.

Tangentially, for my understanding, why was Nebulex attempting to run an RPC call on node1 for a take/2 operation?
Is it the after-get deletion action of the take/2 that needs to be replicated to other nodes, notifying them the cache no longer has the value that was "taken"?

Indeed, the take/2 operation removes the key from the cache; that's why it needs to be replicated, because there is an implicit write operation.

Would a get/2 call not run an RPC call with the Replicated adapter, since the data should be local to the node in question?

Right, with get/2 you should not face the same issue because, as you say, the key is retrieved from the local node; that's the principle. Anyway, you can try it out.
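
Roughly:

# get/2 only hits the local replica, so no RPC to the other nodes
MyApp.Cache.Replicated.get("some-key")

# take/2 reads locally but also deletes the key, and that delete has to be
# replicated via RPC to the other nodes, which is where the error came from
MyApp.Cache.Replicated.take("some-key")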

If you're amenable to keeping these new functions in the replicated and partitioned adapters, I think that's an excellent solution for my use-case. Thanks for pushing these changes so quickly!

Yes, count on it: join_cluster and leave_cluster will be part of the adapter and of v2.0.0, which I hope to publish by the end of next week.

@dwmcc

dwmcc commented Feb 14, 2021

@cabol excellent!

I'll continue using the leave_cluster function as it seems to solve the issue we discussed.
I'm realizing I could have worked around this issue by "putting" the value with a TTL on the node that's shutting down, while "getting" the value on the remaining node, so no RPC call is necessary. But even if I went this route, I'm sure I'd find myself wanting to run some command with an underlying RPC call at some point in the future.
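
Just to illustrate that workaround (the key, state, and TTL are arbitrary here):

# on the node that is shutting down: write the hand-off state with a TTL,
# so the entry expires on its own instead of needing a replicated take
MyApp.Cache.Replicated.put("some-key", %{queue: [], counter: 0}, ttl: :timer.seconds(30))

# on the remaining node: a plain get stays local, so no RPC is needed
MyApp.Cache.Replicated.get("some-key")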

Thanks for your help and looking forward to v2.0.0! 🎉

@cabol cabol closed this as completed Feb 20, 2021