Nodes that are gracefully shutting down trigger Nebulex errors #113
The cache configuration is this:

    defmodule UsersCache do
      use Nebulex.Cache,
        otp_app: :test_app,
        adapter: Nebulex.Adapters.Replicated,
        primary_storage_adapter: Nebulex.Adapters.Local
    end
I wonder if I should trap the exit signal and leave the cluster in each cache? Or maybe we should do that automatically in each cache.
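A minimal sketch of that trap-the-exit idea, done outside Nebulex, could look like the following: a small process placed after the cache in the supervision tree, so its terminate/2 runs before the cache stops on shutdown. The on_shutdown callback is a hypothetical placeholder for whatever "leave the cluster" means in your setup; it is not a Nebulex API.

```elixir
defmodule MyApp.CacheShutdownGuard do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # Trapping exits ensures terminate/2 is invoked when the supervisor shuts us down.
    Process.flag(:trap_exit, true)
    {:ok, Keyword.get(opts, :on_shutdown, fn -> :ok end)}
  end

  @impl true
  def terminate(_reason, on_shutdown) do
    # Run the caller-supplied cleanup (e.g. leaving the cache cluster) before the node goes away.
    on_shutdown.()
  end
end
```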
Hi! Could you paste the whole stack trace? Since you are using the Replicated adapter, I'd like to see where this exception is being generated from; I think it should be during the bootstrap of the replicated cache, but I'd like to confirm. Thanks!!!
@SophisticaSean any update? If you could provide the whole stack trace, that would be great, especially to see where in the replicated adapter the exception originates. Stay tuned!
We also have a similar issue; it's a little different in that the error we receive is […]. I think the core problem for me is that I don't see anywhere in the docs that a […]. Taking the […], I'd never know looking at that that this method could raise an exception 😢

The issue seems to be that for replicated transactions a single error from any node is raised: nebulex/lib/nebulex/adapters/replicated.ex, line 498 (commit 7dcdd30).

In scenarios where nodes are taken down as part of rolling upgrades etc., I would have expected this kind of problem to be more common, or maybe we're missing something in our setup 🤔
Definitely, I need to check, because one of the challenges with OTP 23 and […]
Usually yes, but in Nebulex (as in Ecto, for example), if there is an internal issue at runtime with the backend, you get an exception. For example, if in Ecto you do […]
Yeah, I guess you have a point there! I just don't think I researched the replicated cache well enough before using it, and that is my fault; I wasn't aware of its atomic nature. This kind of atomic cache update may be useful for some people and scenarios, it just doesn't fit our use case very well.

It might be worth adding a flag that, when set to true, effectively ignores failures to update the cache on other nodes, for people who don't need the guarantee that the caches are in sync. It could even cast the update to the other nodes when true: if it makes it, great; if not, no big deal. That would minimise wait times on puts etc. There could perhaps be some kind of background task that syncs them up at a configured interval, if that's something people would want.

EDIT: Looking at the code, perhaps it's not truly atomic, in that the local cache plus some nodes probably receive the update, but it is kind of all-or-nothing in terms of erroring out. Perhaps the solution here is that you can choose to make it atomic or not: in atomic mode, no cache update is made and an error is raised if all caches couldn't be updated. Right now it seems to be a kind of hybrid, in that some caches will be updated but an error will also be thrown.
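A rough sketch of that fire-and-forget idea, approximated from outside Nebulex rather than as an adapter flag: the write happens in a background task, so a dead or shutting-down replica can neither crash nor block the caller. UsersCache is the replicated cache from this thread; rescuing every error is an assumption you would likely narrow to specific exceptions.

```elixir
defmodule BestEffortCache do
  require Logger

  # Write in a background task so a failing replica cannot crash or block the caller.
  def put_async(key, value, opts \\ []) do
    Task.start(fn ->
      try do
        UsersCache.put(key, value, opts)
      rescue
        error ->
          Logger.warn("best-effort put of #{inspect(key)} failed: #{Exception.message(error)}")
      end
    end)

    :ok
  end
end
```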
Indeed, depending on your needs, you pick one cache topology or another. The replicated topology/adapter is good if most of the operations you perform are reads, because reads are local, as if you were hitting a local cache; for writes, there is always a replication/sync process underneath, which may be expensive depending on the volume of writes. The partitioned topology, on the other hand, is great if you don't care about losing part of the data when a node goes down, which actually shouldn't matter because it is a cache, so you should always provide a mechanism to load/cache the data back again. The replicated adapter is very challenging because the idea is to keep strong consistency across the cluster; otherwise, it wouldn't make sense. In the near future, perhaps the next minor release, my plan is to include Mnesia as a backend too, and for the replicated adapter let Mnesia handle the replication and the cluster stuff, which will be much better. In the meantime, I will fix these scenarios.
Yes, that is also a good idea, I'll check it out; actually, I'm working on this now! The plan is to fix this issue in the next two days. And in the next release, again, I will introduce Mnesia to have a better implementation of the Replicated adapter.
Hey! I've pushed some fixes to the master branch to better handle the errors when nodes are removed (or added). I tested them already and they seem to work, but it would be great if you could also test the changes and give me your feedback (whether the fix works as expected or not). Thanks! Stay tuned!
I'm testing the new changes in production today. Will report back. This was happening every day under the proper circumstances, so I should be able to replicate it easily.
OK, I've replicated the error again; it's slightly different but mostly the same.

Before the changes in this issue:

    ** (RuntimeError) could not lookup Nebulex cache testapp.Caches.RewardPointsCache because it was not started or it does not exist
        (nebulex 2.0.7) lib/nebulex/cache/registry.ex:23: Nebulex.Cache.Registry.lookup/1
        (nebulex 2.0.7) lib/nebulex/adapter.ex:38: Nebulex.Adapter.with_meta/2
        (testapp 0.1.0) lib/testapp/reward.ex:1: testapp.Reward.do_get_reward_points/1
        (testapp 0.1.0) lib/testapp/reward.ex:44: testapp.Reward.get_reward_points/1
        (testapp 0.1.0) lib/testapp/reward/event.ex:94: anonymous fn/1 in testapp.Reward.Event.process_spend_transaction/1
        (ecto_sql 3.5.4) lib/ecto/adapters/sql.ex:1027: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
        (db_connection 2.3.1) lib/db_connection.ex:1444: DBConnection.run_transaction/4
        (testapp 0.1.0) lib/testapp/reward/storage.ex:19: testapp.Reward.Storage.transaction/1

After:

    ** (Nebulex.RegistryLookupError) could not lookup Nebulex cache testapp.Caches.RewardPointsCache because it was not started or it does not exist
        (nebulex 2.0.8) lib/nebulex/cache/registry.ex:22: Nebulex.Cache.Registry.lookup/1
        (nebulex 2.0.8) lib/nebulex/adapter.ex:38: Nebulex.Adapter.with_meta/2
        (testapp 0.1.0) lib/testapp/reward.ex:1: testapp.Reward.do_get_reward_points/1
        (testapp 0.1.0) lib/testapp/reward.ex:44: testapp.Reward.get_reward_points/1
        (testapp 0.1.0) lib/testapp/reward.ex:77: testapp.Reward.insert_reward_transaction/2
        (testapp 0.1.0) lib/testapp/reward.ex:181: testapp.Reward.process_earn_transaction/1
        (testapp 0.1.0) lib/testapp/reward/consumers/earn.ex:28: testapp.Reward.Consumer.Earn.handle_message/3
        (broadway 0.6.2) lib/broadway/topology/processor_stage.ex:158: Broadway.Topology.ProcessorStage.handle_messages/4

The nebulex version numbers are the ones we use in our private hex repo.
I'm glad it's no longer a runtime error, but for me these are completely expected when we shut down a node and shouldn't throw an error. How do you recommend I go about dealing with this error? I'm wondering if we need to trap exit signals in nebulex and gracefully leave the cluster on […]
@SophisticaSean could you describe the scenario? Because, with the fix, when nodes leave the cluster, the error from remote stopped caches is skipped. I'm wondering how you are reproducing the error, so I can reproduce it too. On the other hand, the exception is OK if you attempt to hit a cache that is stopped; anyway, if you can tell me how you are reproducing this error, that would be great.
@SophisticaSean for me the patch worked; at least for now I haven't seen the errors resurfacing. Looking at the type of error, I'd hazard a guess your application child_spec order may not be optimal? If you have processes above the cache in your child_spec list that use the cache, you may see errors like that. This is because on teardown the supervisor stops children in reverse order of the original list, meaning your cache may stop before the processes using it do. If it's not that, then perhaps you're hitting an edge case we aren't 😞
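To make the ordering point concrete, a child_spec list following that advice might look like this (module names other than UsersCache are placeholders for your own processes):

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # The cache starts first, so on shutdown it is the last child to stop.
      UsersCache,
      # Anything that uses the cache comes after it, so it terminates before the cache does.
      MyApp.SomeWorker,
      MyApp.SomeConsumer
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```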
Thanks @hazardfn, my caches are definitely last on the totem pole, so I'll move them up to the beginning of my child_spec list. I'll test that again today. @cabol, the scenario is: […]
@SophisticaSean great to hear that!
Yes, I added a log trace, at least to report what is happening; not sure if it is useful though, but any feedback is welcome, and I wouldn't mind removing the log trace if it does not add any value. @hazardfn thanks a lot for your help too! (BTW, I think it would be good to add a tip about the cache and the app supervision tree somewhere; it is not obvious.)
Yeah, if it's not something we can completely eliminate, it's fine. I'd just prefer it to be a […]
@cabol I'm having the same error on the local adapter as well. It looks like it is trying to look up an :ets reference that doesn't exist anymore, because I get both the good-looking Nebulex error message and a :badarg Erlang message. It happens when I take down the application before deploying a new version.

My cache:

    defmodule AntecedentesApi.Caches.ProcessosApi do
      use Nebulex.Cache,
        otp_app: :antecedentes_api,
        adapter: Nebulex.Adapters.Local
    end

The errors: […]

and […]
@cabol your changes + the changes in child spec have ameliorated this issue. Where would you like the child spec order documentation to live? I'm happy to make that change.
I think I spoke too soon: […]
This happened under the circumstances I described above.
@escobera I think your case is different: first, because you are using the local adapter directly (not the replicated one), and if you are getting that error, it is because the cache has not been started yet, or has been stopped; not sure, but in that case the error is justified. Maybe you are starting a process that uses the cache before starting the cache itself, as @hazardfn described above (start the cache first in your app supervision tree)? If you can provide more details about how to reproduce the error, that would be great; also, this one would be better as a separate issue. Thanks, stay tuned!
@SophisticaSean are you sure you are using the latest master branch? Because if I go to this line […]
@cabol correct, I'm on my nebulex telemetry fork, so here is line 439: https://github.com/SophisticaSean/nebulex/blob/sophisticasean/support-before-after-blocks-in-decorator/lib/nebulex/adapters/replicated.ex#L439 but it is up to date with master here.
@SophisticaSean ok, but just to check, can you try to reproduce the error without your changes, just using the master branch? And please share the error and stacktrace. Thanks!
@SophisticaSean any update, have you been able to test with the master branch (not with your fork)?
Since I couldn't reproduce the errors again with the latest fixes and improvements, and there have been no new reports or feedback about it, I will close the issue, but feel free to reopen it if you come across the same error again.
This happens if we do a deploy or something else (like cron jobs) that spins up and spins down a node. So we'll have 3 nodes normally, and a cron job node pops up every now and then. If we need to persist something to the cache while that extra node is shutting down, then this error happens. The put succeeds on the other two healthy running nodes. Instead it should log a warning, since I'm very OK with not persisting to this shutting-down node.

Alternatively, we should get nebulex to handle control signals more cleanly. The errors only happen after we send the shutdown signal to the application. So I wonder if we could tell nebulex to reject the write, or just log a warning, when the app is shutting down and it receives a request to persist to the no-longer-running cache.
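Until something like that exists inside nebulex, a minimal sketch of the same behaviour at the call site would be to rescue the Nebulex.RegistryLookupError from the stacktrace above and downgrade it to a warning:

```elixir
defmodule SafeCachePut do
  require Logger

  # Treat a write to an already-stopped cache as a no-op with a warning,
  # instead of letting the lookup error crash the caller during shutdown.
  def put(cache, key, value, opts \\ []) do
    cache.put(key, value, opts)
  rescue
    error in Nebulex.RegistryLookupError ->
      Logger.warn("#{inspect(cache)} unavailable (shutting down?): #{Exception.message(error)}")
      :ok
  end
end
```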