Nodes that are gracefully shutting down trigger Nebulex errors #113
The cache configuration is this:

    defmodule UsersCache do
      use Nebulex.Cache,
        otp_app: :test_app,
        adapter: Nebulex.Adapters.Replicated,
        primary_storage_adapter: Nebulex.Adapters.Local
    end
I wonder if I should trap the exit signal and leave the cluster in each cache? Or maybe we should do that automatically in each cache.
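A minimal sketch of that trap-the-exit idea, done outside Nebulex, could look like the following: a small process placed after the cache in the supervision tree, so its terminate/2 runs before the cache stops on shutdown. The on_shutdown callback is a hypothetical placeholder for whatever "leave the cluster" means in your setup; it is not a Nebulex API.

```elixir
defmodule MyApp.CacheShutdownGuard do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # Trapping exits ensures terminate/2 is invoked when the supervisor shuts us down.
    Process.flag(:trap_exit, true)
    {:ok, Keyword.get(opts, :on_shutdown, fn -> :ok end)}
  end

  @impl true
  def terminate(_reason, on_shutdown) do
    # Run the caller-supplied cleanup (e.g. leaving the cache cluster) before the node goes away.
    on_shutdown.()
  end
end
```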
Hi! Could you paste the whole stack trace? Since you are using the Replicated adapter, I'd like to see where this exception is being generated from; I think it should be during the bootstrap of the replicated cache, but I'd like to confirm. Thanks!!!
@SophisticaSean any update? If you could provide the whole stack trace, that would be great, especially to see where in the replicated adapter the exception originates. Stay tuned!
We also have a similar issue; it's a little different in that the error we receive is […]. I think the core problem for me is that I don't see anywhere in the docs that a […]. Taking the […], I'd never know looking at that that this method could raise an exception 😢

The issue seems to be that for replicated transactions a single error from any node is raised: nebulex/lib/nebulex/adapters/replicated.ex, line 498 (commit 7dcdd30).

In scenarios where nodes are taken down as part of rolling upgrades etc., I would have expected this kind of problem to be more common, or maybe we're missing something in our setup 🤔
Definitely, I need to check, because one of the challenges with OTP 23 and […]
Usually yes, but in Nebulex (as in Ecto, for example), if there is an internal issue at runtime with the backend, you get an exception. For example, if in Ecto you do […]
Yeah, I guess you have a point there! I just don't think I researched the replicated cache well enough before using it, and that is my fault; I wasn't aware of its atomic nature. This kind of atomic cache update may be useful for some people and scenarios, it just doesn't fit our use case very well.

It might be worth adding a flag that, when set to true, effectively ignores failures to update the cache on other nodes, for people who don't need the guarantee that the caches are in sync. It could even cast the update to the other nodes when true: if it makes it, great; if not, no big deal. That would minimise wait times on puts etc. There could perhaps be some kind of background task that syncs them up at a configured interval, if that's something people would want.

EDIT: Looking at the code, perhaps it's not truly atomic, in that the local cache plus some nodes probably receive the update, but it is kind of all-or-nothing in terms of erroring out. Perhaps the solution here is that you can choose to make it atomic or not: in atomic mode, no cache update is made and an error is raised if all caches couldn't be updated. Right now it seems to be a kind of hybrid, in that some caches will be updated but an error will also be thrown.
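A rough sketch of that fire-and-forget idea, approximated from outside Nebulex rather than as an adapter flag: the write happens in a background task, so a dead or shutting-down replica can neither crash nor block the caller. UsersCache is the replicated cache from this thread; rescuing every error is an assumption you would likely narrow to specific exceptions.

```elixir
defmodule BestEffortCache do
  require Logger

  # Write in a background task so a failing replica cannot crash or block the caller.
  def put_async(key, value, opts \\ []) do
    Task.start(fn ->
      try do
        UsersCache.put(key, value, opts)
      rescue
        error ->
          Logger.warn("best-effort put of #{inspect(key)} failed: #{Exception.message(error)}")
      end
    end)

    :ok
  end
end
```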
Indeed, depending on your needs, you pick one cache topology or another. The replicated topology/adapter is good if most of the operations you perform are reads, because reads are local, as if you were hitting a local cache; for writes, there is always a replication/sync process underneath, which may be expensive depending on the volume of writes. The partitioned topology, on the other hand, is great if you don't care about losing part of the data when a node goes down, which actually shouldn't matter because it is a cache, so you should always provide a mechanism to load/cache the data back again. The replicated adapter is very challenging because the idea is to keep strong consistency across the cluster; otherwise, it wouldn't make sense. In the near future, perhaps the next minor release, my plan is to include Mnesia as a backend too, and for the replicated adapter let Mnesia handle the replication and the cluster stuff, which will be much better. In the meantime, I will fix these scenarios.
Yes, that is also a good idea, I'll check it out; actually, I'm working on this now! The plan is to fix this issue in the next two days. And in the next release, again, I will introduce Mnesia to have a better implementation of the Replicated adapter.
Hey! I've pushed some fixes to the master branch to better handle the errors when nodes are removed (or added). I tested them already and they seem to work, but it would be great if you could also test the changes and give me your feedback (whether the fix works as expected or not). Thanks! Stay tuned!
I'm testing the new changes in production today. Will report back. This was happening every day under the proper circumstances, so I should be able to replicate it easily.
OK, I've replicated the error again; it's slightly different but mostly the same.

Before the changes in this issue:

    ** (RuntimeError) could not lookup Nebulex cache testapp.Caches.RewardPointsCache because it was not started or it does not exist
        (nebulex 2.0.7) lib/nebulex/cache/registry.ex:23: Nebulex.Cache.Registry.lookup/1
        (nebulex 2.0.7) lib/nebulex/adapter.ex:38: Nebulex.Adapter.with_meta/2
        (testapp 0.1.0) lib/testapp/reward.ex:1: testapp.Reward.do_get_reward_points/1
        (testapp 0.1.0) lib/testapp/reward.ex:44: testapp.Reward.get_reward_points/1
        (testapp 0.1.0) lib/testapp/reward/event.ex:94: anonymous fn/1 in testapp.Reward.Event.process_spend_transaction/1
        (ecto_sql 3.5.4) lib/ecto/adapters/sql.ex:1027: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
        (db_connection 2.3.1) lib/db_connection.ex:1444: DBConnection.run_transaction/4
        (testapp 0.1.0) lib/testapp/reward/storage.ex:19: testapp.Reward.Storage.transaction/1

After:

    ** (Nebulex.RegistryLookupError) could not lookup Nebulex cache testapp.Caches.RewardPointsCache because it was not started or it does not exist
        (nebulex 2.0.8) lib/nebulex/cache/registry.ex:22: Nebulex.Cache.Registry.lookup/1
        (nebulex 2.0.8) lib/nebulex/adapter.ex:38: Nebulex.Adapter.with_meta/2
        (testapp 0.1.0) lib/testapp/reward.ex:1: testapp.Reward.do_get_reward_points/1
        (testapp 0.1.0) lib/testapp/reward.ex:44: testapp.Reward.get_reward_points/1
        (testapp 0.1.0) lib/testapp/reward.ex:77: testapp.Reward.insert_reward_transaction/2
        (testapp 0.1.0) lib/testapp/reward.ex:181: testapp.Reward.process_earn_transaction/1
        (testapp 0.1.0) lib/testapp/reward/consumers/earn.ex:28: testapp.Reward.Consumer.Earn.handle_message/3
        (broadway 0.6.2) lib/broadway/topology/processor_stage.ex:158: Broadway.Topology.ProcessorStage.handle_messages/4

The nebulex version numbers are the ones we use in our private hex repo.
I'm glad it's no longer a runtime error, but for me these are completely expected when we shut down a node and shouldn't throw an error. How do you recommend I go about dealing with this error? I'm wondering if we need to trap exit signals in nebulex and gracefully leave the cluster on […]
@SophisticaSean could you describe the scenario? Because, with the fix, when nodes leave the cluster, the error from remote stopped caches is skipped. I'm wondering how you are reproducing the error, so I can reproduce it too. On the other hand, the exception is OK if you attempt to hit a cache that is stopped; anyway, if you can tell me how you are reproducing this error, that would be great.
@SophisticaSean for me the patch worked; at least for now I haven't seen the errors resurfacing. Looking at the type of error, I'd hazard a guess your application child_spec order may not be optimal? If you have processes above the cache in your child_spec list that use the cache, you may see errors like that. This is because on teardown the supervisor stops children in reverse order of the original list, meaning your cache may stop before the processes using it do. If it's not that, then perhaps you're hitting an edge case we aren't 😞
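To make the ordering point concrete, a child_spec list following that advice might look like this (module names other than UsersCache are placeholders for your own processes):

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # The cache starts first, so on shutdown it is the last child to stop.
      UsersCache,
      # Anything that uses the cache comes after it, so it terminates before the cache does.
      MyApp.SomeWorker,
      MyApp.SomeConsumer
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```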
Thanks @hazardfn, my caches are definitely last on the totem pole, so I'll move them up to the beginning of my child_spec list. I'll test that again today. @cabol, the scenario is: […]
@SophisticaSean great to hear that!
Yes, I added a log trace, at least to report what is happening; not sure if it is useful though, but any feedback is welcome, and I wouldn't mind removing the log trace if it does not add any value. @hazardfn thanks a lot for your help too! (BTW, I think it would be good to add a tip about the cache and the app supervision tree somewhere; it is not obvious.)
Yeah, if it's not something we can completely eliminate, it's fine. I'd just prefer it to be a […]
@cabol I'm having the same error on the local adapter as well. It looks like it is trying to look up an :ets reference that doesn't exist anymore, because I get both the good-looking Nebulex error message and a :badarg Erlang message. It happens when I take down the application before deploying a new version.

My cache:

    defmodule AntecedentesApi.Caches.ProcessosApi do
      use Nebulex.Cache,
        otp_app: :antecedentes_api,
        adapter: Nebulex.Adapters.Local
    end

The errors: […]

and […]
@cabol your changes + the changes in child spec have ameliorated this issue. Where would you like the child spec order documentation to live? I'm happy to make that change.
I think I spoke too soon: […]
This happened under the circumstances I described above.
@escobera I think your case is different: first, because you are using the local adapter directly (not the replicated one), and if you are getting that error, it is because the cache has not been started yet, or has been stopped; not sure, but in that case the error is justified. Maybe you are starting a process that uses the cache before starting the cache itself, as @hazardfn described above (start the cache first in your app supervision tree)? If you can provide more details about how to reproduce the error, that would be great; also, this one would be better as a separate issue. Thanks, stay tuned!
@SophisticaSean are you sure you are using the latest master branch? Because if I go to this line […]
@cabol correct, I'm on my nebulex telemetry fork, so here is line 439: https://github.com/SophisticaSean/nebulex/blob/sophisticasean/support-before-after-blocks-in-decorator/lib/nebulex/adapters/replicated.ex#L439 but it is up to date with master here.
@SophisticaSean ok, but just to check, can you try to reproduce the error without your changes, just using the master branch? And please share the error and stacktrace. Thanks!
@SophisticaSean any update, have you been able to test with the master branch (not with your fork)?
Since I couldn't reproduce the errors again with the latest fixes and improvements, and there have been no new reports or feedback about it, I will close the issue, but feel free to reopen it if you come across the same error again.
This happens if we do a deploy or something else (like cron jobs) that spins up and spins down a node. So we'll have 3 nodes normally, and a cron job node pops up every now and then. If we need to persist something to the cache while that extra node is shutting down, then this error happens. The put succeeds on the other two healthy running nodes. Instead it should log a warning, since I'm very OK with not persisting to this shutting-down node.

Alternatively, we should get nebulex to handle control signals more cleanly. The errors only happen after we send the shutdown signal to the application. So I wonder if we could tell nebulex to reject the write, or just log a warning, when the app is shutting down and it receives a request to persist to the no-longer-running cache.
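Until something like that exists inside nebulex, a minimal sketch of the same behaviour at the call site would be to rescue the Nebulex.RegistryLookupError from the stacktrace above and downgrade it to a warning:

```elixir
defmodule SafeCachePut do
  require Logger

  # Treat a write to an already-stopped cache as a no-op with a warning,
  # instead of letting the lookup error crash the caller during shutdown.
  def put(cache, key, value, opts \\ []) do
    cache.put(key, value, opts)
  rescue
    error in Nebulex.RegistryLookupError ->
      Logger.warn("#{inspect(cache)} unavailable (shutting down?): #{Exception.message(error)}")
      :ok
  end
end
```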