Partitioned Cache + stats + multiple nodes causes failure #77
The partitioned cache doesn't define a module for the local adapter, so when it tries to RPC to another node to call a function on that local adapter, it fails. How is it supposed to work if it's not defining a module?
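For illustration, the failure mode described above looks roughly like this (the module name and error shape are hypothetical, not taken from the actual report):

```elixir
# Hypothetical illustration: the calling node issues an RPC naming a
# local-cache module that only exists (or only makes sense) on the calling
# node, so the remote call fails.
result = :rpc.call(:"node2@host", MyApp.Cache.Primary, :get, [:some_key])
# If MyApp.Cache.Primary is not defined on node2, this returns something like:
# {:badrpc, {:EXIT, {:undef, [{MyApp.Cache.Primary, :get, [:some_key], []}]}}}
```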
It looks like the real problem has to do with the local caches on each node trying to use the stats counter.
I get an argument error whenever it uses the RPC.
One of the changes was precisely to not depend on declaring a local cache explicitly; instead, the partitioned adapter handles it. Now, I'm confused about your issue and I think I'll need more details. Does it happen only when you set `stats: true`?
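For reference, a minimal sketch of declaring a partitioned cache without an explicit local cache (the module name `MyApp.PartitionedCache` is illustrative):

```elixir
defmodule MyApp.PartitionedCache do
  # The partitioned adapter creates and manages the primary (local) cache
  # internally; no separate local cache module needs to be declared.
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Partitioned
end
```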
Totally :) I was in a bit of a rush yesterday, but I'll get the issue reproduced in a minimal app and let you know. Specifically, this happens when I enable `stats: true`.
Ok, got it, that info is helpful. I'll try to reproduce with that (a partitioned cache with `stats: true` enabled).
I found the issue. I need to find another way of handling the counters in a distributed fashion, because when the RPC is executed, the command on the remote node is performed using the counter reference from the calling node. Even if that is solved, the stats aggregation across all cluster nodes still needs to happen; otherwise, by the time the stats are consulted, the result will contain only the values from that node. It may take me some time to sort it out, but I'll try to push a fix as soon as I can, thanks!
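A minimal sketch of the underlying problem, assuming the stats are kept in an Erlang `:counters` array (suggested by the "counter reference" mention above): counters references are node-local resources, so a reference created on one node cannot be used from another.

```elixir
# On the node that created it, the counter works as expected.
counter = :counters.new(1, [:write_concurrency])
:counters.add(counter, 1, 1)
:counters.get(counter, 1)   #=> 1

# Shipping the reference to another node and operating on it there fails,
# because counters references are local to the node that created them:
:rpc.call(:"node2@host", :counters, :add, [counter, 1, 1])
#=> {:badrpc, {:EXIT, {:badarg, ...}}}
```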
@cabol one thing you might consider: I'm not so sure you want stats to aggregate across nodes. We report our metrics to Datadog, for example, and it understands the hosts that our metrics are coming from. If every node sent the sum total of all nodes, it would be extra work to disambiguate that. Perhaps it would be possible to just enable stats for the primary local cache that the distributed cache creates automatically? That said, I can see reasons you might want stats to aggregate across the cluster, so I'm thinking of making it optional with something like the sketch below:
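(A sketch of what that configuration could look like; the exact option names and the `primary:` nesting are assumptions for illustration, not the library's confirmed API.)

```elixir
# Hypothetical config: enable stats on the distributed cache itself,
# on the primary local cache it creates, both, or neither.
config :my_app, MyApp.PartitionedCache,
  stats: true,          # cluster-level stats (assumed option)
  primary: [
    stats: true         # stats for the auto-created local cache (assumed)
  ]
```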
With this you could have both, neither, or just one enabled.
You're right, I will fix it to enable stats for the primary local cache that the distributed cache creates automatically (like you mention). Besides, getting the aggregated stats should be easy; there is a function to retrieve the cluster nodes, so the stats can be gathered from each node and merged.
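A minimal sketch of such an aggregation, assuming the cache exposes a `nodes/0` function and a `stats/0` function returning a map of counters (both names are assumptions for illustration; error handling is omitted for brevity):

```elixir
defmodule MyApp.AggregatedStats do
  # Collect stats from every node in the cluster and sum the counters.
  # MyApp.Cache.nodes/0 and MyApp.Cache.stats/0 are assumed names.
  def fetch do
    MyApp.Cache.nodes()
    |> Enum.map(&:rpc.call(&1, MyApp.Cache, :stats, []))
    |> Enum.reduce(%{}, fn node_stats, acc ->
      Map.merge(acc, node_stats, fn _key, a, b -> a + b end)
    end)
  end
end
```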
Experienced the same issue on ReplicatedCache. Appreciate the hard work. 👍
Yeah, I've been working on some fixes and improvements that should solve this issue. I'll push them soon!
- Refactor replicated, partitioned, and multilevel adapters to call a cache instead of an adapter directly - This should fix issue #77
I've pushed some fixes and improvements to master which I think should solve this issue. However, I had to do some refactoring of the adapters, so there are some minor changes in the config; I highly recommend taking a look at the adapters you're using. Please try it out and let me know if the problem persists. Stay tuned, thanks!
Thanks, will give it a shot and report what we find!
Thanks @cabol, I can confirm that it is working for a replicated cache. Looking at the Phoenix LiveDashboard output on 2 different nodes, I can see that the hits and misses differ per node, but that the writes are the same. This makes sense, as the cache is replicated. The cache size is a bit smaller than the number of writes, and I assume this is because of some concurrent caching/race conditions that might mean some writes happened on both primaries. (This was before any evictions/expirations.)
Glad to hear it is working now 😄
As you mentioned, since it is a replicated cache, the writes are expected to be the same across nodes, because every write is applied to all of them.
Yeah, agreed, it would be useful too. Let me take a look at it and see if there is something not too complicated we can do; otherwise, I can create a separate ticket and tag it as an improvement. Thanks 👍
Yep, I'm using telemetry and StatsD across 2 nodes and then looking at it in Datadog - it is easy to aggregate or distinguish in Datadog based on host.
Good point, yes, let me check it out, because when the GC runs or a new generation is created, it should be applied on all the nodes. Perhaps it is related to that; if so, that would require some sort of sync in the cluster. I'll check!
I think we can close this issue; feel free to re-open it if the problem persists! BTW, I also added an example of how to add the node to the event so it can be used in the metric tags: https://github.com/cabol/nebulex/blob/master/guides/telemetry.md#adding-other-custom-metrics. P.S.: @franc, may you create a separate ticket for that improvement?
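For illustration, tagging metrics with the emitting node might look roughly like the sketch below; the event and measurement names (`my_app.cache.stats.*`) are assumptions, so see the linked guide for the actual names.

```elixir
defmodule MyApp.Telemetry do
  import Telemetry.Metrics

  # Tag each cache metric with the node that emitted it, so per-host
  # values stay distinguishable in Datadog/StatsD.
  def metrics do
    [
      last_value("my_app.cache.stats.hits",
        tags: [:node],
        tag_values: &Map.put(&1, :node, node())
      ),
      last_value("my_app.cache.stats.misses",
        tags: [:node],
        tag_values: &Map.put(&1, :node, node())
      )
    ]
  end
end
```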