Partitioned Cache + stats + multiple nodes causes failure #77
The partitioned cache doesn't define a module for the local adapter, so when it tries to RPC to another node to call a function on that local adapter, it fails. How is it supposed to work if it's not defining a module?
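For illustration, the failure mode described above looks roughly like this (the module name and error shape are hypothetical, not taken from the actual report):

```elixir
# Hypothetical illustration: the calling node issues an RPC naming a
# local-cache module that only exists (or only makes sense) on the calling
# node, so the remote call fails.
result = :rpc.call(:"node2@host", MyApp.Cache.Primary, :get, [:some_key])
# If MyApp.Cache.Primary is not defined on node2, this returns something like:
# {:badrpc, {:EXIT, {:undef, [{MyApp.Cache.Primary, :get, [:some_key], []}]}}}
```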
It looks like the real problem has to do with the local caches on each node trying to use the stats counter.
I get an argument error whenever it uses the RPC.
One of the changes was precisely to not depend on declaring a local cache explicitly; instead, the partitioned adapter handles it. Now, I'm confused about your issue and I think I'll need more details. Does it happen only when you set `stats: true`?
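For reference, a minimal sketch of declaring a partitioned cache without an explicit local cache (the module name `MyApp.PartitionedCache` is illustrative):

```elixir
defmodule MyApp.PartitionedCache do
  # The partitioned adapter creates and manages the primary (local) cache
  # internally; no separate local cache module needs to be declared.
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Partitioned
end
```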
Totally :) I was in a bit of a rush yesterday, but I'll get the issue reproduced in a minimal app and let you know. Specifically, this happens when I enable `stats: true`.
Ok, got it, that info is helpful. I'll try to reproduce with that (a partitioned cache with `stats: true` enabled).
I found the issue. I need to find another way of handling the counters in a distributed fashion, because when the RPC is executed, the command on the remote node is performed using the counter reference from the calling node. Even if that is solved, the stats aggregation across all cluster nodes still needs to happen; otherwise, by the time the stats are consulted, the result will contain only the values from that node. It may take me some time to sort it out, but I'll try to push a fix as soon as I can, thanks!
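A minimal sketch of the underlying problem, assuming the stats are kept in an Erlang `:counters` array (suggested by the "counter reference" mention above): counters references are node-local resources, so a reference created on one node cannot be used from another.

```elixir
# On the node that created it, the counter works as expected.
counter = :counters.new(1, [:write_concurrency])
:counters.add(counter, 1, 1)
:counters.get(counter, 1)   #=> 1

# Shipping the reference to another node and operating on it there fails,
# because counters references are local to the node that created them:
:rpc.call(:"node2@host", :counters, :add, [counter, 1, 1])
#=> {:badrpc, {:EXIT, {:badarg, ...}}}
```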
@cabol one thing you might consider: I'm not so sure you want stats to aggregate across nodes. We report our metrics to Datadog, for example, and it understands the hosts that our metrics are coming from. If every node sent the sum total of all nodes, it would be extra work to disambiguate that. Perhaps it would be possible to just enable stats for the primary local cache that the distributed cache creates automatically? That said, I can see reasons you might want stats to aggregate across the cluster, so I'm thinking of making it optional with something like the sketch below:
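(A sketch of what that configuration could look like; the exact option names and the `primary:` nesting are assumptions for illustration, not the library's confirmed API.)

```elixir
# Hypothetical config: enable stats on the distributed cache itself,
# on the primary local cache it creates, both, or neither.
config :my_app, MyApp.PartitionedCache,
  stats: true,          # cluster-level stats (assumed option)
  primary: [
    stats: true         # stats for the auto-created local cache (assumed)
  ]
```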
With this you could have both, neither, or just one enabled.
You're right, I will fix it to enable stats for the primary local cache that the distributed cache creates automatically (like you mention). Besides, getting the aggregated stats should be easy; there is a function to retrieve the cluster nodes, so the stats can be gathered from each node and merged.
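A minimal sketch of such an aggregation, assuming the cache exposes a `nodes/0` function and a `stats/0` function returning a map of counters (both names are assumptions for illustration; error handling is omitted for brevity):

```elixir
defmodule MyApp.AggregatedStats do
  # Collect stats from every node in the cluster and sum the counters.
  # MyApp.Cache.nodes/0 and MyApp.Cache.stats/0 are assumed names.
  def fetch do
    MyApp.Cache.nodes()
    |> Enum.map(&:rpc.call(&1, MyApp.Cache, :stats, []))
    |> Enum.reduce(%{}, fn node_stats, acc ->
      Map.merge(acc, node_stats, fn _key, a, b -> a + b end)
    end)
  end
end
```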
Experienced the same issue on ReplicatedCache. Appreciate the hard work. 👍
Yeah, I've been working on some fixes and improvements that should solve this issue. I'll push them soon!
- Refactor replicated, partitioned, and multilevel adapters to call a cache instead of an adapter directly - This should fix issue #77
I've pushed some fixes and improvements to master which I think should solve this issue. However, I had to do some refactoring of the adapters, so there are some minor changes in the config; I highly recommend taking a look at the adapters you're using. Please try it out and let me know if the problem persists. Stay tuned, thanks!
Thanks, will give it a shot and report what we find!
Thanks @cabol, I can confirm that it is working for a replicated cache. Looking at the Phoenix LiveDashboard output on 2 different nodes, I can see that the hits and misses differ per node, but that the writes are the same. This makes sense, as the cache is replicated. The cache size is a bit smaller than the number of writes, and I assume this is because of some concurrent caching/race conditions that might mean some writes happened on both primaries. (This was before any evictions/expirations.)
Glad to hear it is working now 😄
As you mentioned, since it is a replicated cache, the writes are expected to be the same across nodes, because every write is applied to all of them.
Yeah, agreed, it would be useful too. Let me take a look at it and see if there is something not too complicated we can do; otherwise, I can create a separate ticket and tag it as an improvement. Thanks 👍
Yep, I'm using telemetry and StatsD across 2 nodes and then looking at it in Datadog - it is easy to aggregate or distinguish in Datadog based on host.
Good point, yes, let me check it out, because when the GC runs or a new generation is created, it should be applied on all the nodes. Perhaps it is related to that; if so, that would require some sort of sync in the cluster. I'll check!
I think we can close this issue; feel free to re-open it if the problem persists! BTW, I also added an example of how to add the node to the event so it can be used in the metric tags: https://github.com/cabol/nebulex/blob/master/guides/telemetry.md#adding-other-custom-metrics. P.S.: @franc, may you create a separate ticket for that improvement?
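For illustration, tagging metrics with the emitting node might look roughly like the sketch below; the event and measurement names (`my_app.cache.stats.*`) are assumptions, so see the linked guide for the actual names.

```elixir
defmodule MyApp.Telemetry do
  import Telemetry.Metrics

  # Tag each cache metric with the node that emitted it, so per-host
  # values stay distinguishable in Datadog/StatsD.
  def metrics do
    [
      last_value("my_app.cache.stats.hits",
        tags: [:node],
        tag_values: &Map.put(&1, :node, node())
      ),
      last_value("my_app.cache.stats.misses",
        tags: [:node],
        tag_values: &Map.put(&1, :node, node())
      )
    ]
  end
end
```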