Mutations are not being triggered properly in clusters consisting of nodes with different number of CPUs each, if default :pool_size settings are used #1342

Closed
hubertlepicki opened this issue Oct 1, 2024 · 4 comments

Comments

@hubertlepicki
Contributor

hubertlepicki commented Oct 1, 2024

I don't actually know whether this should be considered a bug in Absinthe or something that merely needs to be documented - maybe this issue will suffice for that purpose. I think it would be nice if this did not happen, but I'm not sure how to prevent it.

In short, I have an Elixir Phoenix application deployed on a Kubernetes GKE Autopilot cluster. I noticed that mutations did not always trigger subscriptions reliably; it depended on which pod in the cluster received the web request handling the mutation.

On some pods, the mutation would execute just fine and the subscription was triggered and propagated to JavaScript clients 100% reliably. On other pods, the mutation happened and succeeded, but no message was broadcast to clients to indicate that a subscription had been triggered at all.

I identified what is different about the pods that do not work: they had a lower number of CPU cores assigned to them by the Autopilot cluster. The issue would also fix itself once all pods received a fair amount of traffic - the Kubernetes cluster would scale them all up to the maximum number of CPUs and the issue was gone. This made it quite difficult to debug.

I think the code over here decides on a topic name, and one of the parameters taken into consideration is :pool_size:

def publish_remote(pubsub, mutation_result, subscribed_fields) do
  {:ok, pool_size} =
    pubsub
    |> registry_name
    |> Registry.meta(:pool_size)

  shard = :erlang.phash2(mutation_result, pool_size)

  proxy_topic = Subscription.Proxy.topic(shard)

  :ok = pubsub.publish_mutation(proxy_topic, mutation_result, subscribed_fields)
end

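To illustrate the effect of :pool_size on the chosen topic, here is a minimal sketch (the result map is a made-up stand-in for a real mutation result; the exact shard numbers will vary, but the point is that the two pool sizes generally pick different shards):

result = %{data: %{"somePayload" => %{"id" => "123"}}}

# 4-CPU pod: pool_size defaults to System.schedulers() * 2 = 8
shard_on_big_pod = :erlang.phash2(result, 8)

# 2-CPU pod: pool_size defaults to 4
shard_on_small_pod = :erlang.phash2(result, 4)

# Same mutation result, different shard ranges, hence (usually) different topics,
# e.g. "__absinthe__:proxy:6" vs "__absinthe__:proxy:2" in my case:
IO.puts("__absinthe__:proxy:#{shard_on_big_pod}")
IO.puts("__absinthe__:proxy:#{shard_on_small_pod}")
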
Now, the :pool_size defaults to System.schedulers() * 2:

* `:pool_size` - (Optional - default `System.schedulers() * 2`) An integer
  specifying the number of `Absinthe.Subscription.Proxy` processes to start.

My cluster consists of pods that have 2 CPU cores and pods that have 4 CPU cores. The generated topic names for my mutation were "__absinthe__:proxy:6" vs "__absinthe__:proxy:2", depending on which pod the mutation arrived at from the load balancer.

I believe the problem is that the number of shards/proxies started on each pod in the cluster depends on the number of CPUs: the code over here starts a different number of proxies on each pod, and some topics are simply not being listened on:

proxies =
  for shard <- 0..(pool_size - 1) do
    # ... one Absinthe.Subscription.Proxy child per shard

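To make the mismatch concrete, here is roughly which topics each pod ends up listening on (the topic format is taken from the names I observed above; this is an illustration, not Absinthe's actual startup code):

# 2-CPU pod: pool_size = 4, so proxies exist only for shards 0..3
for shard <- 0..3, do: "__absinthe__:proxy:#{shard}"
#=> ["__absinthe__:proxy:0", "__absinthe__:proxy:1",
#    "__absinthe__:proxy:2", "__absinthe__:proxy:3"]

# A 4-CPU pod uses pool_size = 8 and may publish to "__absinthe__:proxy:6",
# a topic no proxy on the 2-CPU pod is subscribed to, so the broadcast from
# that shard never reaches subscribers served by the smaller pod.
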
The fix was specifying a fixed :pool_size, so instead of:

{Absinthe.Subscription, MyApp.Endpoint}

I did

{Absinthe.Subscription, pubsub: MyApp.Endpoint, pool_size: 8}

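For reference, that child spec lives in the application's supervision tree; a minimal sketch with placeholder module names:

# lib/my_app/application.ex (MyApp.* names are placeholders)
children = [
  MyApp.Repo,
  MyApp.Endpoint,
  # fixed pool size, so every node starts (and listens on) the same set of proxy topics
  {Absinthe.Subscription, pubsub: MyApp.Endpoint, pool_size: 8}
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
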
Again, I don't know what can be done here, but maybe we should think of a mechanism to warn about this (if we can even detect it?).

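One rough idea for such a warning, just as a sketch (it assumes connected Erlang distribution and that every node derives its pool size from the default System.schedulers() * 2; this is not something Absinthe does today):

# Compare the default-derived pool size on every connected node.
local_pool_size = System.schedulers() * 2

remote_pool_sizes =
  for node <- Node.list() do
    :erpc.call(node, System, :schedulers, []) * 2
  end

if Enum.any?(remote_pool_sizes, &(&1 != local_pool_size)) do
  IO.warn("Absinthe.Subscription pool sizes differ across nodes; set an explicit :pool_size")
end
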
@hubertlepicki hubertlepicki changed the title Mutations are not being triggered properly in clusters consisting of nodes with different number of CPUs each with default settings Mutations are not being triggered properly in clusters consisting of nodes with different number of CPUs each, if default settings are used Oct 1, 2024
@hubertlepicki hubertlepicki changed the title Mutations are not being triggered properly in clusters consisting of nodes with different number of CPUs each, if default settings are used Mutations are not being triggered properly in clusters consisting of nodes with different number of CPUs each, if default :pool_size settings are used Oct 1, 2024
@benwilson512
Contributor

@hubertlepicki this is a really interesting problem. I think you are 100% right that if you have different-sized pods you'll get different topic counts, and this will just result in missed messages between nodes.

Specifying a fixed count is definitely one solution. M:N topic mapping definitely seems like a problem someone has solved before but nothing jumps immediately to mind.

@hubertlepicki
Contributor Author

@benwilson512 since I wrote the above, I have also confirmed this is happening on Gigalixir. I guess they run a similar setup on a Kubernetes cluster. One of my pods on Gigalixir reports 32 from System.schedulers(), and the other one reports 16.

Maybe we should add a note to the setup instructions saying that if you're doing a cloud deployment and can't control the number of CPUs, you should specify a fixed :pool_size?

hubertlepicki added a commit to hubertlepicki/absinthe that referenced this issue Oct 2, 2024
hubertlepicki added a commit to hubertlepicki/absinthe that referenced this issue Oct 2, 2024
hubertlepicki added a commit to hubertlepicki/absinthe that referenced this issue Oct 2, 2024
hubertlepicki added a commit to hubertlepicki/absinthe that referenced this issue Oct 2, 2024
@hubertlepicki
Contributor Author

@benwilson512 I have provided the documentation changes here: https://github.com/absinthe-graphql/absinthe/pull/1343/files - please merge them if you feel this is beneficial to document.

benwilson512 added a commit that referenced this issue Oct 2, 2024
…-pool-size-documentation

[#1342] Add note on the need of fixed :pool_size to guide
@hubertlepicki
Contributor Author

@benwilson512 I think this can be closed now, which I'm going to do.
