Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GenServer get restarted twice after node restarts #224

Open
mszmurlo opened this issue Mar 7, 2021 · 0 comments
Open

GenServer get restarted twice after node restarts #224

mszmurlo opened this issue Mar 7, 2021 · 0 comments

Comments

@mszmurlo
Copy link

mszmurlo commented Mar 7, 2021

I recently ran into the following issue. A GenServer (called CA in the rest of the post) gets restarted two times by the
Horde.DynamicSupervisor when nodes had been removed and re-included few times in the cluster.
The source code demonstrating the issue is available on Gitlab.

I'm using Elixir 1.10.1 and Horde 0.8.3. When I downgraded horde to 0.7.1 I couldn't reproduce the issue.

Demonstration

To demonstrate the issue:

  1. Start the application
  2. Create a GenServer (module POC.CA in my code) with the HTTP interface, i.e. that is execute the POCWeb.Login.exec() controller
  3. Kill (with two times Control-c) the node on which the CA is running, restart it, re-do the operation and observe the logs

The two first steps are as follows:

  • Start node 1 in one terminal : $ HTTP_PORT=5001 ERL_AFLAGS="-name poc1@127.0.0.1 -setcookie abc" iex -S mix phx.server
  • Start node 2 in another terminal : $ HTTP_PORT=5002 ERL_AFLAGS="-name poc2@127.0.0.1 -setcookie abc" iex -S mix phx.server
  • Send the following request $ http -v http://localhost:5001/api cmd=login from a third terminal.

In the process above, let's assume that while the HTTP requests hit the node N1, the CA is created on N2. This gives us the following traces:

Terminal 1
    ... 
(1) Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(2) iex(poc1@127.0.0.1)1> [debug] NodeListener.handle_info(:nodeup)
    [info] [libcluster:example] connected to :"poc2@127.0.0.1"
    [debug] Cluster members: [{POC.DReg, :"poc1@127.0.0.1"}, {POC.DReg, :"poc2@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc1@127.0.0.1"}, {POC.DSup, :"poc2@127.0.0.1"}]
(3) [info] POST /api
    [debug] Processing with POCWeb.Login.exec/2
      Parameters: %{"cmd" => "login"}
      Pipelines: [:api]
(4) [debug] Login.exec()
(5) [debug] Login.exec(): start_child->res={:ok, #PID<21019.478.0>}
    [info] Sent 200 in 99ms
  1. The BEAM is booting
  2. Detection of the second node joining the cluster
  3. Reception of the HTTP request
  4. Start of the login controller. The controller calls Horde.DynamicSupervisor.start_child(POC.DSup, {POC.CA, {uaid, caid}}) which will call CA.start_link()
  5. End of the login controller
Terminal 2
    ...
(1) Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(2) iex(poc2@127.0.0.1)1> [debug] NodeListener.handle_info(:nodeup)
    [debug] Cluster members: [{POC.DReg, :"poc2@127.0.0.1"}, {POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc2@127.0.0.1"}, {POC.DSup, :"poc1@127.0.0.1"}]
(3) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
(4) [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
(5) [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.478.0>}
    [debug] CA.start_link(): process started
  1. The BEAM is booting
  2. Detection of the second node joining the cluster
  3. CA.start_link called by the login controller. Calls GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid))
  4. CA.init() called by GenServer.start_link() at step 3.
  5. Back in CA.start_link(). The process was created on N2

Killing and restarting the nodes

From now on, we will

  1. Kill the node on which the CA is running
  2. Restart that node
  3. Observe the logs
  4. Go to step 1.

Kill N2 (as the CA is running on N2)

Terminal 1
(1) [debug] NodeListener.handle_info(:nodedown)
    [debug] Cluster members: [{POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc1@127.0.0.1"}]
(2) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.494.0>}
    [debug] CA.start_link(): process started
  1. Detection of N2 failure. Cluster reorganization
  2. Restart of the CA.

Restart of node 2

Terminal 2
(1) $ HTTP_PORT=5002 ERL_AFLAGS="-name poc2@127.0.0.1 -setcookie abc" iex -S mix phx.server
(2) Erlang/OTP 22 [erts-10.6.4] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1]
    ...
    Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(3)    [debug] NodeListener.handle_info(:nodeup)
    iex(poc2@127.0.0.1)1> [debug] Cluster members: [{POC.DReg, :"poc2@127.0.0.1"}, {POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc2@127.0.0.1"}, {POC.DSup, :"poc1@127.0.0.1"}]
  1. Starting node N2
  2. BEAM boots and the application starts
  3. Cluster reorganization. Same actually happens on node N1

Kill N1 (as the CA had been restarted on N1)

Terminal 2
(1) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.484.0>}
    [debug] CA.start_link(): process started
  1. Process for the CA is restarted on node N2

Restart of node 1

The cluster gets reorganized as before

Kill N2

This where the strange thing will happen. Look at step 3.

Terminal 1
(1) [debug] NodeListener.handle_info(:nodedown)
    [debug] Cluster members: [{POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc1@127.0.0.1"}]
(2) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.482.0>}
    [debug] CA.start_link(): process started
(3) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:error, {:already_started, #PID<0.482.0>}}
    [debug] CA.start_link(): error: {:error, {:already_started, #PID<0.482.0>}}. process not started
  1. Detection of the failure of node N2. Cluster re-organization
  2. The CA gets restarted. This is normal. The following 3 lines show that the process is restarts successfully
  3. CA.start_link() get called a second time. As the process had already been started by step 2. the error :already_started is returned
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant