Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client not able to reconnect instantly when a node gets down then up again with multi-node configuration #1714

Open
pgayvallet opened this issue Jun 15, 2022 · 11 comments

Comments

@pgayvallet
Copy link

pgayvallet commented Jun 15, 2022

🐛 Bug Report

When configured with multiple ES nodes, if all nodes get offline (then online again) at the same time, the client takes time before being able to successfully reconnect

To Reproduce

Using a local 2 node ES cluster (es version 8.2 not that is really matters):

es-node-1 config

cluster.name: test-cluster
node.name: node-1
xpack.security.enabled: false
http.port: 9201
cluster.initial_master_nodes: ["node-1", "node-2"]

es-node-2 config

cluster.name: test-cluster
node.name: node-2
xpack.security.enabled: false
http.port: 9202
cluster.initial_master_nodes: ["node-1", "node-2"]

test script

const elasticsearch = require('@elastic/elasticsearch');

const client = new elasticsearch.Client({
  sniffOnStart: false,
  sniffOnConnectionFault: false,
  sniffInterval: false,
  nodes: ['http://localhost:9201', 'http://localhost:9202'],
});

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const ping = async () => {
  try {
    await client.ping();
    console.log('ping successful');
  } catch (e) {
    console.log('error during ping: ', e.message);
  }
};

const chainPing = async () => {
  await ping();
  await delay(1000);
  chainPing();
}

chainPing();

Scenarios

1. stop node 1, then restart node 1

All ok, all calls successful

2. stop node 2, then restart node 2

All ok, all calls successful

3. stop node 1, stop node 2, restart node 1, restart node 2

  • When both nodes are down, we see errors error during ping: There are no living connections (which is expected)
  • Once both nodes get up again, the errors remain for a given period of time for all calls (which should not occur)
  • From my tests, this period seems to be approximatively 6 minutes

4. stop node 1, restart node 1, stop node 2, restart node 2

stopping and restarting node 1 causes no problem, however as soon as the second node is stopped, we observe the same behavior as the previous scenario: the Client does not retry to connect to node1 instantly, and need approx 6minutes before doing so

Expected behavior

When node goes down and up again, the client should be able to reconnect to them as soon as they're up and running again, especially if all nodes are considered down

FWIW, this is working as expected for a single-node configuration. When the client is configured with a single node, when the node is down, we get a slightly different error (connect ECONNREFUSED 127.0.0.1:9201), and the client is able to communicate with ES again as soon as the node is back on.

I'm assuming this is because the Client is implementing an eviction strategy and an eviction period in case of down node when configured with multi-node. I also assume the strategy don't handle the case where all nodes are down at the same time, where it should more actively try to reconnect to at least one node.

Your Environment

  • node version: 16.14.2
  • @elastic/elasticsearch version: reproduced on 8.0.0 and elasticsearch-canary@8.2.0-canary.2
  • os: Mac
@pgayvallet pgayvallet added the bug label Jun 15, 2022
@pgayvallet pgayvallet changed the title Client not able to reconnect instantly if all nodes get down then up at the same time. Client not able to reconnect instantly when a node gets down then up again with multi-node configuration Jun 16, 2022
@pgayvallet
Copy link
Author

Issue reproduced on v8.2.0 and v8.0.0, and not reproduced on 7.17.0. This was likely introduced by the internal changes to the transport layer during 8.0.0

@pgayvallet
Copy link
Author

pgayvallet commented Jun 16, 2022

WeightedConnectionPool seems to be the culprit here.

Looking at the code, ClusterConnectionPool, which is the old (7.x) implementation, has an internal mechanism to try to resurrect down nodes when calling getConnection.

WeightedConnectionPool on the other hand, does not have such mechanism.

I unfortunately don't have enough understanding of the code to understand how the nodes are put back alive when using WeightedConnectionPool, or where this approximative 6mins duration is coming from

EDIT: I also see that the tests for the weighted pool with dead nodes are commented:

https://github.com/elastic/elastic-transport-js/blob/a95f72658be4a77f4c023c5484b64e7bb8ea4421/test/unit/weighted-connection-pool.test.ts#L99-L107

Which may indicate that's there is indeed a problem here. Also worth noting that there's no test asserting the behavior when all nodes are dead.

@pgayvallet
Copy link
Author

Sample test reproducing the issue

  t.test('3 Connections, same weight', t => {
    const pool = new WeightedConnectionPool({ Connection: HttpConnection })
    pool.addConnection([
      'http://localhost:9200/',
      'http://localhost:9201/',
      'http://localhost:9202/'
    ]);

    for (let i = 0; i < 30; i++) {
      const con = pool.getConnection(opts);
      if(con) {
        pool.markDead(con)
      }
    }

    const order: string[] = [];

    try {
      for (let i = 0; i < 100; i++) {
        const connection = pool.getConnection(opts);
        order.push(connection ? connection.id : 'none');
      }
      console.log(order)
      t.pass('Distribution is ok')
    } catch (err: any) {
      t.error(err)
    }

    t.end()
  })

output:

test/unit/weighted-connection-pool.test.ts 1> [
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none', 'none', 'none', 'none', 'none', 'none',
test/unit/weighted-connection-pool.test.ts 1>   'none', 'none'
test/unit/weighted-connection-pool.test.ts 1> ]

This test passes though:

  t.test('3 Connections, same weight', t => {
    const pool = new WeightedConnectionPool({ Connection: HttpConnection })
    pool.addConnection([
      'http://localhost:9200/',
      'http://localhost:9201/',
      'http://localhost:9202/'
    ]);

    for (let i = 0; i < 30; i++) {
      pool.markDead(pool.connections[0]);
      pool.markDead(pool.connections[1]);
      pool.markDead(pool.connections[2]);
    }

    const order: string[] = [];

    try {
      for (let i = 0; i < 100; i++) {
        const connection = pool.getConnection(opts);
        order.push(connection ? connection.id : 'none');
      }
      console.log(order)
      t.pass('Distribution is ok')
    } catch (err: any) {
      t.error(err)
    }

    t.end()
  })

as it outputs

test/unit/weighted-connection-pool.test.ts 1> [
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/', 'http://localhost:9201/', 'http://localhost:9202/',
test/unit/weighted-connection-pool.test.ts 1>   'http://localhost:9200/'
test/unit/weighted-connection-pool.test.ts 1> ]

So there's an interaction by alternating the getConnection and markDead calls. Note that this is what happen in an actual production scenario, as the transport requests a connection, fails the call with it because the node is down, and then flag the connection as dead

@pgayvallet
Copy link
Author

I then ran

  t.test('3 Connections, same weight', t => {
    const pool = new WeightedConnectionPool({ Connection: HttpConnection })
    pool.addConnection([
      'http://localhost:9200/',
      'http://localhost:9201/',
      'http://localhost:9202/',
      'http://localhost:9203/'
    ]);

    for (let i = 0; i < 10; i++) {
      const con = pool.getConnection(opts);
      if(con) {
        pool.markDead(con)
      }
    }

    let count = 0;

    try {
      for (let i = 0; i < 10000; i++) {
        const connection = pool.getConnection(opts);
        if(connection) {
          break
        }
        count++;
      }
      console.log(`count: ${count}`);
      t.pass('Distribution is ok')
    } catch (err: any) {
      t.error(err)
    }

    t.end()
  })

Making the number of initial failing calls, and the number of nodes, vary to see the amount of getConnection calls required before the pool actually returns a connection to a node.

The results:

2 nodes - 20 failing calls - count: 395
2 nodes - 30 failing calls - count: 385
2 nodes - 40 failing calls - count: 375

3 nodes - 10 failing calls - count: 252
3 nodes - 20 failing calls - count: 242
3 nodes - 30 failing calls - count: 232
3 nodes - 40 failing calls - count: 222

4 nodes - 10 failing calls - count: 0 (!)
4 nodes - 20 failing calls - count: 169
4 nodes - 30 failing calls - count: 159
4 nodes - 40 failing calls - count: 149

@TinaHeiligers
Copy link
Contributor

4 nodes - 10 failing calls - count: 0 (!)

seems strange, considering at first glance the fewer the number of failing calls, the higher the count for 2 and 3 nodes.

@pgayvallet
Copy link
Author

Yea, I have no idea why exactly, maybe there's a minimum amount of calls to have depending on number of nodes, or something. We could try with more nodes to see, but at this point, I think we collected enough element for a maintainer to pick that issue up eventually.

@TinaHeiligers
Copy link
Contributor

I think we collected enough element for a maintainer to pick that issue up eventually.

Agreed.

@jodevsa
Copy link

jodevsa commented Jul 29, 2022

I've opened an issue describing the main problem and the root cause here: elastic/elastic-transport-js#53

Copy link
Contributor

This issue is stale because it has been open 90 days with no activity. Remove the stale label, or leave a comment, or this will be closed in 14 days.

Copy link
Contributor

This issue is stale because it has been open 90 days with no activity. Remove the stale label, or leave a comment, or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Feb 13, 2024
@pgayvallet pgayvallet removed the stale label Feb 18, 2024
@JoshMock JoshMock self-assigned this Jun 7, 2024
@JoshMock JoshMock removed their assignment Jul 29, 2024
Copy link
Contributor

This issue is stale because it has been open 90 days with no activity. Remove the stale label, or leave a comment, or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Oct 29, 2024
@JoshMock JoshMock removed the stale label Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants