
[BUG] Unable to run opensearch-benchmark [OSB] execute-test against OSB provisioned cluster due to failing health check #380

Closed
cgchinmay opened this issue Sep 29, 2023 · 12 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

Comments

@cgchinmay
Collaborator

Describe the bug
opensearch-benchmark execute-test cannot run because the health check against the OSB-provisioned cluster never succeeds.

Tried running the opensearch-benchmark execute-test command as below:

opensearch-benchmark execute-test --pipeline=benchmark-only --workload=geonames --target-host=127.0.0.1:9200 --test-mode --kill-running-processes

The command hangs, and checking the logs I see repeated failing health checks:

tail -f ~/.benchmark/logs/benchmark.log
2023-09-28 23:07:35,325 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.013s]
2023-09-28 23:08:05,341 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.015s]
2023-09-28 23:08:38,359 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.017s]
2023-09-28 23:09:54,761 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:76.402s]
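Aside: wait_for_nodes=%3E%3D1 in these log lines is just the URL-encoded form of wait_for_nodes=>=1, i.e. "wait until at least one node has joined". A minimal sketch of how such a health-check URL is constructed (the helper name is illustrative, not OSB's actual code):

```python
from urllib.parse import urlencode

def health_check_url(host: str, port: int, expected_nodes: int = 1) -> str:
    """Build a cluster-health URL like the one OSB polls; '>=' is URL-encoded as %3E%3D."""
    query = urlencode({"wait_for_nodes": f">={expected_nodes}"})
    return f"http://{host}:{port}/_cluster/health?{query}"

print(health_check_url("127.0.0.1", 9200))
# -> http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1
```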

To Reproduce
Provision the cluster using the opensearch-benchmark PyPI utility:

opensearch-benchmark install --provision-config-instance=defaults --distribution-version=2.10.0 --node-name="osb-node-1" --network-host="127.0.0.1" --http-port=9200 --master-nodes="opensearch-node-1" --seed-hosts="127.0.0.1:9300" --quiet
{
  "installation-id": "7bedc677-82d4-48e8-866c-bf62250eca9d"
}

Started a single-node cluster using opensearch-benchmark:

opensearch-benchmark start --installation-id=7bedc677-82d4-48e8-866c-bf62250eca9d --test-execution-id=benchmark

Validated the cluster status:

curl localhost:9200
{
  "name" : "osb-node-1",
  "cluster_name" : "benchmark-provisioned-cluster",
  "cluster_uuid" : "_na_",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.10.0",
    "build_type" : "tar",
    "build_hash" : "eee49cb340edc6c4d489bcd9324dda571fc8dc03",
    "build_date" : "2023-09-20T23:54:29.889267151Z",
    "build_snapshot" : false,
    "lucene_version" : "9.7.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

Now when I try to access the cluster health endpoint, it returns an error with a 503 status code:

curl "localhost:9200/_cluster/health?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "cluster_manager_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "cluster_manager_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

Expected behavior
The cluster health API should return success, and opensearch-benchmark execute-test should run to completion.

Logs

-not-actor-/PID:31923 osbenchmark.test_execution_orchestrator INFO Test Execution id [6ecd84cc-a9c1-448c-9349-1e75928755f0]
2023-09-28 23:04:02,180 -not-actor-/PID:31923 osbenchmark.test_execution_orchestrator INFO User specified pipeline [benchmark-only].
2023-09-28 23:04:02,181 -not-actor-/PID:31923 osbenchmark.test_execution_orchestrator INFO Using configured hosts [{'host': '127.0.0.1', 'port': 9200}]
2023-09-28 23:04:02,182 -not-actor-/PID:31923 osbenchmark.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2023-09-28 23:04:02,186 ActorAddr-(T|:1900)/PID:31932 osbenchmark.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 8, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1695942242144'}] match requirements [{'coordinator': True}].
2023-09-28 23:04:32,261 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.012s]
2023-09-28 23:05:02,276 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.014s]

More Context:

  • Workload (share link for custom workloads): geonames
  • Service: OpenSearch
  • Version: 2.10.0


@cgchinmay cgchinmay added bug Something isn't working untriaged labels Sep 29, 2023
@gkamat gkamat added good first issue Good for newcomers and removed untriaged labels Sep 29, 2023
@IanHoang
Collaborator

IanHoang commented Oct 2, 2023

@cgchinmay OSB won't be able to start the test (unless we skip the cluster health check, which we don't recommend) because this is an issue with the cluster that was set up. This can be seen from how the logs show a 503 status (an error on the server side):

2023-09-28 23:07:35,325 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.013s]

This can also be confirmed by curling the cluster:

curl "localhost:9200/_cluster/health?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "cluster_manager_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "cluster_manager_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

Do you have only one cluster running, or did you provision others on your local host as well? I recommend checking whether there are any other clusters you might have provisioned.

@cgchinmay
Collaborator Author

cgchinmay commented Oct 3, 2023

@IanHoang I retried the above steps and there is a problem with the cluster provisioned using OSB. The health checks keep failing, which prevents the test from running.

However, I was able to execute the test against a cluster I provisioned myself using the Docker Compose instructions given here.

I also confirmed that we have processes listening on ports 9300 and 9200 after provisioning the cluster with OSB:

lsof -i :9300        
COMMAND   PID    USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    38919 chinmay  609u  IPv6 0x45b963dcff4dad67      0t0  TCP localhost:vrace (LISTEN)

For your reference, here is the output of the stats API for the cluster provisioned with OSB. I have stripped unnecessary details from the output.

curl -X GET "http://localhost:9200/_cluster/stats?pretty"
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "benchmark-provisioned-cluster",
  "cluster_uuid" : "_na_",
  "timestamp" : 1696293835154,
  "indices" : {
    "count" : 0,
    "shards" : { },
    "docs" : {
      "count" : 0,
      "deleted" : 0
    },
    ...
  },
  "nodes" : {
    "count" : {
      "total" : 1,
      "cluster_manager" : 1,
      "coordinating_only" : 0,
      "data" : 1,
      "ingest" : 1,
      "master" : 1,
      "remote_cluster_client" : 1,
      "search" : 0
    },
    "versions" : [
      "2.10.0"
    ],
    "os" : {
      "available_processors" : 8,
      "allocated_processors" : 8,
      "names" : [
        {
          "name" : "Mac OS X",
          "count" : 1
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Mac OS X",
          "count" : 1
        }
      ],
      "mem" : {
        "total_in_bytes" : 8589934592,
        "free_in_bytes" : 106086400,
        "used_in_bytes" : 8483848192,
        "free_percent" : 1,
        "used_percent" : 99
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 0
      },
      "open_file_descriptors" : {
        "min" : 613,
        "max" : 613,
        "avg" : 613
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 408743,
      "versions" : [
        {
          "version" : "17.0.8",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "17.0.8+7",
          "vm_vendor" : "Eclipse Adoptium",
          "bundled_jdk" : true,
          "using_bundled_jdk" : false,
          "count" : 1
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 147504128,
        "heap_max_in_bytes" : 1073741824
      },
      "threads" : 32
    },
    "fs" : {
      "total_in_bytes" : 245107195904,
      "free_in_bytes" : 159130832896,
      "available_in_bytes" : 159130832896,
      "cache_reserved_in_bytes" : 0
    },
    "plugins" : [
     ....
    ],
    "ingest" : {
      "number_of_pipelines" : 0,
      "processor_stats" : { }
    }
  }
}

@AkshathRaghav
Contributor

I was facing the same issue with the cluster setup and followed the instructions given by @rishabh6788 in Slack. Posting it here for visibility.

  1. Download the zip/tar.gz from https://opensearch.org/downloads.html
  2. Extract it, go into the opensearch folder, and open the config/opensearch.yml file.
  3. Add the following settings to it: discovery.type: single-node and plugins.security.disabled: true, then save and close the file.
  4. Run the opensearch-install.bat script (on Linux, opensearch-tar-install.sh) inside the opensearch folder.
  5. Check the output of curl "http://localhost:9200/_cluster/health?pretty" (curl.exe on Windows). The output should be similar to the below:
{
  "cluster_name" : "<name>",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 3,
  "active_shards" : 3,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
  6. Run this command to try geonames out: opensearch-benchmark execute-test --pipeline=benchmark-only --workload=geonames --target-host=127.0.0.1:9200 --test-mode --workload-params '{"number_of_shards":"1","number_of_replicas":"0"}'
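Step 3 above can be sketched as shell commands (the opensearch-2.10.0 directory name is an assumption; match it to wherever you extracted the archive):

```shell
# Append the single-node settings to the distribution's config file.
# The CONFIG path below is illustrative -- point it at your extracted folder.
CONFIG=opensearch-2.10.0/config/opensearch.yml
mkdir -p "$(dirname "$CONFIG")"   # no-op when the distribution is already extracted
cat >> "$CONFIG" <<'EOF'
discovery.type: single-node
plugins.security.disabled: true
EOF
grep 'discovery.type' "$CONFIG"
```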

@cgchinmay
Collaborator Author

cgchinmay commented Oct 12, 2023

Will take another look at this and update

@cgchinmay
Collaborator Author

cgchinmay commented Oct 25, 2023

Thanks @AkshathRaghav for the steps; I was able to see it working. I then looked into the OSB-provisioned cluster: it uses the following opensearch.yml file, which does not have discovery.type: single-node set by default. I updated the YAML file to include this setting, but then the cluster doesn't get provisioned at all, and I don't see any error in the logs either.

cc: @rishabh6788, @IanHoang — any suggestions on how to debug this?

Here is the updated YAML file:

cat ~/.benchmark/benchmarks/test_executions/7fc066bb-6cdb-46a6-be78-e6dd4d5e305d/osb-node-1/install/opensearch-2.10.0/config/opensearch.yml 
# ======================== OpenSearch Configuration =========================
#
# NOTE: OpenSearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: benchmark-provisioned-cluster
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: osb-node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: ['/Users/chinmay/.benchmark/benchmarks/test_executions/7fc066bb-6cdb-46a6-be78-e6dd4d5e305d/osb-node-1/install/opensearch-2.10.0/data']
#
# Path to log files:
#
path.logs: /Users/chinmay/.benchmark/benchmarks/test_executions/7fc066bb-6cdb-46a6-be78-e6dd4d5e305d/osb-node-1/logs/server
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# OpenSearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 127.0.0.1
#
# Set a custom port for HTTP:
#
http.port: 9200

transport.tcp.port: 9300
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ["127.0.0.1:9300"]
# Prevent split brain by specifying the initial master nodes.
cluster.initial_master_nodes: ["opensearch-node-1"]
discovery.type: single-node
#
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
plugins.security.disabled: true
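A possible reason for the silent failure with the edit above: single-node discovery generally conflicts with explicit bootstrap settings such as cluster.initial_master_nodes and discovery.seed_hosts, which the server may reject at startup (check the server logs for a bootstrap-check error). A rough sketch of a scanner for that conflict — the helper is illustrative, not part of OSB, and the naive line parsing is only meant for flat configs like the one above:

```python
# Hypothetical helper: flag discovery settings that conflict with single-node mode.
CONFLICTS = ("cluster.initial_master_nodes", "discovery.seed_hosts")

def single_node_conflicts(yml_text: str) -> list:
    """Return the conflicting keys if discovery.type is single-node, else []."""
    settings = {}
    for line in yml_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" in line:
            key, value = line.split(":", 1)
            settings[key.strip()] = value.strip()
    if settings.get("discovery.type") != "single-node":
        return []
    return [k for k in CONFLICTS if k in settings]

sample = """
discovery.seed_hosts: ["127.0.0.1:9300"]
cluster.initial_master_nodes: ["opensearch-node-1"]
discovery.type: single-node
"""
print(single_node_conflicts(sample))
# -> ['cluster.initial_master_nodes', 'discovery.seed_hosts']
```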

@OVI3D0
Member

OVI3D0 commented Aug 15, 2024

Hey @cgchinmay, I was looking into this a bit and narrowed it down to a potential problem with how OSB is populating the cluster.initial_master_nodes setting in the opensearch.yml config file.

Right now, if I hardcode the node name before provisioning and starting the cluster using the commands you provided:
cluster.initial_master_nodes: ["osb-node-1"]

I get healthy responses from the cluster:

curl localhost:9200
{
  "name" : "osb-node-1",
  "cluster_name" : "benchmark-provisioned-cluster",
  "cluster_uuid" : "rtOEAzCRR0G-aLRjTE791g",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.10.0",
    "build_type" : "tar",
    "build_hash" : "eee49cb340edc6c4d489bcd9324dda571fc8dc03",
    "build_date" : "2023-09-20T23:54:29.889267151Z",
    "build_snapshot" : false,
    "lucene_version" : "9.7.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

and

curl "http://localhost:9200/_cluster/health?pretty" 
{
  "cluster_name" : "benchmark-provisioned-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 3,
  "active_shards" : 3,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

And I can run the benchmark as intended.
cc: @IanHoang @gkamat

@cgchinmay
Collaborator Author

@OVI3D0 that's an interesting find. Will try this out.

@OVI3D0
Member

OVI3D0 commented Aug 15, 2024

For a little more context, this came after replicating your issue and investigating the logs under
~/.benchmark/benchmarks/test_executions/<installation-id>/osb-node-1/logs/server
where I saw the following warning:

tail -f benchmark-provisioned-cluster.log 
[2024-08-15T17:09:04,353][WARN ][o.o.c.c.ClusterFormationFailureHelper] [osb-node-1] cluster-manager not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover cluster-manager-eligible nodes [opensearch-node-1] to bootstrap a cluster: have discovered [{osb-node-1}{gULLiOVDS565rl8liTw5pw}{puMy3v0nS3KZIbZz_huXuQ}{127.0.0.1}{127.0.0.1:9300}{dimr}{shard_indexing_pressure_enabled=true}]; discovery will continue using [] from hosts providers and [{osb-node-1}{gULLiOVDS565rl8liTw5pw}{puMy3v0nS3KZIbZz_huXuQ}{127.0.0.1}{127.0.0.1:9300}{dimr}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 0, last-accepted version 0 in term 0


@IanHoang
Collaborator

Great find @OVI3D0! Are there any follow-up action items we can take to streamline this for our users (e.g. help text when users encounter the issue Chinmay described above, or automatically setting cluster.initial_master_nodes to "osb-node-1" by default)?

@gkamat
Collaborator

gkamat commented Aug 16, 2024

Isn't the OpenSearch distribution installation command provided in the description incorrect?

opensearch-benchmark install --provision-config-instance=defaults --distribution-version=2.10.0 --node-name="osb-node-1" --network-host="127.0.0.1" --http-port=9200 --master-nodes="opensearch-node-1" --seed-hosts="127.0.0.1:9300" --quiet

The master node specification should match the node name. Changing it to osb-node-1 appears to get the cluster working properly. Perhaps @OVI3D0 can check if further action is needed. Thanks.

@OVI3D0
Member

OVI3D0 commented Aug 19, 2024


You're right, this worked for me as well. If these values need to match, I think we can streamline this by adding a check before building the cluster to make sure they always do: #621
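The proposed pre-build check could look something like this sketch (the function name and message are illustrative, not the actual #621 implementation):

```python
def check_master_nodes_match(node_name, master_nodes):
    """Fail fast if the master-node list does not include the provisioned node's name."""
    if node_name not in master_nodes:
        raise ValueError(
            f"--master-nodes {master_nodes} does not include node name "
            f"'{node_name}'; the node can never discover a cluster-manager-"
            "eligible node and health checks will keep returning 503."
        )

check_master_nodes_match("osb-node-1", ["osb-node-1"])  # passes silently
```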

@cgchinmay
Collaborator Author

cgchinmay commented Aug 26, 2024

I will close this issue. Here is the correct command based on the above observations; the node name and the master node names must match:

opensearch-benchmark  install --provision-config-instance=defaults --distribution-version=2.10.0 --node-name="osb-node-1" --network-host="127.0.0.1" --http-port=9200 --master-nodes="osb-node-1" --seed-hosts="127.0.0.1:9300"
