elasticsearch plugin: add a tag for node role #2158

Closed
animageofmine opened this issue Dec 14, 2016 · 13 comments · Fixed by #6064

Comments

@animageofmine

animageofmine commented Dec 14, 2016

Bug report

We have an Elasticsearch cluster of 9 nodes: 5 data nodes, 3 master nodes, and 1 client node. We use KairosDB to store Telegraf data and Grafana for graphs. One problem we are facing is grouping metrics by role (master, client, or data). It looks like each node in the Elasticsearch cluster returns the payload for the whole cluster. For example, if I want to monitor JVM heap (mem_heap_used_in_bytes) for only the data nodes, I can't find a way to do that, because each node returns JVM heap for all nodes, including data, master, and client nodes (each node is cluster-aware via Zen Discovery).

I'm not sure whether I am doing something wrong or my understanding is incorrect, but I wanted to check if there is a way to deal with this problem (I really hope I am doing something silly). Please see telegraf.conf below.

Relevant telegraf.conf:

[agent]
  hostname = "<OneoftheNodesInESCluster>"
  interval = "30s"
  round_interval = true
  metric_buffer_limit = 1000
  flush_buffer_when_full = true
  collection_jitter = "1s"
  flush_interval = "30s"
  flush_jitter = "5s"
  debug = false
  quiet = false

OUTPUTS:
[[outputs.opentsdb]]
  debug = false
  host = "<somekairosdbhost>"
  port = 4244
  prefix = "telegraf."

INPUTS:
[[inputs.docker]]
  interval = "2m"
  timeout = "30s"
[[inputs.elasticsearch]]
  cluster_health = true
  servers = ["http://localhost:9200"]
[[inputs.statsd]]
  allowed_pending_messages = 10000

System info:

Linux elasticsearchNodeData1 3.10.0-327.36.1.el7.x86_64 #1 SMP Sun Sep 18 13:04:29 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Telegraf version: Telegraf - version 1.0.0-beta3
All the nodes are dockerized with debian based build.

Steps to reproduce:

  1. Master has 8GB with 4 cores
  2. Client node has 4GB with 2 cores
  3. Data has 32GB with 16 cores

If we fetch some metric, say "mem_heap_used_in_bytes", this seems to fetch data from all the node types (data, client and master). Can't seem to find a way to isolate stats from each role.

There is a config option called "local = true", but I'm not sure what it is for.

Please let me know if you have any questions or need more info. Thank you.

@sparrc
Contributor

sparrc commented Dec 14, 2016

If you set local = true then I believe that the plugin will only collect stats on the host specified, rather than on the entire cluster.

This plugin doesn't tag metrics by role. Unfortunately, Elasticsearch doesn't make the node role available via its cluster API. From what I can tell, we might be able to get this info via a "cat nodes" query.
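
For readers wondering what `local` changes in practice, here is a minimal sketch. It assumes the plugin polls the node stats endpoint and that `local = true` simply adds `_local` to the path, so that only the node answering the request reports. The function name `node_stats_url` is hypothetical and is not taken from the plugin source.

```python
def node_stats_url(server: str, local: bool) -> str:
    """Return the node stats URL to poll for one configured server."""
    base = server.rstrip("/")
    # Assumption: "_local" restricts the response to the node that
    # receives the request; without it, every node in the cluster reports.
    path = "/_nodes/_local/stats" if local else "/_nodes/stats"
    return base + path


print(node_stats_url("http://localhost:9200", local=True))
print(node_stats_url("http://localhost:9200", local=False))
```

With one Telegraf agent per node and `local = true`, each agent would then report only its own node, which avoids collecting the whole cluster from every host.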

@sparrc changed the title from "Group metrics by node role" to "elasticsearch plugin: add a tag for node role" on Dec 14, 2016
@sparrc added this to the Future Milestone milestone on Dec 14, 2016
@animageofmine
Author

animageofmine commented Dec 14, 2016

Thank you for looking into this. Actually the cluster/node API does expose the role. See example below (look for roles).

BTW, would turning on local flag use a different API or the same?

"uZ8dyuLbQnG-ljlT35RQgA": {
         "timestamp": 1481673501955,
         "name": "es-datanode1",
         "transport_address": "10.10.10.26:9300",
         "host": "10.10.10.26",
         "ip": "10.10.10.26:9300",
         "roles": [
            "data",
            "ingest"
         ],
         "indices": {
            "docs": {
               "count": 2601120,
               "deleted": 0
            },
            "store": {
               "size_in_bytes": 3983729207,
               "throttle_time_in_millis": 0
            },

@sparrc
Contributor

sparrc commented Dec 14, 2016

@animageofmine, that is just a blob of JSON. Where did it come from? Which API? Can you provide a full request/response example?

@animageofmine
Author

animageofmine commented Dec 14, 2016

@sparrc Sure. Here is the information:

Query: curl localhost:9200/_nodes/_local
Payload: I ran this on my local box, since the payload from the cluster was too large to paste.

Let me know if you need more info. BTW, I can't seem to find a metric that reports cluster health status (green, yellow, red). Any idea?

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "elasticsearch",
  "nodes": {
    "ZOwb1f4DTVCQbuQpVu1jrw": {
      "name": "elk4node01",
      "transport_address": "10.2.240.172:9300",
      "host": "10.2.240.172",
      "ip": "10.2.240.172",
      "version": "5.0.1",
      "build_hash": "080bb47",
      "total_indexing_buffer": 426010214,
      "roles": [
        "master",
        "data",
        "ingest"
      ],
      "settings": {
        "pidfile": "/var/run/elasticsearch/elasticsearch.pid",
        "cluster": {
          "name": "elasticsearch"
        },
        "node": {
          "name": "elk4node01"
        },
        "path": {
          "conf": "/etc/elasticsearch",
          "data": [
            "/var/lib/elasticsearch"
          ],
          "logs": "/var/log/elasticsearch",
          "home": "/usr/share/elasticsearch"
        },
        "client": {
          "type": "node"
        },
        "http": {
          "type": {
            "default": "netty4"
          }
        },
        "transport": {
          "type": {
            "default": "netty4"
          }
        },
        "network": {
          "host": "0.0.0.0",
          "publish_host": "10.2.240.172"
        }
      },
      "os": {
        "refresh_interval_in_millis": 1000,
        "name": "Linux",
        "arch": "amd64",
        "version": "4.4.27-moby",
        "available_processors": 4,
        "allocated_processors": 4
      },
      "process": {
        "refresh_interval_in_millis": 1000,
        "id": 45,
        "mlockall": false
      },
      "jvm": {
        "pid": 45,
        "version": "1.8.0_111",
        "vm_name": "OpenJDK 64-Bit Server VM",
        "vm_version": "25.111-b14",
        "vm_vendor": "Oracle Corporation",
        "start_time_in_millis": 1481701191724,
        "mem": {
          "heap_init_in_bytes": 4294967296,
          "heap_max_in_bytes": 4260102144,
          "non_heap_init_in_bytes": 2555904,
          "non_heap_max_in_bytes": 0,
          "direct_max_in_bytes": 4260102144
        },
        "gc_collectors": [
          "ParNew",
          "ConcurrentMarkSweep"
        ],
        "memory_pools": [
          "Code Cache",
          "Metaspace",
          "Compressed Class Space",
          "Par Eden Space",
          "Par Survivor Space",
          "CMS Old Gen"
        ],
        "using_compressed_ordinary_object_pointers": "true"
      },
      "thread_pool": {
        "force_merge": {
          "type": "fixed",
          "min": 1,
          "max": 1,
          "queue_size": -1
        },
        "fetch_shard_started": {
          "type": "scaling",
          "min": 1,
          "max": 8,
          "keep_alive": "5m",
          "queue_size": -1
        },
        "listener": {
          "type": "fixed",
          "min": 2,
          "max": 2,
          "queue_size": -1
        },
        "index": {
          "type": "fixed",
          "min": 4,
          "max": 4,
          "queue_size": 200
        },
        "refresh": {
          "type": "scaling",
          "min": 1,
          "max": 2,
          "keep_alive": "5m",
          "queue_size": -1
        },
        "generic": {
          "type": "scaling",
          "min": 4,
          "max": 128,
          "keep_alive": "30s",
          "queue_size": -1
        },
        "warmer": {
          "type": "scaling",
          "min": 1,
          "max": 2,
          "keep_alive": "5m",
          "queue_size": -1
        },
        "search": {
          "type": "fixed",
          "min": 7,
          "max": 7,
          "queue_size": 1000
        },
        "flush": {
          "type": "scaling",
          "min": 1,
          "max": 2,
          "keep_alive": "5m",
          "queue_size": -1
        },
        "fetch_shard_store": {
          "type": "scaling",
          "min": 1,
          "max": 8,
          "keep_alive": "5m",
          "queue_size": -1
        },
        "management": {
          "type": "scaling",
          "min": 1,
          "max": 5,
          "keep_alive": "5m",
          "queue_size": -1
        },
        "get": {
          "type": "fixed",
          "min": 4,
          "max": 4,
          "queue_size": 1000
        },
        "bulk": {
          "type": "fixed",
          "min": 4,
          "max": 4,
          "queue_size": 50
        },
        "snapshot": {
          "type": "scaling",
          "min": 1,
          "max": 2,
          "keep_alive": "5m",
          "queue_size": -1
        }
      },
      "transport": {
        "bound_address": [
          "[::]:9300"
        ],
        "publish_address": "10.2.240.172:9300",
        "profiles": {}
      },
      "http": {
        "bound_address": [
          "[::]:9200"
        ],
        "publish_address": "10.2.240.172:9200",
        "max_content_length_in_bytes": 104857600
      },
      "plugins": [
        {
          "name": "repository-s3",
          "version": "5.0.1",
          "description": "The S3 repository plugin adds S3 repositories",
          "classname": "org.elasticsearch.plugin.repository.s3.S3RepositoryPlugin"
        }
      ],
      "modules": [
        {
          "name": "aggs-matrix-stats",
          "version": "5.0.1",
          "description": "Adds aggregations whose input are a list of numeric fields and output includes a matrix.",
          "classname": "org.elasticsearch.search.aggregations.matrix.MatrixAggregationPlugin"
        },
        {
          "name": "ingest-common",
          "version": "5.0.1",
          "description": "Module for ingest processors that do not require additional security permissions or have large dependencies and resources",
          "classname": "org.elasticsearch.ingest.common.IngestCommonPlugin"
        },
        {
          "name": "lang-expression",
          "version": "5.0.1",
          "description": "Lucene expressions integration for Elasticsearch",
          "classname": "org.elasticsearch.script.expression.ExpressionPlugin"
        },
        {
          "name": "lang-groovy",
          "version": "5.0.1",
          "description": "Groovy scripting integration for Elasticsearch",
          "classname": "org.elasticsearch.script.groovy.GroovyPlugin"
        },
        {
          "name": "lang-mustache",
          "version": "5.0.1",
          "description": "Mustache scripting integration for Elasticsearch",
          "classname": "org.elasticsearch.script.mustache.MustachePlugin"
        },
        {
          "name": "lang-painless",
          "version": "5.0.1",
          "description": "An easy, safe and fast scripting language for Elasticsearch",
          "classname": "org.elasticsearch.painless.PainlessPlugin"
        },
        {
          "name": "percolator",
          "version": "5.0.1",
          "description": "Percolator module adds capability to index queries and query these queries by specifying documents",
          "classname": "org.elasticsearch.percolator.PercolatorPlugin"
        },
        {
          "name": "reindex",
          "version": "5.0.1",
          "description": "The Reindex module adds APIs to reindex from one index to another or update documents in place.",
          "classname": "org.elasticsearch.index.reindex.ReindexPlugin"
        },
        {
          "name": "transport-netty3",
          "version": "5.0.1",
          "description": "Netty 3 based transport implementation",
          "classname": "org.elasticsearch.transport.Netty3Plugin"
        },
        {
          "name": "transport-netty4",
          "version": "5.0.1",
          "description": "Netty 4 based transport implementation",
          "classname": "org.elasticsearch.transport.Netty4Plugin"
        }
      ],
      "ingest": {
        "processors": [
          {
            "type": "append"
          },
          {
            "type": "convert"
          },
          {
            "type": "date"
          },
          {
            "type": "date_index_name"
          },
          {
            "type": "dot_expander"
          },
          {
            "type": "fail"
          },
          {
            "type": "foreach"
          },
          {
            "type": "grok"
          },
          {
            "type": "gsub"
          },
          {
            "type": "join"
          },
          {
            "type": "json"
          },
          {
            "type": "lowercase"
          },
          {
            "type": "remove"
          },
          {
            "type": "rename"
          },
          {
            "type": "script"
          },
          {
            "type": "set"
          },
          {
            "type": "sort"
          },
          {
            "type": "split"
          },
          {
            "type": "trim"
          },
          {
            "type": "uppercase"
          }
        ]
      }
    }
  }
}
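
To show where the roles sit in a payload like the one above, here is a minimal sketch that maps each node name to its `roles` list. It assumes the response has already been parsed into a dict. The helper name `roles_by_node` is hypothetical, and the sample is trimmed to the two fields the sketch reads.

```python
import json


def roles_by_node(payload: dict) -> dict:
    """Map each node's name to its list of roles."""
    return {
        info["name"]: info.get("roles", [])
        for info in payload.get("nodes", {}).values()
    }


# Trimmed copy of the _nodes/_local response shown above.
sample = json.loads("""
{
  "nodes": {
    "ZOwb1f4DTVCQbuQpVu1jrw": {
      "name": "elk4node01",
      "roles": ["master", "data", "ingest"]
    }
  }
}
""")

print(roles_by_node(sample))
```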

@Akshaykapoor

+1
It'll be good to have node_roles value in tags.

@sybrandy

The /_nodes/stats endpoint has the roles as well. It gets the information for all of the nodes in the cluster.
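
As a rough illustration of what having the roles in that response makes possible, the sketch below sums JVM heap usage per role from a `/_nodes/stats`-shaped dict. The node names and heap values are made up, the function name `heap_used_by_role` is hypothetical, and a node with several roles is counted once under each of them.

```python
from collections import defaultdict


def heap_used_by_role(stats: dict) -> dict:
    """Sum jvm.mem.heap_used_in_bytes over nodes, keyed by role."""
    totals = defaultdict(int)
    for node in stats.get("nodes", {}).values():
        heap = node["jvm"]["mem"]["heap_used_in_bytes"]
        # A node with several roles contributes to each of them.
        for role in node.get("roles", []):
            totals[role] += heap
    return dict(totals)


# Made-up sample in the shape of a /_nodes/stats response.
sample_stats = {
    "nodes": {
        "id-1": {
            "name": "es-datanode1",
            "roles": ["data", "ingest"],
            "jvm": {"mem": {"heap_used_in_bytes": 100}},
        },
        "id-2": {
            "name": "es-master1",
            "roles": ["master"],
            "jvm": {"mem": {"heap_used_in_bytes": 40}},
        },
    }
}

print(heap_used_by_role(sample_stats))
```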

@MatthewOHaraTR
Contributor

@sparrc Part of the problem is that the parser was set up to ignore strings and only process numeric data as metrics. In my recent merge that you accepted, I added the capability for the plugin to get the string data too (I needed it in the new API calls I was making). But I kept the node stats unchanged to avoid changing the behavior for anyone else. Maybe another plugin option could be used to control this behavior?

@sparrc
Contributor

sparrc commented Jan 21, 2017

we don't need to add a config option to add a tag to the metrics

@danielnelson removed this from the Future Milestone milestone on Jun 14, 2017
@eesprit
Contributor

eesprit commented Mar 28, 2018

Node role as a tag would be really useful. What is actually blocking it from being added?
Does it just need a PR?

@danielnelson
Contributor

@eesprit Yes, I think we just need a PR

@cyberaa

cyberaa commented May 14, 2018

+1
Would be a great addition to have, since roles are more common now.

dupondje pushed a commit to dupondje/telegraf that referenced this issue Jul 2, 2019
This adds node_roles as a tag to the exported elasticsearch metrics.
For example:
node_roles=master\,data\
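
The backslashes in that example look like InfluxDB line protocol escaping, where commas, equals signs, and spaces inside a tag value are prefixed with a backslash. Here is a minimal sketch of that rule, assuming line protocol is the output format in question. The function name `escape_tag_value` is hypothetical.

```python
def escape_tag_value(value: str) -> str:
    """Backslash-escape the characters that are special in a tag value."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


# A comma-joined role list keeps its commas, each preceded by a backslash.
print("node_roles=" + escape_tag_value("master,data"))
```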
@dupondje
Contributor

dupondje commented Jul 2, 2019

Pushed a possible fix in the commit above (tests not adjusted yet).
Is it this kind of output we want? Cause a node can have multiple roles, so we need a tag with multiple values here.

I don't know if we want an option to include/exclude them? (And what default do we use)?

@danielnelson
Contributor

Is it this kind of output we want? Cause a node can have multiple roles, so we need a tag with multiple values here.

I don't like it, but I think this is our only/best option. Would be really nice if we could send multiple values for a tag. My only suggestion is to sort the roles in the list so they will be in a stable order.
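
A minimal sketch of the sorting suggestion, assuming the roles are joined with commas into a single tag value. The function name `node_roles_tag` is hypothetical. Sorting first means the same set of roles always yields the same tag value, whatever order Elasticsearch returns them in.

```python
def node_roles_tag(roles: list) -> str:
    """Join roles into one tag value in a stable, sorted order."""
    return ",".join(sorted(roles))


print(node_roles_tag(["master", "data", "ingest"]))
print(node_roles_tag(["ingest", "master", "data"]))
```

Without the sort, two nodes with identical roles reported in a different order would end up in different series.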

@danielnelson added this to the 1.12.0 milestone on Jul 3, 2019

9 participants