Cherry-pick #22793 to 7.x: [Ingest Management] Agent expose metrics #23105

Merged
blakerouse merged 2 commits into elastic:7.x from backport_22793_7.x on Dec 14, 2020

Conversation

blakerouse
Contributor

@blakerouse blakerouse commented Dec 14, 2020

Cherry-pick of PR #22793 to 7.x branch. Original message:

What does this PR do?

Using the system package, focused on the agent process, we collect CPU, disk, and memory metrics, which are sent to ds.elastic_agent-elastic-agent.

At first I experimented with exposing an endpoint and using a Beat module to collect information about the agent, but I dropped that approach: most of the information collected by this module is not relevant, except for goroutines, and it bloats the code with unnecessary setup that provides empty values for fields which cannot be collected or reported from the agent's point of view.

Why is it important?

#22394

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Example of final doc

linux

{
  "_index": ".ds-metrics-elastic_agent.elastic_agent-default-000001",
  "_id": "1d6qYHYBIHKyMD4EYWSe",
  "_version": 1,
  "_score": null,
  "_source": {
    "@timestamp": "2020-12-14T09:52:26.506Z",
    "ecs": {
      "version": "1.7.0"
    },
    "metricset": {
      "period": 10000,
      "name": "json"
    },
    "service": {
      "address": "http://unix/stats",
      "type": "http"
    },
    "elastic_agent": {
      "snapshot": false,
      "version": "8.0.0",
      "id": "76069e50-3df1-11eb-a870-73a635b35320",
      "process": "elastic-agent"
    },
    "agent": {
      "version": "8.0.0",
      "ephemeral_id": "2198b42f-be51-45f9-acfe-0c3e47021d64",
      "id": "5ce18284-c42b-46a4-9149-5df70be787a6",
      "name": "vagrant",
      "type": "metricbeat"
    },
    "event": {
      "dataset": "elastic_agent.elastic_agent",
      "module": "http",
      "duration": 9794045
    },
    "host": {
      "architecture": "x86_64",
      "os": {
        "platform": "ubuntu",
        "version": "16.04.1 LTS (Xenial Xerus)",
        "family": "debian",
        "name": "Ubuntu",
        "kernel": "4.4.0-31-generic",
        "codename": "xenial"
      },
      "name": "vagrant",
      "id": "c0cc2a7efa902a719ada8ab6584b6bcb",
      "containerized": false,
      "ip": [
        "172.17.0.1"
      ],
      "mac": [
        "08:00:27:08:27:32"
      ],
      "hostname": "vagrant"
    },
    "data_stream": {
      "dataset": "elastic_agent.elastic_agent",
      "namespace": "default",
      "type": "metrics"
    },
    "system": {
      "process": {
        "cpu": {
          "system": {
            "ticks": 1190,
            "time": {
              "ms": 1196
            }
          },
          "total": {
            "time": {
              "ms": 4464
            },
            "value": 4450,
            "ticks": 4450
          },
          "user": {
            "ticks": 3260,
            "time": {
              "ms": 3268
            }
          }
        },
        "memory": {
          "size": 73482496
        },
        "fd": {
          "limit": {
            "hard": 4096,
            "soft": 1024
          },
          "open": 21
        },
        "cgroup": {
          "cpu": {
            "id": "elastic-agent.service",
            "stats": {
              "throttled": {
                "ns": 0,
                "periods": 0
              },
              "periods": 0
            },
            "cfs": {
              "quota": {
                "us": 0
              },
              "period": {
                "us": 100000
              }
            }
          },
          "cpuacct": {
            "id": "elastic-agent.service",
            "total": {
              "ns": 13885517853
            }
          },
          "memory": {
            "id": "elastic-agent.service",
            "mem": {
              "usage": {
                "bytes": 428773376
              },
              "limit": {
                "bytes": 9223372036854772000
              }
            }
          }
        }
      }
    }
  },
  "fields": {
    "@timestamp": [
      "2020-12-14T09:52:26.506Z"
    ]
  },
  "sort": [
    1607939546506
  ]
}

mac

{
	"_index": ".ds-metrics-elastic_agent.elastic_agent-default-000001",
	"_id": "2QxlPHYBjGFDnaF_EkU-",
	"_version": 1,
	"_score": null,
	"_source": {
		"@timestamp": "2020-12-07T08:50:24.348Z",
		"event": {
			"dataset": "elastic_agent.elastic_agent",
			"module": "http",
			"duration": 3040126
		},
		"metricset": {
			"name": "json",
			"period": 10000
		},
		"system": {
			"process": {
				"cpu": {
					"system": {
						"ticks": 1745,
						"time": {
							"ms": 1745
						}
					},
					"total": {
						"ticks": 7291,
						"time": {
							"ms": 7291
						},
						"value": 7291
					},
					"user": {
						"time": {
							"ms": 5546
						},
						"ticks": 5546
					}
				},
				"memory": {
					"size": 74531072
				}
			}
		},
		"host": {
			"mac": [
				"ac:de:48:ac:de:48"
			],
			"name": "MacBook-Pro-2.local",
			"hostname": "MacBook-Pro-2.local",
			"architecture": "x86_64",
			"os": {
				"name": "Mac OS X",
				"kernel": "18.7.0",
				"build": "18G6032",
				"platform": "darwin",
				"version": "10.14.6",
				"family": "darwin"
			},
			"id": "FC609F24-07E1-54EA-8E33-56F9D5A7A97E",
			"ip": [
				"127.0.0.2"
			]
		},
		"agent": {
			"ephemeral_id": "0cf156d9-4398-4c29-a52d-596ec7a93f5f",
			"id": "e09c86a1-f5dd-4fe8-898c-70de832e2a9e",
			"name": "MacBook-Pro-2.local",
			"type": "metricbeat",
			"version": "8.0.0"
		},
		"service": {
			"address": "http://unix/stats",
			"type": "http"
		},
		"data_stream": {
			"dataset": "elastic_agent.elastic_agent",
			"namespace": "default",
			"type": "metrics"
		},
		"elastic_agent": {
			"snapshot": false,
			"version": "8.0.0",
			"id": "02e6478a-72b9-4a5e-bd63-0f6be2ef4dba",
			"process": "elastic-agent"
		},
		"ecs": {
			"version": "1.6.0"
		}
	},
	"fields": {
		"@timestamp": [
			"2020-12-07T08:50:24.348Z"
		]
	},
	"sort": [
		1607331024348
	]
}
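
The two example documents can be sanity-checked with a short script. The following is a minimal sketch, not part of the PR: it assumes a document shaped like the linux example above, and the helper name `cpu_summary` is made up for illustration. It reads the process CPU ticks and the memory size and checks that system and user ticks add up to the total.

```python
import json

# Trimmed copy of the "system.process" section of the linux example above.
doc = json.loads("""
{
  "_source": {
    "system": {
      "process": {
        "cpu": {
          "system": {"ticks": 1190, "time": {"ms": 1196}},
          "total": {"time": {"ms": 4464}, "value": 4450, "ticks": 4450},
          "user": {"ticks": 3260, "time": {"ms": 3268}}
        },
        "memory": {"size": 73482496}
      }
    }
  }
}
""")


def cpu_summary(document):
    """Return (system, user, total) CPU ticks for the agent process.

    Hypothetical helper: the field paths follow the example document,
    not a documented schema.
    """
    cpu = document["_source"]["system"]["process"]["cpu"]
    return cpu["system"]["ticks"], cpu["user"]["ticks"], cpu["total"]["ticks"]


system_ticks, user_ticks, total_ticks = cpu_summary(doc)

# In both example documents, system + user ticks equal the total ticks.
ticks_consistent = system_ticks + user_ticks == total_ticks

# memory.size is assumed here to be in bytes; convert to MiB for readability.
memory_mib = doc["_source"]["system"]["process"]["memory"]["size"] / (1024 * 1024)

print(system_ticks, user_ticks, total_ticks, ticks_consistent, round(memory_mib, 1))
```

The same check holds for the mac document (1745 + 5546 = 7291), which has no `fd` or `cgroup` sections, so code reading those fields should treat them as optional.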

* [Ingest Manager] Log level reloadable from fleet (elastic#22690)

* aa

* create drop

* updated drop

* process contains everything

* drop start time

* undo exposed endpoint

* sanitize dataset name

* ups

* agent expose http

* collect all metrics from beats

* colelct all from beats

* golint

* cleaner docs

* updated structure

* cgroup

* long live file saving issues

(cherry picked from commit 49c8d87)
@botelastic botelastic bot added the needs_team label Dec 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/ingest-management (Team:Ingest Management)

@botelastic botelastic bot removed the needs_team label Dec 14, 2020
@blakerouse
Contributor Author

This PR contains #23106 for the changelog.

@blakerouse blakerouse requested a review from ph December 14, 2020 15:27
@elasticmachine
Collaborator

elasticmachine commented Dec 14, 2020

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.


Build stats

  • Build Cause: Pull request #23105 opened

  • Reason: Aborted from #2

  • Start Time: 2020-12-14T15:25:03.368+0000

  • Duration: 63 min 53 sec

  • Commit: 95581c1

Test stats 🧪

Test Results
Failed 0
Passed 17385
Skipped 1404
Total 18789

Steps errors 1


Terraform Apply on x-pack/metricbeat/module/aws
  • Took 0 min 26 sec.

Log output

Expand to view the last 100 lines of log output

[2020-12-14T16:22:49.835Z] + git config --get user.email
[2020-12-14T16:22:49.835Z] + [ -z  ]
[2020-12-14T16:22:49.835Z] + git config --global user.email beatsmachine@users.noreply.github.com
[2020-12-14T16:22:49.835Z] + git config --global user.name beatsmachine
[2020-12-14T16:22:50.493Z] + go mod download
[2020-12-14T16:23:07.112Z] + .ci/scripts/terraform-cleanup.sh x-pack/metricbeat
[2020-12-14T16:23:07.112Z] + DIRECTORY=x-pack/metricbeat
[2020-12-14T16:23:07.112Z] + FAILED=0
[2020-12-14T16:23:07.112Z] ++ find x-pack/metricbeat -name terraform.tfstate
[2020-12-14T16:23:07.112Z] + for tfstate in $(find $DIRECTORY -name terraform.tfstate)
[2020-12-14T16:23:07.112Z] ++ dirname x-pack/metricbeat/module/aws/terraform.tfstate
[2020-12-14T16:23:07.112Z] + cd x-pack/metricbeat/module/aws
[2020-12-14T16:23:07.112Z] + terraform destroy -auto-approve
[2020-12-14T16:23:09.045Z] random_id.suffix: Refreshing state... [id=SXmhqg]
[2020-12-14T16:23:09.045Z] random_password.db: Refreshing state... [id=none]
[2020-12-14T16:23:11.039Z] aws_sqs_queue.test: Refreshing state... [id=https://sqs.********.amazonaws.com/627286350134/metricbeat-test-4979a1aa]
[2020-12-14T16:23:11.039Z] aws_db_instance.test: Refreshing state... [id=metricbeat-test-4979a1aa]
[2020-12-14T16:23:11.039Z] aws_s3_bucket.test: Refreshing state... [id=metricbeat-test-4979a1aa]
[2020-12-14T16:23:17.811Z] aws_s3_bucket_metric.test: Refreshing state... [id=metricbeat-test-4979a1aa:EntireBucket]
[2020-12-14T16:23:17.811Z] aws_s3_bucket_object.test: Refreshing state... [id=someobject]
[2020-12-14T16:23:20.371Z] aws_s3_bucket_metric.test: Destroying... [id=metricbeat-test-4979a1aa:EntireBucket]
[2020-12-14T16:23:20.371Z] aws_s3_bucket_object.test: Destroying... [id=someobject]
[2020-12-14T16:23:20.638Z] aws_sqs_queue.test: Destroying... [id=https://sqs.********.amazonaws.com/627286350134/metricbeat-test-4979a1aa]
[2020-12-14T16:23:20.639Z] aws_db_instance.test: Destroying... [id=metricbeat-test-4979a1aa]
[2020-12-14T16:23:20.909Z] aws_s3_bucket_object.test: Destruction complete after 1s
[2020-12-14T16:23:20.909Z] aws_sqs_queue.test: Destruction complete after 1s
[2020-12-14T16:23:20.909Z] aws_s3_bucket_metric.test: Destruction complete after 1s
[2020-12-14T16:23:21.182Z] aws_s3_bucket.test: Destroying... [id=metricbeat-test-4979a1aa]
[2020-12-14T16:23:21.789Z] aws_s3_bucket.test: Destruction complete after 1s
[2020-12-14T16:23:32.251Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 10s elapsed]
[2020-12-14T16:23:40.656Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 20s elapsed]
[2020-12-14T16:23:50.831Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 30s elapsed]
[2020-12-14T16:24:01.353Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 40s elapsed]
[2020-12-14T16:24:11.412Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 50s elapsed]
[2020-12-14T16:24:21.441Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 1m0s elapsed]
[2020-12-14T16:24:31.567Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 1m10s elapsed]
[2020-12-14T16:24:41.644Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 1m20s elapsed]
[2020-12-14T16:24:51.716Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 1m30s elapsed]
[2020-12-14T16:25:00.413Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 1m40s elapsed]
[2020-12-14T16:25:10.463Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 1m50s elapsed]
[2020-12-14T16:25:20.877Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 2m0s elapsed]
[2020-12-14T16:25:31.046Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 2m10s elapsed]
[2020-12-14T16:25:41.607Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 2m20s elapsed]
[2020-12-14T16:25:51.721Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 2m30s elapsed]
[2020-12-14T16:26:01.938Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 2m40s elapsed]
[2020-12-14T16:26:11.986Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 2m50s elapsed]
[2020-12-14T16:26:21.839Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 3m0s elapsed]
[2020-12-14T16:26:31.080Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 3m10s elapsed]
[2020-12-14T16:26:41.830Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 3m20s elapsed]
[2020-12-14T16:26:52.338Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 3m30s elapsed]
[2020-12-14T16:27:00.516Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 3m40s elapsed]
[2020-12-14T16:27:10.870Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 3m50s elapsed]
[2020-12-14T16:27:21.309Z] aws_db_instance.test: Still destroying... [id=metricbeat-test-4979a1aa, 4m0s elapsed]
[2020-12-14T16:27:22.377Z] aws_db_instance.test: Destruction complete after 4m2s
[2020-12-14T16:27:22.377Z] random_id.suffix: Destroying... [id=SXmhqg]
[2020-12-14T16:27:22.377Z] random_password.db: Destroying... [id=none]
[2020-12-14T16:27:22.377Z] random_password.db: Destruction complete after 0s
[2020-12-14T16:27:22.377Z] random_id.suffix: Destruction complete after 0s
[2020-12-14T16:27:22.377Z] 
[2020-12-14T16:27:22.377Z] Destroy complete! Resources: 7 destroyed.
[2020-12-14T16:27:22.377Z] + cd -
[2020-12-14T16:27:22.377Z] /var/lib/jenkins/workspace/Beats_beats_PR-23105/src/github.com/elastic/beats/src/github.com/elastic/beats
[2020-12-14T16:27:22.377Z] + exit 0
[2020-12-14T16:27:23.700Z] Client: Docker Engine - Community
[2020-12-14T16:27:23.700Z]  Version:           19.03.14
[2020-12-14T16:27:23.700Z]  API version:       1.40
[2020-12-14T16:27:23.700Z]  Go version:        go1.13.15
[2020-12-14T16:27:23.700Z]  Git commit:        5eb3275d40
[2020-12-14T16:27:23.700Z]  Built:             Tue Dec  1 19:20:17 2020
[2020-12-14T16:27:23.700Z]  OS/Arch:           linux/amd64
[2020-12-14T16:27:23.701Z]  Experimental:      false
[2020-12-14T16:27:23.701Z] 
[2020-12-14T16:27:23.701Z] Server: Docker Engine - Community
[2020-12-14T16:27:23.701Z]  Engine:
[2020-12-14T16:27:23.701Z]   Version:          19.03.14
[2020-12-14T16:27:23.701Z]   API version:      1.40 (minimum version 1.12)
[2020-12-14T16:27:23.701Z]   Go version:       go1.13.15
[2020-12-14T16:27:23.701Z]   Git commit:       5eb3275d40
[2020-12-14T16:27:23.701Z]   Built:            Tue Dec  1 19:18:45 2020
[2020-12-14T16:27:23.701Z]   OS/Arch:          linux/amd64
[2020-12-14T16:27:23.701Z]   Experimental:     false
[2020-12-14T16:27:23.701Z]  containerd:
[2020-12-14T16:27:23.701Z]   Version:          1.3.9
[2020-12-14T16:27:23.701Z]   GitCommit:        ea765aba0d05254012b0b9e595e995c09186427f
[2020-12-14T16:27:23.701Z]  runc:
[2020-12-14T16:27:23.701Z]   Version:          1.0.0-rc10
[2020-12-14T16:27:23.701Z]   GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
[2020-12-14T16:27:23.701Z]  docker-init:
[2020-12-14T16:27:23.701Z]   Version:          0.18.0
[2020-12-14T16:27:23.701Z]   GitCommit:        fec3683
[2020-12-14T16:27:34.117Z] Stage "Packaging" skipped due to when conditional
[2020-12-14T16:27:34.316Z] Running in /var/lib/jenkins/workspace/Beats_beats_PR-23105/src/github.com/elastic/beats
[2020-12-14T16:27:54.046Z] Running on Jenkins in /var/lib/jenkins/workspace/Beats_beats_PR-23105
[2020-12-14T16:27:54.267Z] [INFO] getVaultSecret: Getting secrets
[2020-12-14T16:27:54.502Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2020-12-14T16:27:56.933Z] + chmod 755 generate-build-data.sh
[2020-12-14T16:27:56.933Z] + ./generate-build-data.sh https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-23105/ https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-23105/runs/1 ABORTED 3773216
[2020-12-14T16:27:56.933Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-23105/runs/1/steps/?limit=10000 -o steps-info.json
[2020-12-14T16:28:06.916Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-23105/runs/1/tests/?status=FAILED -o tests-errors.json
[2020-12-14T16:28:07.828Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-23105/runs/1/log/ -o pipeline-log.txt

Contributor

@ph ph left a comment


change lgtm, but the tests failed?

@ph
Contributor

ph commented Dec 14, 2020

@v1v Have you ever seen that?

[2020-12-14T15:41:25.769Z] Error: Error creating DB Instance: InstanceQuotaExceeded: DB Instance quota exceeded
[2020-12-14T15:41:25.769Z] 	status code: 400, request id: 7ce18407-a537-4d4e-ad21-75cd3730160b
[2020-12-14T15:41:25.769Z] 
[2020-12-14T15:41:25.769Z]   on terraform.tf line 18, in resource "aws_db_instance" "test":
[2020-12-14T15:41:25.769Z]   18: resource "aws_db_instance" "test" {
[2020-12-14T15:41:25.769Z] 
[2020-12-14T15:41:25.769Z] 
script returned exit code 1

@blakerouse blakerouse merged commit 380be4f into elastic:7.x Dec 14, 2020
@blakerouse blakerouse deleted the backport_22793_7.x branch December 14, 2020 19:17
@v1v
Member

v1v commented Dec 14, 2020

Yep, that's related to the cloud testing that was enabled by default, so it seems there is a bottleneck in the number of resources.

@jsoriano pointed out that cloud testing was on demand rather than by default, but when moving to the monorepo jenkinsfile.yml approach I enabled cloud testing by default. I'm not sure what the best strategy is here.

@jsoriano
Member

@jsoriano pointed out that cloud testing was on demand rather than by default, but when moving to the monorepo jenkinsfile.yml approach I enabled cloud testing by default. I'm not sure what the best strategy is here.

Maybe we should disable them?

[2020-12-14T15:41:25.769Z] Error: Error creating DB Instance: InstanceQuotaExceeded: DB Instance quota exceeded

This is worrying. Could it be that DB instances are not being properly destroyed?
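
One way to look into this for a single build is to pair the "Destroying..." and "Destruction complete" lines in the Terraform cleanup log. The following is a rough sketch under stated assumptions: the function name `destroy_report` is made up, the line format is taken from the log output quoted earlier in this conversation, and the check only covers one log, so it cannot show whether instances leak across builds.

```python
import re
from datetime import datetime

# Matches lines such as:
# [2020-12-14T16:23:20.639Z] aws_db_instance.test: Destroying... [id=...]
# "Still destroying..." progress lines do not match and are skipped.
LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\] (?P<res>[\w.]+): "
    r"(?P<what>Destroying\.\.\.|Destruction complete)"
)


def destroy_report(log_text):
    """Return (durations in seconds, resources never reported as destroyed)."""
    started, finished = {}, {}
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
        target = started if m["what"].startswith("Destroying") else finished
        target[m["res"]] = ts
    durations = {
        res: (finished[res] - started[res]).total_seconds()
        for res in started
        if res in finished
    }
    pending = sorted(res for res in started if res not in finished)
    return durations, pending


# Four lines copied from the log output above.
sample = """\
[2020-12-14T16:23:20.639Z] aws_db_instance.test: Destroying... [id=metricbeat-test-4979a1aa]
[2020-12-14T16:23:21.182Z] aws_s3_bucket.test: Destroying... [id=metricbeat-test-4979a1aa]
[2020-12-14T16:23:21.789Z] aws_s3_bucket.test: Destruction complete after 1s
[2020-12-14T16:27:22.377Z] aws_db_instance.test: Destruction complete after 4m2s
"""

durations, pending = destroy_report(sample)
print(durations, pending)
```

For the lines quoted in this build, every resource that started destruction also completed it, and the DB instance took about four minutes, so this particular cleanup run looks complete.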
