[Ingest Manager] Expose processes and their metrics #24788

Merged
merged 30 commits into from
Apr 9, 2021

Conversation

michalpristas
Contributor

What does this PR do?

Adds /processes and /processes/{processID} endpoints to the agent's HTTP server.

The agent serves these on unix:///tmp/elastic-agent/elastic-agent.sock, or on npipe:///elastic-agent on Windows.
The address is not configurable.

Example response from /processes:

{
	"processes": [{
		"id": "filebeat-default-monitoring",
		"pid": "8025",
		"binary": "filebeat",
		"source": {
			"kind": "internal",
			"outputs": ["default"]
		}
	}, {
		"id": "metricbeat-default-monitoring",
		"pid": "8043",
		"binary": "metricbeat",
		"source": {
			"kind": "internal",
			"outputs": ["default"]
		}
	}, {
		"id": "metricbeat-default",
		"pid": "7998",
		"binary": "metricbeat",
		"source": {
			"kind": "configured",
			"outputs": ["default"]
		}
	}]
}
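For consumers of /processes, the response above could be decoded along these lines. The struct and function names are illustrative and not taken from the PR; only the JSON keys come from the example. Note that pid is a string in the example, so it is decoded as one.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// processSource mirrors the "source" object in the example response.
type processSource struct {
	Kind    string   `json:"kind"`
	Outputs []string `json:"outputs"`
}

// processInfo mirrors one entry of the "processes" array.
type processInfo struct {
	ID     string        `json:"id"`
	PID    string        `json:"pid"` // a string in the example, not a number
	Binary string        `json:"binary"`
	Source processSource `json:"source"`
}

// processesResponse mirrors the top-level object.
type processesResponse struct {
	Processes []processInfo `json:"processes"`
}

// decodeProcesses parses a raw /processes payload.
func decodeProcesses(raw []byte) (processesResponse, error) {
	var out processesResponse
	err := json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	raw := []byte(`{"processes":[{"id":"metricbeat-default","pid":"7998","binary":"metricbeat","source":{"kind":"configured","outputs":["default"]}}]}`)
	resp, err := decodeProcesses(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, p := range resp.Processes {
		fmt.Printf("%s (pid %s, %s)\n", p.ID, p.PID, p.Source.Kind)
	}
}
```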

Example response from /processes/metricbeat-default:

{
	"beat": {
		"cgroup": {
			"cpu": {
				"cfs": {
					"period": {
						"us": 100000
					},
					"quota": {
						"us": 0
					}
				},
				"id": "user.slice",
				"stats": {
					"periods": 0,
					"throttled": {
						"ns": 0,
						"periods": 0
					}
				}
			},
			"cpuacct": {
				"id": "user.slice",
				"total": {
					"ns": 994162833024
				}
			},
			"memory": {
				"id": "user.slice",
				"mem": {
					"limit": {
						"bytes": 9223372036854771712
					},
					"usage": {
						"bytes": 1766760448
					}
				}
			}
		},
		"cpu": {
			"system": {
				"ticks": 150,
				"time": {
					"ms": 156
				}
			},
			"total": {
				"ticks": 220,
				"time": {
					"ms": 232
				},
				"value": 220
			},
			"user": {
				"ticks": 70,
				"time": {
					"ms": 76
				}
			}
		},
		"handles": {
			"limit": {
				"hard": 1048576,
				"soft": 1024
			},
			"open": 17
		},
		"info": {
			"ephemeral_id": "aad52edf-4229-4927-bb30-c67ce9934499",
			"uptime": {
				"ms": 26728
			}
		},
		"memstats": {
			"gc_next": 16893216,
			"memory_alloc": 14370808,
			"memory_sys": 75056128,
			"memory_total": 34212120,
			"rss": 85266432
		},
		"runtime": {
			"goroutines": 58
		}
	},
	"libbeat": {
		"config": {
			"module": {
				"running": 4,
				"starts": 4,
				"stops": 0
			},
			"reloads": 1,
			"scans": 1
		},
		"output": {
			"events": {
				"acked": 0,
				"active": 0,
				"batches": 0,
				"dropped": 0,
				"duplicates": 0,
				"failed": 0,
				"toomany": 0,
				"total": 0
			},
			"read": {
				"bytes": 0,
				"errors": 0
			},
			"type": "elasticsearch",
			"write": {
				"bytes": 0,
				"errors": 0
			}
		},
		"pipeline": {
			"clients": 4,
			"events": {
				"active": 35,
				"dropped": 0,
				"failed": 0,
				"filtered": 0,
				"published": 35,
				"retry": 44,
				"total": 35
			},
			"queue": {
				"acked": 0
			}
		}
	},
	"metricbeat": {
		"system": {
			"cpu": {
				"events": 3,
				"failures": 0,
				"success": 3
			},
			"filesystem": {
				"events": 12,
				"failures": 0,
				"success": 12
			},
			"memory": {
				"events": 3,
				"failures": 0,
				"success": 3
			},
			"network": {
				"events": 17,
				"failures": 0,
				"success": 17
			}
		}
	},
	"system": {
		"cpu": {
			"cores": 4
		},
		"load": {
			"1": 0.58,
			"15": 1.43,
			"5": 1.43,
			"norm": {
				"1": 0.145,
				"15": 0.3575,
				"5": 0.3575
			}
		}
	}
}

In case of an error, the response looks like this:

{
	"type": "UNEXPECTED",
	"reason": "failed fetching metrics: Get \"http://unix/stats\": dial unix /tmp/elastic-agent/default/metricbeat/metricbeat.sock: connect: no such file or directory"
}

Why is it important?

Fixes: #24091

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@michalpristas michalpristas added enhancement needs_backport PR is waiting to be backported to other branches. Team:Ingest Management Team:Elastic-Agent Label for the Agent team labels Mar 26, 2021
@michalpristas michalpristas self-assigned this Mar 26, 2021
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 26, 2021
@elasticmachine
Collaborator

elasticmachine commented Mar 26, 2021

💚 Build Succeeded

Build stats

  • Build Cause: Started by user Michal Pristas

  • Start Time: 2021-04-08T18:09:50.933+0000

  • Duration: 146 min 15 sec

  • Commit: f83d7b7

Test stats 🧪

Test Results
Failed 0
Passed 46934
Skipped 5132
Total 52066

💚 Flaky test report

Tests succeeded.

Contributor

@simitt simitt left a comment

Thanks @michalpristas ! Checked out your PR and it generally works as described. Left some minor comments.

The processes endpoints need to be exposed via a TCP port though, so that the information can be queried from other containers via http request. The port needs to be configurable. It's fine to do this in a follow-up PR, but it is a requirement for cloud to be able to collect the information (required for 7.13).

)

const (
procuctIDKey = "processID"
Contributor

The PR uses product, program and process at different places for logic dealing with processes. For consistency reasons, I think we should aim for always using process, which reduces the mental overhead.

metricsBytes, metricsErr := processMetrics(r.Context(), id)
if metricsErr != nil {
resp := errResponse{
Type: "UNEXPECTED",
Contributor

used a couple of times, maybe worth introducing a typeUnexpected

@michalpristas
Contributor Author

I will expose it over a configurable TCP port in this PR. This is still a draft, put up this way so you can pick it up as soon as possible.

@simitt is it OK to make it configurable in such a way that, when the configuration is missing, it won't be exposed, so I don't use ports when I don't need to? Is cloud capable of updating this port before the agent starts, or do you need a static port from the start?

@ruflin
Member

ruflin commented Mar 30, 2021

+1 on having it disabled by default. Will this be configurable through Fleet or not? The current preference would be that the person running Elastic Agent can set this, and that it is not necessarily available in Fleet.

@michalpristas
Contributor Author

As it is now, it's not configurable from Fleet.

@michalpristas
Contributor Author

@simitt I updated the solution with an HTTP endpoint; processes won't be exposed unless a port is specified in the agent.monitoring.port config option.

As we have not received any feedback from cloud yet, only the port can be specified, and it is exposed without SSL.

@simitt
Contributor

simitt commented Mar 31, 2021

thanks @michalpristas; I'll give it a try as soon as possible

@ruflin
Member

ruflin commented Mar 31, 2021

I set this up on port 81.

  • For metricbeat monitoring and non monitoring, I get the info I was looking for

For http://localhost:81/processes/filebeat-default-monitoring I got

{"type":"UNEXPECTED","reason":"failed fetching metrics: Get \"http://unix/stats\": dial unix /tmp/elastic-agent/default/filebeat/filebeat.sock: connect: no such file or directory"}

I was a bit surprised that / returns a 404. Maybe we could add some basic info here?

Every time the /processes endpoint is reloaded, the order of the output changes. Not a deal breaker but it confused me at first as I thought the content changed.
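If the changing order comes from iterating over a Go map (an assumption; the cause is not stated in this thread), sorting the entries by ID before encoding would make the output stable. A small sketch with a hypothetical processInfo type and sortedProcesses helper:

```go
package main

import (
	"fmt"
	"sort"
)

// processInfo is a trimmed-down, hypothetical stand-in for one list entry.
type processInfo struct {
	ID     string
	Binary string
}

// sortedProcesses flattens a map into a slice ordered by ID, so repeated
// calls produce the same order even though map iteration order varies.
func sortedProcesses(byID map[string]processInfo) []processInfo {
	out := make([]processInfo, 0, len(byID))
	for _, p := range byID {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	procs := map[string]processInfo{
		"metricbeat-default-monitoring": {ID: "metricbeat-default-monitoring", Binary: "metricbeat"},
		"filebeat-default-monitoring":   {ID: "filebeat-default-monitoring", Binary: "filebeat"},
		"metricbeat-default":            {ID: "metricbeat-default", Binary: "metricbeat"},
	}
	for _, p := range sortedProcesses(procs) {
		fmt.Println(p.ID)
	}
}
```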

# # process stats are exposed only using this option
# # it is up to a caller to make sure port is usable and free to use.
# # by default 0 is used meaning socket is used instead.
# port: 0
Member

I suggest we use the same config blocks as we have in beats: https://github.com/elastic/beats/blob/master/filebeat/filebeat.reference.yml#L2507

We need a host setting to decide whether it should be exposed only on localhost or more broadly. This also allows us to add the enabled option.

Contributor

Right, at least when it runs in Docker, localhost would not be sufficient.

Contributor Author

updated with http.enabled/host/port options

Member

Great. @simitt I assume on Cloud we can just use this in the template and either make the port configurable or hardcode it.

@michalpristas michalpristas marked this pull request as ready for review March 31, 2021 09:40
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

# # When using IP addresses, it is recommended to only use localhost.
# host: localhost
# # Port on which the HTTP endpoint will bind. Default is 0 meaning feature is disabled.
# port: 0
Member

Now that we have enabled/disabled, do we still need the support for 0?

return
}

fmt.Fprint(w, string(metricsBytes))
Member

I generally like metrics endpoints to be human readable as well, which in this context means pretty-printing the JSON. Unfortunately, that means parsing the payload as JSON first so that it can be printed with indentation. Then again, that should not cause too much overhead?

Contributor Author

How do you specify pretty print with Metricbeat? We can pass the argument on to Metricbeat if it is passed to the agent.

Member

As far as I remember there is a ?pretty flag: /stats/?pretty. Not sure if that works over the socket. I was initially thinking of implementing it here so that it works for all outputs, including the agent's own, but I have no strong preference.


I think pretty was a special flag provided by the expvar handler. I'm not sure we ever implemented it in Beats.

Member

BTW this is not a blocker, please ignore it for now.
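For reference, the pretty-printing idea from this thread can be done on the raw bytes with the standard library's json.Indent, which re-indents valid JSON without unmarshalling it into Go values. This is only a sketch; prettyJSON is a hypothetical helper and not part of the PR.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// prettyJSON re-indents raw JSON bytes with two-space indentation.
// It returns an error if the input is not valid JSON.
func prettyJSON(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	if err := json.Indent(&buf, raw, "", "  "); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	out, err := prettyJSON([]byte(`{"beat":{"runtime":{"goroutines":58}}}`))
	if err != nil {
		fmt.Println("invalid JSON:", err)
		return
	}
	fmt.Println(string(out))
}
```

A handler could call this only when a query parameter such as ?pretty is present, so the default response stays compact.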

@ruflin
Member

ruflin commented Apr 6, 2021

@michalpristas How do I reach the metric data from elastic-agent itself?

@michalpristas
Contributor Author

michalpristas commented Apr 6, 2021

/stats endpoint as with beats

type MonitoringHTTPConfig struct {
Enabled bool `yaml:"enabled" config:"enabled"`
Host string `yaml:"host" config:"host"`
Port int `yaml:"port" config:"port"`

Let's add a 'positive' validator (see the unpack docs). Then, if port is configured but empty, we will fail to parse the configuration and report the setting that failed. Enabled will be used to decide whether to start the server.
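The intended behaviour can be approximated with plain Go, shown here as a stand-in for the library validator rather than the validator itself. The struct fields are taken from the diff above; the listenAddress method, the accepted range of 1 to 65535, and port 6791 are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// MonitoringHTTPConfig carries the three settings from the diff above
// (struct tags omitted in this sketch).
type MonitoringHTTPConfig struct {
	Enabled bool
	Host    string
	Port    int
}

// listenAddress reports where the server should bind. ok is false when the
// endpoint is disabled; err is set when it is enabled with an unusable port.
func (c MonitoringHTTPConfig) listenAddress() (addr string, ok bool, err error) {
	if !c.Enabled {
		return "", false, nil // Enabled alone decides whether a server starts
	}
	if c.Port <= 0 || c.Port > 65535 {
		return "", false, fmt.Errorf("invalid port %d: must be between 1 and 65535", c.Port)
	}
	return net.JoinHostPort(c.Host, strconv.Itoa(c.Port)), true, nil
}

func main() {
	cfg := MonitoringHTTPConfig{Enabled: true, Host: "localhost", Port: 6791}
	addr, ok, err := cfg.listenAddress()
	fmt.Println(addr, ok, err)
}
```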

func processMetrics(ctx context.Context, id string) ([]byte, int, error) {
detail, err := parseID(id)
if err != nil {
return nil, http.StatusInternalServerError, err

This is not an internal error; the user provided invalid input.

blakerouse and others added 14 commits April 7, 2021 08:37
return nil, 0, errorWithStatus(http.StatusInternalServerError, err)
}

return rb, resp.StatusCode, nil

Can StatusCode be != 200 here?

Contributor Author

I don't know, but I would rather proxy whatever is retrieved from the beat than mix a 200 with an error message.

@michalpristas michalpristas merged commit 9625db6 into elastic:master Apr 9, 2021
michalpristas added a commit to michalpristas/beats that referenced this pull request Apr 12, 2021
[Ingest Manager] Expose processes and their metrics (elastic#24788)
michalpristas added a commit that referenced this pull request Apr 14, 2021
Cherry-pick #24788 to 7.x: Expose processes and their metrics  (#25017)
Development

Successfully merging this pull request may close these issues.

Exposing Data from Agent and APM Server
9 participants