[Fleet] Improve agent observability #78188

Open
5 of 13 tasks
mostlyjason opened this issue Sep 22, 2020 · 8 comments
Labels
design Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@mostlyjason
Contributor

mostlyjason commented Sep 22, 2020

Summary of the problem
We'd like to improve observability for agents so that operators have better insight into problems and enough information to troubleshoot and fix them in a timely manner. Additionally, the more insight we can share with users so they can fix issues on their own, the less often they will get stuck and need to file a support issue.

Potential scope, PM will need to better define it:

User stories*

  • As a Fleet user, I'd like to have better visibility into the health status of the agent and all the integrations running on it so I can identify problems.
  • As a Fleet user, I'd like to have better visibility into logs from the agent so I can troubleshoot and fix errors and other problems in a timely manner.
  • As a Fleet user, I'd like to have better visibility into metrics from the agent so I can troubleshoot and fix performance and capacity problems in a timely manner.

List known (technical) restrictions and requirements

Other
PM Lead @mukeshelastic
Design lead @hbharding
Collaborators @mostlyjason

@mostlyjason
Contributor Author

@mukeshelastic I filed this design issue for planning purposes. Please review and update as desired.

@katrin-freihofner
Contributor

@mostlyjason it says here "...Potential scope, PM will need to better define it..." when do you think this issue will be ready to be picked up?

@mostlyjason
Contributor Author

@mukeshelastic is the PM lead for this issue so I'll defer to him.

I believe some parts are ready, such as including the logstream component on the agent details page (#77189).

@mukeshelastic

@hbharding and I discussed the two buckets in which we will need design support:

  1. Researching and validating problems in agent observability with a few user interviews.
  2. Exploring and designing the experiences we want to build for the problems prioritized for the MVP.

@hbharding hbharding self-assigned this Oct 26, 2020
@ravikesarwani
Contributor

#81872

@hbharding
Contributor

hbharding commented Oct 28, 2020

Small update: per @mukeshelastic + @ravikesarwani, we want to scope the initial work for this ticket in #81872 and treat this issue more as an ongoing epic that will extend beyond 7.11.

cc @mostlyjason @ph @katrin-freihofner

@jen-huang jen-huang removed the v7.11.0 label Apr 27, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Apr 27, 2021
@jen-huang jen-huang added the Team:Fleet Team label for Observability Data Collection Fleet team label Apr 27, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Apr 27, 2021
@mtojek
Contributor

mtojek commented Jan 20, 2022

We had an offline conversation with @joshdover around improvements.

A noticeable number of SDH issues are coming in where the root cause, or one of the possible causes, turns out to be proxy connectivity. The customer has to dive into logs to figure out whether the proxy in use is operating properly (whether connections are established, no 503s, etc.).

I believe we could be more proactive and verify the connectivity between Agent and Elasticsearch, and between Agent and Fleet Server. I was thinking about a special technical policy first to verify all connections and settings, but maybe we can start with picking up the elastic-agent install feedback.

It would definitely help with researching customer problems ("Has your proxy ever worked?" vs "Is there a proxy outage now?").
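
To make the idea concrete, here is a rough sketch of what such a connectivity probe could look like. It is not how Agent or Fleet Server does this today; the function names and the three result labels are made up for illustration, and it only uses the Python standard library. It tries an HTTP request (optionally through a proxy) and reports whether a connection was established, whether the answer was a 503, or whether nothing was reachable at all.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify(status: Optional[int], error: Optional[str]) -> str:
    """Map a probe outcome to a coarse label (labels are hypothetical)."""
    if error is not None:
        # No HTTP response at all: DNS failure, refused connection, timeout...
        return "unreachable"
    if status == 503:
        # Something answered, but the proxy or upstream is unavailable.
        return "service_unavailable"
    # Any other HTTP status (including 401/404) proves a connection was made.
    return "connected"


def probe(url: str, proxy: Optional[str] = None, timeout: float = 5.0) -> str:
    """Send one GET to `url`, optionally via `proxy`, and classify the result."""
    handlers = []
    if proxy:
        # Explicit proxy; without it urllib falls back to environment settings.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return classify(resp.status, None)
    except urllib.error.HTTPError as exc:
        # HTTPError still carries a status code, so the connection worked.
        return classify(exc.code, None)
    except (urllib.error.URLError, OSError) as exc:
        return classify(None, str(exc))
```

A call could look like `probe("https://fleet-server.example:8220", proxy="http://proxy.example:3128")`, where both addresses are placeholders. Recording the result at install time and again later would help answer the "has it ever worked?" vs "is it down now?" question.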

8 participants