Elastic Agent non-Windows host - Agent doesn't finish installing until after re-starting the service #150

Closed
EricDavisX opened this issue Jul 29, 2021 · 25 comments
Labels
8.5-candidate, bug (Something isn't working), QA:Validated (Validated by the QA Team), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@EricDavisX
Contributor

Hi, this is a spin-off of testing done in support of
elastic/beats#26665

  • testing done with 7.14 BC4 Agent and Cloud based stack

I'm transferring this issue from the Endpoint team to Beats / Agent.

From @dikshachauhan-qasource: we attempted to validate the endpoint behavior on French VMs and found it working fine, with a small glitch.

Observations:
Scenario 1:
Installed the agent under a policy that includes Endpoint.

The agent remained in the Updating state until we manually restarted the elastic-agent service.
The host then updated to Healthy status and was available under the Endpoint tab with status 'success'.
Data streams were working fine.
All binaries were in a running state.
Recording:
https://user-images.githubusercontent.com/12970373/127567223-9c1fd3ee-4216-4837-b0a6-2d6cb45d0300.mp4

Scenario 2:
Unenrolled, then reinstalled the agent under the same policy that includes Endpoint.

Observations same as mentioned above.

Scenario 3:
Unenrolled, then reinstalled the agent under the Default policy. After the agent was installed, we added Endpoint security.

Observations same as mentioned above.
screenshot:
windows-10-french

Logs.zip:
logs-french-win-10-agent.zip

@EricDavisX added the bug (Something isn't working) and Team:Elastic-Agent (Label for the Agent team) labels Jul 29, 2021
@elasticmachine
Contributor

Pinging @elastic/agent (Team:Agent)

@andresrc added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label Oct 14, 2021
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@andresrc removed the Team:Elastic-Agent (Label for the Agent team) label Oct 14, 2021
@andresrc
Contributor

There were some fixes related to i18n. Can this be tested again?

@EricDavisX
Contributor Author

@dikshachauhan-qasource @amolnater-qasource thanks for the help, can you re-test and post back?

@dikshachauhan-qasource

Hi @EricDavisX

Due to a vSphere issue today, we were unable to validate this. We will reattempt it and share our observations accordingly.

Thanks
QAS

@dikshachauhan-qasource

Hi @EricDavisX

We re-attempted to validate the current behavior of the agent with Endpoint on a non-English VM and observed the same behavior as shared above.

  • The agent is stuck in the Updating state until the user manually restarts the agent service.

Build details:
Version : 7.15.1 GA
Artifact link: https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-7.15.1-windows-x86_64.zip
BUILD 44185
COMMIT c1aa1ef8dc38b511ce4a647fe92ca0348aadd834

Screenshot:
image

image

Please let me know if any further info is required from our end.

Thanks
QAS

@EricDavisX
Contributor Author

I realize now that the testing we did was after the expected fixes, so I am not surprised here. @andresrc @jlind23 if you think this reflects expected world-wide usage, could we prioritize the research? Michal at least has Endgame cluster vSphere access to a non-English host we can spin up for him if needed, and we can get more folks access as needed to spread the workload. Let me know by email.

@jlind23
Contributor

jlind23 commented Oct 19, 2021

@EricDavisX did you check that the build number included the related fixes? If yes, then it seems that we missed some things. @andresrc would you mind giving more context here? Do English hosts work fine?

@EricDavisX
Contributor Author

@jlind23 hi, glad to confirm: the 7.15.1 build is intended to have the fixes. Perhaps if we review the backport we'd find a problem? The original PR is elastic/beats#26665

This is a testing hole we were lucky to find (Alvaro found it for us!). After much review, we found code that needed to reference a word translated into a different language: in this case the folder on the host is called 'administradores' instead of 'administrators' in Windows, so the code was updated. It was believed to be working, but QA reported this immediately after the fixes went into full testing, so we can follow up again. There is a pair of hosts in the Endgame vSphere cluster that are prepped for testing if we want them. IT helpdesk can grant access to you or others as needed.
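
To illustrate why a hardcoded English group name can fail on a localized Windows host, here is a minimal, platform-independent sketch. It is not the fix applied in elastic/beats#26665, whose details are not shown in this thread. The `host` type, `newHost`, `sidByName`, and `nameBySID` are hypothetical names, and the account table is simulated in memory; a real implementation would ask the Windows API to resolve the identifier. The only external fact relied on is that the built-in Administrators group has the well-known SID `S-1-5-32-544` regardless of display language.

```go
package main

import "fmt"

// adminSID is the well-known SID of the built-in Administrators group.
const adminSID = "S-1-5-32-544"

// host is a hypothetical in-memory stand-in for a Windows account database.
type host struct {
	groups map[string]string // localized group name -> SID
}

// newHost builds a simulated host whose built-in admin group carries the
// display name used by that UI language (illustrative values only).
func newHost(lang string) host {
	names := map[string]string{
		"en": "Administrators",
		"fr": "Administrateurs",
		"es": "Administradores",
	}
	return host{groups: map[string]string{names[lang]: adminSID}}
}

// sidByName looks a group up by display name. This breaks when the caller
// hardcodes the English name on a localized host.
func (h host) sidByName(name string) (string, bool) {
	sid, ok := h.groups[name]
	return sid, ok
}

// nameBySID finds the local display name for a SID, independent of language.
func (h host) nameBySID(sid string) (string, bool) {
	for name, s := range h.groups {
		if s == sid {
			return name, true
		}
	}
	return "", false
}

func main() {
	fr := newHost("fr")

	// The English name does not exist on the simulated French host.
	_, ok := fr.sidByName("Administrators")
	fmt.Println("lookup by English name succeeded:", ok)

	// The SID resolves no matter which language the host uses.
	name, _ := fr.nameBySID(adminSID)
	fmt.Println("lookup by SID resolved to:", name)
}
```

The design point is that an identifier which does not change with the UI language is a safer key than a translated display name, which would otherwise need one entry per supported language.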

@jlind23
Contributor

jlind23 commented Oct 19, 2021

@EricDavisX I saw another thread in which Blake was saying that maybe the version was not downloaded properly. Might it be related to this? #148

@EricDavisX
Contributor Author

Thanks. Assigning to @amolnater-qasource to follow up and see if we can reproduce, and if so if we can isolate the downloading failure or prove it is something else.

@amolnater-qasource

Hi @EricDavisX

We have revalidated this issue by installing elastic-agent on a French VM and found it still reproducible.

  • The agent is stuck in the Updating state until the user manually restarts the agent service.

Kibana/Build details:

7.15.1 Kibana cloud environment.
Build: 44185
Commit: c1aa1ef8dc38b511ce4a647fe92ca0348aadd834

We observed the [error] log line below for metricbeat under the Logs tab:
13:52:33.532 elastic_agent.metricbeat [elastic_agent.metricbeat][error] Error dialing dial tcp 127.0.0.1:9200: connectex: Aucune connexion n’a pu être établie car l’ordinateur cible l’a expressément refusée.
(English translation: No connection could be made because the target computer expressly refused it.)

Debug level logs:
logs.zip

Note:
The elastic-agent artifact was downloaded completely without any errors.

cc: @jlind23
Please let me know if anything else is required.
Thanks

@jlind23
Contributor

jlind23 commented Oct 21, 2021

Port 9200 is used for elasticsearch communication, isn't it? @amolnater-qasource is your elasticsearch pod running?

@blakerouse
Contributor

Looks like a configuration issue as metricbeat cannot communicate with elasticsearch. Did you configure the output settings correctly in the Fleet UI?

@EricDavisX
Contributor Author

The description cites a 7.14 BC4 cloud stack, so I'm betting the test didn't include changing anything in the Fleet output settings. I'm unassigning Amol; this seems to need an engineering owner.

@amolnater-qasource

Hi @jlind23

@amolnater-qasource is your elasticsearch pod running?

We were testing on the 7.15.1 cloud production environment, and agents installed as Healthy on every OS other than the non-English Windows host.
So yes, it should be running, as the issue is only observed on the non-English Windows host.

@blakerouse

Looks like a configuration issue as metricbeat cannot communicate with elasticsearch. Did you configure the output settings correctly in the Fleet UI?

Do we need any separate settings to install the agent on a non-English Windows host?
We followed the steps we usually follow to install the elastic-agent, as per the guide: Install a Fleet-managed Elastic Agent

  • Downloaded and extracted the elastic-agent artifact.
  • From the extracted folder, we executed the install command available in the Fleet UI.

Please let us know if anything else is required from our end.
Thanks

@jlind23
Contributor

jlind23 commented Oct 25, 2021

As @blakerouse is out, maybe @michalpristas could give us pointers about non-English Windows host configuration.

@jlind23
Contributor

jlind23 commented Jan 18, 2022

@amolnater-qasource I'm digging through old issues. Is this still relevant?

@amolnater-qasource

amolnater-qasource commented Jan 19, 2022

Hi @jlind23
We have revalidated this issue on the latest 8.0 Snapshot Kibana cloud environment.
We attempted this on Windows VMs in three different languages:

  • French Windows 10
  • Spanish Windows 10
  • Korean Windows 10

No issues were observed on the Spanish and Korean VMs:

  • The agent installed directly as Healthy, and no service restart was needed.

However, this issue is still reproducible on the French VM:

  • The elastic-agent on the French VM was stuck in the Updating state and required a manual restart of the service.

Screenshot:

19

Build details:
BUILD: 49040
COMMIT: 155e06787e48de9a8de4345d86a826e95edf32ec
Artifact Link: https://snapshots.elastic.co/8.0.0-129ef708/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-windows-x86_64.zip

Please let us know if anything else is required from our end.
Thanks

@jlind23 transferred this issue from elastic/beats Mar 7, 2022
@amolnater-qasource

Hi @jlind23
We have revalidated this issue on the latest 8.4 Snapshot Kibana cloud environment and found it still reproducible.

  • The elastic-agent on the French VM was stuck in the Updating state and required a manual service restart to become Healthy.

Screenshots:
18
19

Build details:
BUILD: 54160
COMMIT: b509d2466e88bf6c4386d8dd5fe89b5c8a54a97f
Artifact Link: https://snapshots.elastic.co/8.4.0-0384b1d2/downloads/beats/elastic-agent/elastic-agent-8.4.0-SNAPSHOT-windows-x86_64.zip

Thanks

@jlind23
Contributor

jlind23 commented Jul 6, 2022

@amolnater-qasource do we have logs saying why the agent is stuck?

@mdelapenya do you know if we can select the OS language in the e2e testing framework?

@amolnater-qasource

Hi @jlind23
Please find attached below the logs for the agent installed on the French Windows 10 VM.

Logs:
elastic-agent-diagnostics-2022-07-07T10-16-38Z-00.zip

Further, we observed the error logs below for the elastic_agent.metricbeat dataset.

15:37:50.574 elastic_agent.metricbeat [elastic_agent.metricbeat][error] Error fetching data for metricset system.filesystem: error getting filesystem usage for D:\: GetDiskFreeSpaceEx failed: Le périphérique n’est pas prêt.
15:37:52.198 elastic_agent.metricbeat [elastic_agent.metricbeat][error] Failed to connect to backoff(elasticsearch(http://127.0.0.1:9200)): Get "http://127.0.0.1:9200": dial tcp 127.0.0.1:9200: connectex: Aucune connexion n’a pu être établie car l’ordinateur cible l’a expressément refusée.
15:37:54.269 elastic_agent.metricbeat [elastic_agent.metricbeat][error] Error dialing dial tcp 127.0.0.1:9200: connectex: Aucune connexion n’a pu être établie car l’ordinateur cible l’a expressément refusée.
(English translation: "Le périphérique n’est pas prêt." means "The device is not ready."; "Aucune connexion n’a pu être établie car l’ordinateur cible l’a expressément refusée." means "No connection could be made because the target computer expressly refused it.")

Screenshot:
9

Internet on this VM is working fine.

Please let us know if anything else is required from our end.
Thanks

@jlind23
Contributor

jlind23 commented Jul 7, 2022

@amolnater-qasource looking at the logs, it seems that Elasticsearch is unreachable on port 9200.
Moreover, there is another issue related to this particular filesystem not being ready: D:\: GetDiskFreeSpaceEx

Does it ring a bell?

@mdelapenya
Contributor

@mdelapenya do you know if we can select the OS language in the e2e testing framework?

No, we cannot, as we have a Windows 2019 AMI pre-baked with Packer to be used as the target instance for the E2E tests. I've just commented in elastic/e2e-testing#2805 that I'd suggest having manual tests for language support.

@amolnater-qasource

Hi @jlind23

We have revalidated this issue on the 8.6.0 release Kibana cloud-production environment and found it fixed now.

Host:
French Windows 10

Observations:

  • The elastic-agent on the French VM installs as Healthy and requires no manual restart.

Build details:
BUILD: 58852
COMMIT: d3a625ef4a6e611a5b3233a1ce5cbe8ef429eb47
Artifact Link: https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.6.0-windows-x86_64.zip

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-01-12.15-08-50.mp4

Hence, we are closing this issue and marking it as QA:Validated.

Thanks

@amolnater-qasource added the QA:Validated (Validated by the QA Team) label Jan 12, 2023
8 participants