Change local fleet-server connection to localhost:8221 #1867

Merged: 4 commits, Dec 8, 2022


michel-laterman
Contributor

What does this PR do?

Fix an issue where the local fleet-server port was not properly used by the elastic-agent when running an instance of fleet-server.

Why is it important?

When running a large number of agents, the fleet-server may hit its connection limit if the local agent connects on the same port as the other agents. This causes the elastic-agent running the fleet-server to be marked as degraded.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@michel-laterman michel-laterman added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-v8.5.0 Automated backport with mergify backport-v8.6.0 Automated backport with mergify labels Dec 1, 2022
Comment on lines +305 to +307
if c.options.FleetServer.InternalPort == 0 {
	c.options.FleetServer.InternalPort = defaultFleetServerInternalPort
}
Contributor Author

The InternalURL is set when the agent is creating the fleet tls settings (line 321), however it expects a non-zero internal port value in order to do so.

Contributor

This is fine. We have a check like this inside createFleetServerBootstrapConfig; I'm OK with moving this up.
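
For readers following this thread, here is a minimal sketch of the defaulting being discussed. It is not the code from `enroll_cmd.go`: `fleetServerOptions`, `applyDefaults`, and `internalURL` are hypothetical names, and the `uint16` field type is an assumption. The port value 8221 and the `localhost` host are taken from the PR title. TLS setup is omitted.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// defaultFleetServerInternalPort mirrors the constant named in the diff;
// the value 8221 is taken from the PR title.
const defaultFleetServerInternalPort = 8221

// fleetServerOptions is a hypothetical stand-in for the real enroll options.
type fleetServerOptions struct {
	InternalPort uint16 // assumed type, for illustration only
}

// applyDefaults sets the internal port when none was provided, so later
// steps always see a non-zero value.
func applyDefaults(o *fleetServerOptions) {
	if o.InternalPort == 0 {
		o.InternalPort = defaultFleetServerInternalPort
	}
}

// internalURL builds the host:port the local agent would use to reach its
// own fleet-server. It returns "" for a zero port, which models the problem
// described above: without a default, no internal address is produced.
func internalURL(o fleetServerOptions) string {
	if o.InternalPort == 0 {
		return ""
	}
	return net.JoinHostPort("localhost", strconv.Itoa(int(o.InternalPort)))
}

func main() {
	var opts fleetServerOptions
	fmt.Printf("before defaults: %q\n", internalURL(opts)) // prints ""
	applyDefaults(&opts)
	fmt.Printf("after defaults: %q\n", internalURL(opts)) // prints "localhost:8221"
}
```

Applying the default before the address is built is the point of moving the check up; an explicitly configured port is left untouched.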

@elasticmachine
Contributor

elasticmachine commented Dec 1, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-12-06T18:04:24.233+0000

  • Duration: 17 min 6 sec

Test stats 🧪

Test Results
Failed 0
Passed 4661
Skipped 13
Total 4674

💚 Flaky test report

Tests succeeded.

@elasticmachine
Contributor

elasticmachine commented Dec 1, 2022

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.333% (59/60) 👍
Files 69.082% (143/207) 👍
Classes 69.133% (271/392) 👍
Methods 53.988% (819/1517) 👍
Lines 39.154% (8868/22649) 👍 0.016
Conditionals 100.0% (0/0) 💚

@michel-laterman michel-laterman marked this pull request as ready for review December 2, 2022 00:49
@michel-laterman michel-laterman requested a review from a team as a code owner December 2, 2022 00:49
@michel-laterman michel-laterman requested review from aleksmaus and removed request for a team December 2, 2022 00:49
Contributor

@michalpristas michalpristas left a comment

Have you tested this with two agents: one running fleet-server, and another enrolling with that fleet-server?
Please make sure this scenario works.

@michel-laterman
Contributor Author

I've tested what port is used by adding a local debug line in remote/client.go

When I enroll a new agent running the fleet-server I see that the enroll request targets 8220

{"log.level":"debug","@timestamp":"2022-12-05T16:25:47.701-0800","log.origin":{"file.name":"remote/client.go","file.line":177},"message":"REMOTE DEBUG request URL https://Michels-MacBook-Pro.local:8220/api/fleet/agents/enroll?","ecs.version":"1.6.0"}

However, when the agent is running afterwards it switches to 8221:

{"log.level":"debug","@timestamp":"2022-12-06T00:49:28.169Z","log.origin":{"file.name":"remote/client.go","file.line":177},"message":"REMOTE DEBUG request URL https://localhost:8221/api/fleet/agents/30295209-b53d-4aba-9c40-cf8c3ef38091/checkin?","ecs.version":"1.6.0"}

Enrolling another agent always targets 8220:

{"log.level":"debug","@timestamp":"2022-12-06T00:52:55.565Z","log.origin":{"file.name":"remote/client.go","file.line":177},"message":"REMOTE DEBUG request URL https://192.168.4.20:8220/api/fleet/agents/a29b0ae9-f172-48da-bc69-a0b650fb3e08/checkin?","ecs.version":"1.6.0"}
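
The port check above can also be done mechanically. The sketch below is a hypothetical helper, not part of the agent: it assumes the temporary `REMOTE DEBUG request URL` message format shown in these log lines and reads only the `message` field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// debugPrefix is the message prefix of the temporary debug line shown above.
const debugPrefix = "REMOTE DEBUG request URL "

// requestTarget extracts the host:port and the path from one JSON log line.
func requestTarget(line string) (string, string, error) {
	var entry struct {
		Message string `json:"message"`
	}
	if err := json.Unmarshal([]byte(line), &entry); err != nil {
		return "", "", err
	}
	if !strings.HasPrefix(entry.Message, debugPrefix) {
		return "", "", fmt.Errorf("not a request debug line: %q", entry.Message)
	}
	u, err := url.Parse(strings.TrimPrefix(entry.Message, debugPrefix))
	if err != nil {
		return "", "", err
	}
	return u.Host, u.Path, nil
}

func main() {
	// Messages copied from the log lines above; other fields are dropped.
	lines := []string{
		`{"message":"REMOTE DEBUG request URL https://Michels-MacBook-Pro.local:8220/api/fleet/agents/enroll?"}`,
		`{"message":"REMOTE DEBUG request URL https://localhost:8221/api/fleet/agents/30295209-b53d-4aba-9c40-cf8c3ef38091/checkin?"}`,
		`{"message":"REMOTE DEBUG request URL https://192.168.4.20:8220/api/fleet/agents/a29b0ae9-f172-48da-bc69-a0b650fb3e08/checkin?"}`,
	}
	for _, l := range lines {
		host, path, err := requestTarget(l)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%-35s %s\n", host, path)
	}
}
```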

@michalpristas
Contributor

However, when the agent is running afterwards it switches to 8221:

Is this for a remote agent? A remote agent should not call this port; only the agent running fleet-server should.

@michel-laterman
Contributor Author

Oops, sorry I was not clear;

The local agent uses 8220 when enrolling and swaps to 8221 when running.
A remote agent always uses 8220.

@michalpristas
Contributor

Just to clarify the initial implementation and why we have this.
First, let's distinguish the internal and external communication we have here. Internal is the agent that is running this particular fleet-server; external is any other enrolled agent.
When we use a single port for both internal and external communication, the agent running fleet-server can be throttled with TooManyRequests because it shares resources with all the other agents. On rare occasions this can lead to the agent running fleet-server being forcefully unenrolled, because multiple checkins were dropped and the unenrollment timeout has passed. When this happens we lose all the agents enrolled to this particular fleet-server.

So we introduced an internal port that is used only by the agent running fleet-server. This prevents throttling and forced unenrollment of all agents under heavy load.

@sonarcloud

sonarcloud bot commented Dec 6, 2022

SonarCloud Quality Gate failed.

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information
7.6% Duplication

@michel-laterman
Contributor Author

After a brief discussion with @michalpristas we have decided to keep the behaviour of the local agent that this PR introduces: it uses 8220 to enrol and 8221 when running normally, as it should not run into load issues during the bootstrap process.
If required, we may be able to change the enrol port for the bootstrap process in the future.
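
The agreed behaviour can be summarised as a small decision function. This is a sketch only: `phase` and `fleetServerPort` are hypothetical names, and the port numbers are the ones seen in the logs earlier in this conversation.

```go
package main

import "fmt"

const (
	externalPort = 8220 // port seen in the enroll and remote checkin logs
	internalPort = 8221 // port the local agent uses once it is running
)

// phase is a hypothetical label for where an agent is in its lifecycle.
type phase int

const (
	enrolling phase = iota
	running
)

// fleetServerPort returns the port an agent targets. Only the agent that
// runs fleet-server switches to the internal port, and only after bootstrap.
func fleetServerPort(runsFleetServer bool, p phase) int {
	if runsFleetServer && p == running {
		return internalPort
	}
	return externalPort
}

func main() {
	fmt.Println("local agent, enrolling:", fleetServerPort(true, enrolling))   // 8220
	fmt.Println("local agent, running:", fleetServerPort(true, running))       // 8221
	fmt.Println("remote agent, enrolling:", fleetServerPort(false, enrolling)) // 8220
	fmt.Println("remote agent, running:", fleetServerPort(false, running))     // 8220
}
```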

@jlind23
Contributor

jlind23 commented Dec 8, 2022

@michel-laterman can we proceed with this change or is there anything else missing?

@michel-laterman michel-laterman merged commit 8c7537b into elastic:main Dec 8, 2022
@michel-laterman michel-laterman deleted the fleet-port branch December 8, 2022 17:21
mergify bot pushed a commit that referenced this pull request Dec 8, 2022
* Change local fleet-server connection to localhost:8221

Fix an issue where the local fleet-server port was not properly used by
the elastic-agent when running an instance of fleet-server.

* Fix typo

* Add additional debug line in remote client

* change to certificate verfication for local port

(cherry picked from commit 8c7537b)

# Conflicts:
#	internal/pkg/agent/cmd/enroll_cmd.go
mergify bot pushed a commit that referenced this pull request Dec 8, 2022
michel-laterman added a commit that referenced this pull request Dec 8, 2022
Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Labels
backport-v8.5.0 Automated backport with mergify backport-v8.6.0 Automated backport with mergify bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

5 participants