
agent-mode loads output from policy #3411

Merged
24 commits merged into elastic:main on Apr 15, 2024

Conversation

michel-laterman
Contributor

@michel-laterman michel-laterman commented Mar 27, 2024

What is the problem this PR solves?

When running in agent mode, fleet-server can never connect to any additional Elasticsearch hosts.

From the discussions in the initial attempt to fix this issue in the elastic-agent (elastic/elastic-agent#4473), we have decided to treat the output config that the agent passes to the fleet-server more like a "bootstrap" block that the fleet-server can use to form an initial connection to ES and then pull more output config from the associated policy.

This approach does not solve the edge case where the URL used in bootstrapping (specified when enrolling/installing) goes down and the fleet-server needs to be restarted.
In this case fleet-server will not be healthy, as it can't gather output config from the bootstrap ES host.
Additionally, we still expect that the bootstrap ES host will never be removed as a valid output.

How does this PR solve the problem?

Allow the self-monitor that fleet-server starts in agent mode to send a config.Config struct through the server's config channel. This struct has only the Output and (new) RevisionIdx attributes set, using values retrieved from the policy output.
When fleet-server receives new config it handles this output-only config as a special case and merges it with the previous output config to produce an up-to-date, complete config.
When merging config, non-default values from the policy are preferred.
For an output block to be used, at least one host must be reachable.
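
As an illustration of the merge behaviour above, here is a minimal, hypothetical Go sketch. The names esOutput, mergeFromPolicy, and hasReachableHost are illustrative stand-ins rather than the real fleet-server config types, and the reachability check here is a plain TCP dial instead of the Elasticsearch client check the server actually performs.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// esOutput is a simplified stand-in for the Elasticsearch output block.
type esOutput struct {
	Hosts        []string
	ServiceToken string
}

// mergeFromPolicy prefers non-default (non-empty) values from the policy
// output and keeps the previous (bootstrap) values otherwise.
func mergeFromPolicy(previous, policy esOutput) esOutput {
	merged := previous
	if len(policy.Hosts) > 0 {
		merged.Hosts = policy.Hosts
	}
	if policy.ServiceToken != "" {
		merged.ServiceToken = policy.ServiceToken
	}
	return merged
}

// hasReachableHost reports whether at least one host accepts a TCP
// connection; the real implementation verifies connectivity through the
// Elasticsearch client instead.
func hasReachableHost(hosts []string) bool {
	for _, h := range hosts {
		if conn, err := net.DialTimeout("tcp", h, 2*time.Second); err == nil {
			conn.Close()
			return true
		}
	}
	return false
}

func main() {
	bootstrap := esOutput{Hosts: []string{"localhost:9200"}, ServiceToken: "enrollment-token"}
	policy := esOutput{Hosts: []string{"localhost:9200", "es-b.example.com:9200"}}

	merged := mergeFromPolicy(bootstrap, policy)
	fmt.Println("hosts:", merged.Hosts, "usable:", hasReachableHost(merged.Hosts))
}
```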

How to test this PR locally

  1. Enroll an elastic-agent instance running fleet-server and add another host to the Elasticsearch output in Fleet (UI); the new host does not need to be a valid Elasticsearch host.
  2. Gather a diagnostics bundle from the elastic-agent.
  3. The fleet-server.yml in the bundle should have multiple hosts in the output block, and the logs in the bundle should indicate that the output block from the latest revision is being used.

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman michel-laterman added the enhancement and Team:Fleet labels Mar 27, 2024
@michel-laterman
Contributor Author

buildkite test this

@michel-laterman michel-laterman force-pushed the agent-output-bootstrap branch 3 times, most recently from 5600e30 to 9732cbb Compare April 2, 2024 19:55
@michel-laterman
Contributor Author

buildkite test this

@michel-laterman michel-laterman changed the title from "wip" to "agent-mode loads output from policy" Apr 3, 2024
@michel-laterman michel-laterman marked this pull request as ready for review April 3, 2024 01:52
@michel-laterman michel-laterman requested a review from a team as a code owner April 3, 2024 01:52
@@ -379,6 +381,7 @@ func TestServerConfigErrorReload(t *testing.T) {
cancel()
}).Return(nil)
mReporter.On("UpdateState", client.UnitStateStopping, mock.Anything, mock.Anything).Return(nil)
mReporter.On("UpdateState", client.UnitStateFailed, mock.MatchedBy(func(err error) bool { return errors.Is(err, context.Canceled) }), mock.Anything).Return(nil).Maybe()
Contributor Author

I found that without this entry in the mocked calls list the test will panic: the cancel() call that runs when the state updates to healthy interrupts something and returns a context-cancelled error, which triggers a fail-state detection instead of a stopping call.

}}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
res := MergeElasticsearchFromPolicy(cfg, tc.pol)
Member

@michel-laterman why do we need to merge the policy here? Can't we use what comes from the config? The proxy and TLS settings should be configured there too, no? Otherwise there is no way for a user to change proxy and TLS from Fleet, right?

Contributor Author

Currently the elastic-agent injection will not replace keys that are already present in local config:

https://github.com/elastic/elastic-agent/blob/2873f66e40fbbb490dc1367cb4cf2c9ccc937ae3/internal/pkg/agent/application/fleet_server_bootstrap.go#L67-L77

So the agent can set tls.CAs from policy if it was not an enrollment arg (I'm not sure if it serializes locally), but the bigger issue is that the agent only ever sends a single host in the output block due to this behaviour.
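
For illustration only, here is a hedged sketch of the injection behaviour described above (this is not the linked elastic-agent code): policy values are injected only for keys that are missing from the locally persisted config, so enrollment-time values such as the single ES host are never replaced.

```go
package main

import "fmt"

// injectMissing copies policy values into the result only when the key is
// not already set locally; local (enrollment-time) values always win.
func injectMissing(local, policy map[string]any) map[string]any {
	out := make(map[string]any, len(local)+len(policy))
	for k, v := range policy {
		out[k] = v
	}
	for k, v := range local {
		out[k] = v // overwrite any policy value with the local one
	}
	return out
}

func main() {
	local := map[string]any{
		"hosts": []string{"https://es-a:9200"}, // the single host from enrollment
	}
	policy := map[string]any{
		"hosts":                       []string{"https://es-a:9200", "https://es-b:9200"},
		"ssl.certificate_authorities": []string{"/tmp/ca-certificates.crt"},
	}
	// hosts stays at the enrollment value; only the missing CA key is injected.
	fmt.Println(injectMissing(local, policy))
}
```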

Fleet-server will retrieve and use the output from the policy when running in agent-mode.
This allows the fleet-server to connect to multiple Elasticsearch hosts if it is successful when
connecting to the host provided at enrollment/installation.
We expect that the host provided during enrollment/installation is never removed as a valid output.
Contributor

@juliaElastic juliaElastic Apr 3, 2024

Should we raise an ingest-docs issue to document this feature?

Contributor Author

Yes, we should.

@cmacknz
Member

cmacknz commented Apr 11, 2024

This approach does not solve the edge case where the URL used in bootstrapping (specified when enrolling/installing) goes down and the fleet-server needs to be restarted.
In this case fleet-server will not be healthy, as it can't gather output config from the bootstrap ES host.
Additionally, we still expect that the bootstrap ES host will never be removed as a valid output.

As long as agent has received a policy from Fleet server at least once, it will have persisted it locally on disk. This should mean that except for the very first time agent checks in, there is a cached set of fleet server outputs that are going to arrive via the control protocol.

Are the control protocol outputs also added to the set of Fleet server hosts when they are received? They are the same as the ones Fleet Server could query itself, they just come in via a different path.

Removing the bootstrap broker is interesting; it might make sense to model the bootstrap broker as an explicit thing in Fleet if somebody asks us for this. Its purpose is to get the policy the very first time, before there are policy outputs to read from. At the same time, once you have gotten the policy at least once, there shouldn't ever be a need to rely only on the bootstrap output again.

@cmacknz
Member

cmacknz commented Apr 11, 2024

Modelling the bootstrap broker explicitly in diagnostics as something separate from the outputs in the policy might also be a way to account for it in diagnostics.

@michel-laterman
Contributor Author

@cmacknz, just to be clear, I have not adjusted any bootstrap behaviour in fleet-server or the elastic-agent.
I've just allowed the policy self-monitor to send updates to the config reload channel.

If the description of agent persistency is correct, we know that the (elastic-agent) component modifier adjusts the ES hosts it sends to fleet-server to use only what is specified during enrollment when elastic-agent is not restarted:

As long as agent has received a policy from Fleet server at least once, it will have persisted it locally on disk. This should mean that except for the very first time agent checks in, there is a cached set of fleet server outputs that are going to arrive via the control protocol.

So we can have an edge case that occurs as follows:

  • ES hosts are set to A, B
  • fleet-server is enrolled with host A
    • policy monitor will send A, B to reload
  • fleet-server policy is adjusted by the user (e.g., max_agents has changed)
  • elastic-agent will send the updated policy with only host A (if the agent has not been restarted)
    • policy monitor will then send A, B to reload again

I'm going to test what output is sent to a fleet-server after an agent restarts.
My plan is to run Elasticsearch + Kibana locally, then enroll an agent running fleet-server and explicitly pass the ES CA.
After it is healthy I'll adjust the output to use a different path to the (same) CA and an additional ES host, then restart the agent.

@cmacknz
Member

cmacknz commented Apr 12, 2024

If the description of agent persistency is correct, we know that the (elastic-agent) component modifier adjusts the ES hosts it sends to fleet-server to use only what is specified during enrollment when elastic-agent is not restarted:

Ugh right we are overriding the outputs in the policy. We'd need fleet server to cache the outputs list to get around this.

I don't want to blow up the scope of this change too much, but I wonder if all of this would be simpler if we defined Fleet server to always have two outputs, one is explicitly the bootstrap output and the other is the set of standard outputs in the policy (which may overlap with bootstrap). The bootstrap output only gets used if the standard outputs don't exist yet.

That way you'd benefit from the agent's cache of the most recent policy. Basically we'd want the agent to keep injecting an output; it would just be an additional output instead of overwriting the single output that exists today.
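
Purely as a hypothetical illustration of that proposal (nothing in this PR implements it), the shape could look roughly like this: an explicit bootstrap output kept alongside the cached policy outputs, with the bootstrap output used only while no policy outputs are known.

```go
package main

import "fmt"

type output struct {
	Name  string
	Hosts []string
}

// fleetOutputs models the proposed split between the enrollment-time
// bootstrap output and the outputs cached from the most recent policy.
type fleetOutputs struct {
	Bootstrap output   // injected at enrollment, never removed
	Policy    []output // cached from the latest policy check-in
}

// effective returns the policy outputs when any are known, otherwise the
// bootstrap output alone.
func (o fleetOutputs) effective() []output {
	if len(o.Policy) > 0 {
		return o.Policy
	}
	return []output{o.Bootstrap}
}

func main() {
	o := fleetOutputs{Bootstrap: output{Name: "bootstrap", Hosts: []string{"https://es-a:9200"}}}
	fmt.Println(o.effective()) // bootstrap only, before the first policy arrives

	o.Policy = []output{{Name: "default", Hosts: []string{"https://es-a:9200", "https://es-b:9200"}}}
	fmt.Println(o.effective()) // policy outputs take over afterwards
}
```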

@michel-laterman
Contributor Author

I tried the following for a test:

Created 8.14-SNAPSHOT on qa
Built an agent using the main branch (did not include fleet-server from this PR)

  • enrolled a new fleet-server/agent instance in fleet using a custom policy and default output with --es-certificate-authorities=/etc/ssl/certs/ca-certificates
  • copied ca file to another location (/tmp)
  • added ssl.certificate_authorities: ["/tmp/ca-certificates.crt"] to es output in kibana

The new ssl.certificate_authorities value did not appear in fleet-server.yml in any diagnostics I collected (before/after restarting the agent).

I then tried uninstalling the agent and re-installing without the --es-certificate-authorities flag, and found that fleet-server's output includes

        tls:
            verification_mode: full
            certificate_authorities:
                - /tmp/ca-certificates.crt
            renegotiation: never

and that the CA path appears in fleet-server's "initial server configuration" message.

The main finding is that the elastic-agent will never replace an attribute that was present during enrollment with a value from a policy when sending config to fleet-server.
As a result, we will still have the existing edge cases around restarts.

@michel-laterman
Contributor Author

buildkite test this

}

// getFleetOutputName returns the output name that the fleet-server input of the policy uses
func getFleetOutputName(p *model.Policy) (string, bool) {
Contributor Author

@juliaElastic I've added this method so we use the use_output attribute of the fleet-server input to get the output name directly, instead of scanning for the first elasticsearch-type output.
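
As a rough, hypothetical sketch of the idea (the real getFleetOutputName works against model.Policy, not generic maps): find the fleet-server input and return its use_output value, rather than falling back to the first elasticsearch-type output.

```go
package main

import "fmt"

// getOutputNameFromInputs is an illustrative stand-in for getFleetOutputName:
// it returns the use_output value of the fleet-server input, if one is set.
func getOutputNameFromInputs(inputs []map[string]any) (string, bool) {
	for _, input := range inputs {
		if t, _ := input["type"].(string); t != "fleet-server" {
			continue
		}
		if name, ok := input["use_output"].(string); ok && name != "" {
			return name, true
		}
	}
	return "", false
}

func main() {
	inputs := []map[string]any{
		{"type": "fleet-server", "use_output": "default"},
	}
	fmt.Println(getOutputNameFromInputs(inputs)) // default true
}
```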

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

3 New issues
0 Security Hotspots
69.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@michel-laterman michel-laterman merged commit fe7955b into elastic:main Apr 15, 2024
8 checks passed
@michel-laterman michel-laterman deleted the agent-output-bootstrap branch April 15, 2024 20:26
@lucabelluccini
Contributor

lucabelluccini commented Apr 15, 2024

Hello!

  • Would it be possible to cover/describe the behavior of the ES hosts for the Fleet Server integration in the documentation, especially if there is some subtle consequence on first enrollment?
  • Can we make sure the EA diagnostics reflect the real hosts being used?
  • Possibly, describe the new behavior in the known issue https://support.elastic.dev/knowledge/view/f69b6e46

@michel-laterman
Contributor Author

@cmacknz I created #3464 to handle output bootstrap tracking separately, to limit how large this PR grows. Should I make another issue in fleet-server to discuss/track my findings above?

The main finding is that the elastic-agent will never replace an attribute that was present during enrolment with a value from a policy when sending config to fleet-server.
As a result we still are going to have the existing edge cases around restarts

@lucabelluccini I can confirm that the fleet-server.yml will be accurate in collected diagnostics.
I'll add a section that outlines the expected behaviour for 8.14+ in our support docs.
I'll also work with our docs team to get our public docs updated as well.

@greicefaustino

Hey team, in which version will we have this fixed? Is there a tentative release version? Are there other dependencies to be merged?

@cmacknz
Member

cmacknz commented Apr 16, 2024

8.14.0 would be the first version with the fix.

michel-laterman added a commit that referenced this pull request Apr 24, 2024
michel-laterman added a commit that referenced this pull request Apr 24, 2024
mergify bot pushed a commit that referenced this pull request Apr 24, 2024
This reverts commit fe7955b.

(cherry picked from commit 6768f25)
@lucabelluccini
Contributor

lucabelluccini commented Apr 24, 2024

PLEASE NOTE: THIS FIX HAS BEEN REVERTED IN #3496
KB updated to reflect the old behavior https://support.elastic.co/knowledge/f69b6e46

michel-laterman added a commit that referenced this pull request Apr 24, 2024
This reverts commit fe7955b.

(cherry picked from commit 6768f25)

Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Successfully merging this pull request may close these issues.

Fleet Server configuration does not contain all the hosts available in the Elasticsearch cluster