
agent-mode loads output from policy #3411

Merged
24 commits merged into elastic:main on Apr 15, 2024

Conversation

michel-laterman
Contributor

@michel-laterman michel-laterman commented Mar 27, 2024

What is the problem this PR solves?

When running in agent mode, fleet-server can never connect to any additional Elasticsearch hosts.

From the discussions in the initial attempt to fix this issue in the elastic-agent (elastic/elastic-agent#4473), we have decided to treat the output config that the agent passes to the fleet-server more like a "bootstrap" block that the fleet-server can use to form an initial connection to ES and then pull more output config from the associated policy.

This approach does not solve the edge case where the URL used in bootstrapping (specified when enrolling/installing) goes down and the fleet-server needs to be restarted.
In this case fleet-server will not be healthy, as it can't gather output config from the bootstrap ES host.
Additionally, we still expect that the bootstrap ES host will never be removed as a valid output.

How does this PR solve the problem?

Allow the self-monitor that fleet-server starts in agent mode to send a config.Config struct through the server's config channel. This struct has only the Output and (new) RevisionIdx attributes set, using values retrieved from the policy output.
When fleet-server receives new config it handles this output-only config as a special case and merges it with the previous output config to produce an up-to-date, complete config.
When merging config, non-default values from the policy are preferred.
For an output block to be used, at least one host must be reachable.
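
As an illustration of the merge behaviour above, here is a minimal, hypothetical Go sketch. The names esOutput, mergeFromPolicy, and hasReachableHost are illustrative stand-ins rather than the real fleet-server config types, and the reachability check here is a plain TCP dial instead of the Elasticsearch client check the server actually performs.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// esOutput is a simplified stand-in for the Elasticsearch output block.
type esOutput struct {
	Hosts        []string
	ServiceToken string
}

// mergeFromPolicy prefers non-default (non-empty) values from the policy
// output and keeps the previous (bootstrap) values otherwise.
func mergeFromPolicy(previous, policy esOutput) esOutput {
	merged := previous
	if len(policy.Hosts) > 0 {
		merged.Hosts = policy.Hosts
	}
	if policy.ServiceToken != "" {
		merged.ServiceToken = policy.ServiceToken
	}
	return merged
}

// hasReachableHost reports whether at least one host accepts a TCP
// connection; the real implementation verifies connectivity through the
// Elasticsearch client instead.
func hasReachableHost(hosts []string) bool {
	for _, h := range hosts {
		if conn, err := net.DialTimeout("tcp", h, 2*time.Second); err == nil {
			conn.Close()
			return true
		}
	}
	return false
}

func main() {
	bootstrap := esOutput{Hosts: []string{"localhost:9200"}, ServiceToken: "enrollment-token"}
	policy := esOutput{Hosts: []string{"localhost:9200", "es-b.example.com:9200"}}

	merged := mergeFromPolicy(bootstrap, policy)
	fmt.Println("hosts:", merged.Hosts, "usable:", hasReachableHost(merged.Hosts))
}
```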

How to test this PR locally

  1. Enroll an elastic-agent instance running fleet-server and add another host to the Elasticsearch output in Fleet (UI); the new host does not need to be a valid Elasticsearch host.
  2. Gather a diagnostics bundle from the elastic-agent.
  3. The fleet-server.yml in the bundle should have multiple hosts in the output block, and the logs in the bundle should indicate that the output block from the latest revision is being used.

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman michel-laterman added the enhancement and Team:Fleet labels Mar 27, 2024
@michel-laterman
Contributor Author

buildkite test this

@michel-laterman michel-laterman force-pushed the agent-output-bootstrap branch 3 times, most recently from 5600e30 to 9732cbb Compare April 2, 2024 19:55
@michel-laterman
Contributor Author

buildkite test this

@michel-laterman michel-laterman changed the title from "wip" to "agent-mode loads output from policy" Apr 3, 2024
@michel-laterman michel-laterman marked this pull request as ready for review April 3, 2024 01:52
@michel-laterman michel-laterman requested a review from a team as a code owner April 3, 2024 01:52
@@ -379,6 +381,7 @@ func TestServerConfigErrorReload(t *testing.T) {
cancel()
}).Return(nil)
mReporter.On("UpdateState", client.UnitStateStopping, mock.Anything, mock.Anything).Return(nil)
mReporter.On("UpdateState", client.UnitStateFailed, mock.MatchedBy(func(err error) bool { return errors.Is(err, context.Canceled) }), mock.Anything).Return(nil).Maybe()
Contributor Author

I found that without this entry in the mocked calls list the test will panic: the cancel() call that runs when the state updates to healthy interrupts something and returns a context-cancelled error, which triggers a fail-state detection instead of a stopping call.

}}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
res := MergeElasticsearchFromPolicy(cfg, tc.pol)
Member

@michel-laterman why do we need to merge the policy here? Can't we use what comes from the config? The proxy and TLS settings should be configured there too, no? Otherwise there is no way for a user to change proxy and TLS from Fleet, right?

Contributor Author

Currently the elastic-agent injection will not replace keys that are already present in local config:

https://github.com/elastic/elastic-agent/blob/2873f66e40fbbb490dc1367cb4cf2c9ccc937ae3/internal/pkg/agent/application/fleet_server_bootstrap.go#L67-L77

So the agent can set tls.CAs from policy if it was not an enrollment arg (I'm not sure if it serializes locally), but the bigger issue is that the agent only ever sends a single host in the output block due to this behaviour.
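
For illustration only, here is a hedged sketch of the injection behaviour described above (this is not the linked elastic-agent code): policy values are injected only for keys that are missing from the locally persisted config, so enrollment-time values such as the single ES host are never replaced.

```go
package main

import "fmt"

// injectMissing copies policy values into the result only when the key is
// not already set locally; local (enrollment-time) values always win.
func injectMissing(local, policy map[string]any) map[string]any {
	out := make(map[string]any, len(local)+len(policy))
	for k, v := range policy {
		out[k] = v
	}
	for k, v := range local {
		out[k] = v // overwrite any policy value with the local one
	}
	return out
}

func main() {
	local := map[string]any{
		"hosts": []string{"https://es-a:9200"}, // the single host from enrollment
	}
	policy := map[string]any{
		"hosts":                       []string{"https://es-a:9200", "https://es-b:9200"},
		"ssl.certificate_authorities": []string{"/tmp/ca-certificates.crt"},
	}
	// hosts stays at the enrollment value; only the missing CA key is injected.
	fmt.Println(injectMissing(local, policy))
}
```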

Fleet-server will retrieve and use the output from the policy when running in agent-mode.
This allows the fleet-server to connect to multiple Elasticsearch hosts if it is successful when
connecting to the host provided at enrollment/installation.
We expect that the host provided during enrollment/installation is never removed as a valid output.
Contributor

@juliaElastic juliaElastic Apr 3, 2024

Should we raise an ingest-docs issue to document this feature?

Contributor Author

Yes, we should.

@cmacknz
Member

cmacknz commented Apr 11, 2024

This approach does not solve the edge case where the URL used in bootstrapping (specified when enrolling/installing) goes down and the fleet-server needs to be restarted.
In this case fleet-server will not be healthy, as it can't gather output config from the bootstrap ES host.
Additionally, we still expect that the bootstrap ES host will never be removed as a valid output.

As long as agent has received a policy from Fleet server at least once, it will have persisted it locally on disk. This should mean that except for the very first time agent checks in, there is a cached set of fleet server outputs that are going to arrive via the control protocol.

Are the control protocol outputs also added to the set of Fleet server hosts when they are received? They are the same as the ones Fleet Server could query itself, they just come in via a different path.

Removing the bootstrap broker is interesting; it might make sense to model the bootstrap broker as an explicit thing in Fleet if somebody asks us for this. Its purpose is to get the policy the very first time, before there are policy outputs to read from. At the same time, once you have gotten the policy at least once, there shouldn't ever be a need to rely only on the bootstrap output again.

@cmacknz
Member

cmacknz commented Apr 11, 2024

Modelling the bootstrap broker explicitly in diagnostics as something separate from the outputs in the policy might also be a way to account for it in diagnostics.

@michel-laterman
Contributor Author

@cmacknz, just to be clear, I have not adjusted any bootstrap behaviour in fleet-server or the elastic-agent.
I've just allowed the policy self-monitor to send updates to the config reload channel.

If the description of agent persistency is correct, we know that the (elastic-agent) component modifier adjusts the ES hosts it sends to fleet-server to use only what is specified during enrollment when elastic-agent is not restarted:

As long as agent has received a policy from Fleet server at least once, it will have persisted it locally on disk. This should mean that except for the very first time agent checks in, there is a cached set of fleet server outputs that are going to arrive via the control protocol.

So we can have an edge case that occurs as follows:

  • ES hosts are set to A, B
  • fleet-server is enrolled with host A
    • policy monitor will send A, B to reload
  • fleet-server policy is adjusted by the user (e.g., max_agents has changed)
  • elastic-agent will send the updated policy with only host A (if the agent has not been restarted)
    • policy monitor will then send A, B to reload again

I'm going to test what output is sent to a fleet-server after an agent restarts.
My plan is to run Elasticsearch + Kibana locally, then enroll an agent running fleet-server and explicitly pass the ES CA.
After it is healthy I'll adjust the output to use a different path to the (same) CA and an additional ES host, then restart the agent.

@cmacknz
Member

cmacknz commented Apr 12, 2024

If the description of agent persistency is correct, we know that the (elastic-agent) component modifier adjusts the ES hosts it sends to fleet-server to use only what is specified during enrollment when elastic-agent is not restarted:

Ugh right we are overriding the outputs in the policy. We'd need fleet server to cache the outputs list to get around this.

I don't want to blow up the scope of this change too much, but I wonder if all of this would be simpler if we defined Fleet server to always have two outputs, one is explicitly the bootstrap output and the other is the set of standard outputs in the policy (which may overlap with bootstrap). The bootstrap output only gets used if the standard outputs don't exist yet.

That way you'd benefit from the agent's cache of the most recent policy. Basically we'd want the agent to keep injecting an output; it would just be an additional output instead of overwriting the single output that exists today.
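
Purely as a hypothetical illustration of that proposal (nothing in this PR implements it), the shape could look roughly like this: an explicit bootstrap output kept alongside the cached policy outputs, with the bootstrap output used only while no policy outputs are known.

```go
package main

import "fmt"

type output struct {
	Name  string
	Hosts []string
}

// fleetOutputs models the proposed split between the enrollment-time
// bootstrap output and the outputs cached from the most recent policy.
type fleetOutputs struct {
	Bootstrap output   // injected at enrollment, never removed
	Policy    []output // cached from the latest policy check-in
}

// effective returns the policy outputs when any are known, otherwise the
// bootstrap output alone.
func (o fleetOutputs) effective() []output {
	if len(o.Policy) > 0 {
		return o.Policy
	}
	return []output{o.Bootstrap}
}

func main() {
	o := fleetOutputs{Bootstrap: output{Name: "bootstrap", Hosts: []string{"https://es-a:9200"}}}
	fmt.Println(o.effective()) // bootstrap only, before the first policy arrives

	o.Policy = []output{{Name: "default", Hosts: []string{"https://es-a:9200", "https://es-b:9200"}}}
	fmt.Println(o.effective()) // policy outputs take over afterwards
}
```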

@michel-laterman
Contributor Author

I tried the following for a test:

Created 8.14-SNAPSHOT on qa
Built an agent using the main branch (did not include fleet-server from this PR)

  • enrolled a new fleet-server/agent instance in fleet using a custom policy and default output with --es-certificate-authorities=/etc/ssl/certs/ca-certificates
  • copied ca file to another location (/tmp)
  • added ssl.certificate_authorities: ["/tmp/ca-certificates.crt"] to es output in kibana

The new ssl.certificate_authorities value did not appear in fleet-server.yml in any diagnostics I collected (before/after restarting the agent).

I then tried uninstalling the agent and re-installing without the --es-certificate-authorities flag, and found that fleet-server's output includes

        tls:
            verification_mode: full
            certificate_authorities:
                - /tmp/ca-certificates.crt
            renegotiation: never

and that the CA path appears in fleet-server's "initial server configuration" message.

The main finding is that the elastic-agent will never replace an attribute that was present during enrollment with a value from a policy when sending config to fleet-server.
As a result, we will still have the existing edge cases around restarts.

@michel-laterman
Contributor Author

buildkite test this

}

// getFleetOutputName returns the output name that the fleet-server input of the policy uses
func getFleetOutputName(p *model.Policy) (string, bool) {
Contributor Author

@juliaElastic I've added this method so we use the use_output attribute of the fleet-server input to get the output name directly, instead of scanning for the first elasticsearch-type output.
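
As a rough, hypothetical sketch of the idea (the real getFleetOutputName works against model.Policy, not generic maps): find the fleet-server input and return its use_output value, rather than falling back to the first elasticsearch-type output.

```go
package main

import "fmt"

// getOutputNameFromInputs is an illustrative stand-in for getFleetOutputName:
// it returns the use_output value of the fleet-server input, if one is set.
func getOutputNameFromInputs(inputs []map[string]any) (string, bool) {
	for _, input := range inputs {
		if t, _ := input["type"].(string); t != "fleet-server" {
			continue
		}
		if name, ok := input["use_output"].(string); ok && name != "" {
			return name, true
		}
	}
	return "", false
}

func main() {
	inputs := []map[string]any{
		{"type": "fleet-server", "use_output": "default"},
	}
	fmt.Println(getOutputNameFromInputs(inputs)) // default true
}
```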

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

3 New issues
0 Security Hotspots
69.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@michel-laterman michel-laterman merged commit fe7955b into elastic:main Apr 15, 2024
8 checks passed
@michel-laterman michel-laterman deleted the agent-output-bootstrap branch April 15, 2024 20:26
@lucabelluccini
Contributor

lucabelluccini commented Apr 15, 2024

Hello!

  • Would it be possible to cover/describe the behavior of the ES hosts for the Fleet Server integration in the documentation, especially if there is some subtle consequence on first enrollment?
  • Can we make sure the EA diagnostics reflect the real hosts being used?
  • Possibly, describe the new behavior in the known issue https://support.elastic.dev/knowledge/view/f69b6e46

@michel-laterman
Contributor Author

@cmacknz I created #3464 to handle output bootstrap tracking separately, to limit how large this PR grows. Should I make another issue in fleet-server to discuss/track my findings above?

The main finding is that the elastic-agent will never replace an attribute that was present during enrolment with a value from a policy when sending config to fleet-server.
As a result we still are going to have the existing edge cases around restarts

@lucabelluccini I can confirm that the fleet-server.yml will be accurate in collected diagnostics.
I'll add a section that outlines the expected behaviour for 8.14+ in our support docs.
I'll also work with our docs team to get our public docs updated as well.

@greicefaustino

Hey team, in which version will we have this fixed? Is there a tentative release version? Are there other dependencies to be merged?

@cmacknz
Member

cmacknz commented Apr 16, 2024

8.14.0 would be the first version with the fix.

michel-laterman added a commit that referenced this pull request Apr 24, 2024
michel-laterman added a commit that referenced this pull request Apr 24, 2024
mergify bot pushed a commit that referenced this pull request Apr 24, 2024
This reverts commit fe7955b.

(cherry picked from commit 6768f25)
@lucabelluccini
Contributor

lucabelluccini commented Apr 24, 2024

PLEASE NOTE: THIS FIX HAS BEEN REVERTED IN #3496
KB updated to reflect the old behavior https://support.elastic.co/knowledge/f69b6e46

michel-laterman added a commit that referenced this pull request Apr 24, 2024
This reverts commit fe7955b.

(cherry picked from commit 6768f25)

Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Successfully merging this pull request may close these issues.

Fleet Server configuration does not contain all the hosts available in the Elasticsearch cluster