Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] Allow bundled installs to occur even if EPR is unreachable #125127

Conversation

kpollich
Copy link
Member

@kpollich kpollich commented Feb 9, 2022

Summary

Fixes #125097

Ref #74614

Allows package installations to occur even if the registry is unreachable by falling back to bundled packages on disk if they exist. Previously, we relied on the registry to determine the latest version of a package. Now, if we don't receive a response from the registry, we'll install the version on disk as if it's the latest version and prevent the registry error from stopping the installation process.

To test

Download a fleet_server zip from EPR into your fleet/server/bundled_packages folder and add a fleet_server entry to your kibana.dev.yml file. Set xpack.fleet.registryUrl to an inaccessible network location. Startup Kibana and observe that the install succeeds despite the network error.

# Not an accessible EPR instance - for testing
xpack.fleet.registryUrl: http://localhost:8080

# Configure the bundled package so it's installed during setup
xpack.fleet.packages:
  - name: fleet_server
    version: latest

@kpollich kpollich added release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team auto-backport Deprecated - use backport:version if exact versions are needed v8.1.0 v8.2.0 labels Feb 9, 2022
@kpollich kpollich requested a review from a team as a code owner February 9, 2022 18:10
@kpollich kpollich self-assigned this Feb 9, 2022
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@nchaulet nchaulet self-requested a review February 9, 2022 19:13
@nchaulet
Copy link
Member

nchaulet commented Feb 9, 2022

@kpollich Looks like this PR only cover the installation of the package but it still fail while trying to create a agent policy using that package

xpack.fleet.agentPolicies:
  # Cloud Agent policy
  - name: Elastic Cloud agent policy 00123123
    id: 'test123'
    description: Default agent policy for agents hosted on Elastic Cloud
    is_default: false
    is_default_fleet_server: false
    is_managed: true
    namespace: default
    monitoring_enabled: []
    package_policies:
      - name: fleet_server123456789
        id: elastic-fleet-server
        package:
          name: fleet_server
        inputs:
          - type: fleet-server
            keep_enabled: true
            vars:
              - name: host
                value: 127.0.0.1
                frozen: true
  

with this in my config I get the following error
Screen Shot 2022-02-09 at 2 24 49 PM

I looks at the code and we fetch the last version of the package in getPackageInfo

@joshdover
Copy link
Contributor

Similar to the issue that @nchaulet identified, I am encountering a similar problem (likely same root cause) when attempting to install a package via the /api/fleet/epm/packages/_bulk API (see below).

Also was there a reason we decided not to put this logic in the lower-level installPackage function so that it applies to all package installs rather than the bulkInstallPackages function?

POST http://localhost:5601/api/fleet/epm/packages/_bulk HTTP/1.1
User-Agent: vscode-restclient
content-type: application/json
kbn-xsrf: x
accept-encoding: gzip, deflate
content-length: 36

{
  "packages": [
    "endpoint-1.4.1"
  ]
}


HTTP/1.1 200 OK
x-content-type-options: nosniff
referrer-policy: no-referrer-when-downgrade
kbn-name: Joshs-MacBook-Pro.local
kbn-license-sig: eeb40cb14be51e184b25f7f502a421b572df03319f9f6ebdc53077b435c47b90
content-type: application/json; charset=utf-8
cache-control: private, no-cache, no-store, must-revalidate
content-length: 502
Date: Thu, 10 Feb 2022 12:05:59 GMT
Connection: close

{
  "items": [
    {
      "name": "endpoint-1.4.1",
      "statusCode": 502,
      "error": "Error connecting to package registry: request to http://locahost:9000/search?package=endpoint-1.4.1&experimental=true&kibana.version=8.1.0 failed, reason: getaddrinfo ENOTFOUND locahost"
    }
  ],
  "response": [
    {
      "name": "endpoint-1.4.1",
      "statusCode": 502,
      "error": "Error connecting to package registry: request to http://locahost:9000/search?package=endpoint-1.4.1&experimental=true&kibana.version=8.1.0 failed, reason: getaddrinfo ENOTFOUND locahost"
    }
  ]
}

@kpollich
Copy link
Member Author

Also was there a reason we decided not to put this logic in the lower-level installPackage function so that it applies to all package installs rather than the bulkInstallPackages function?

My thinking here was to limit the exposure of bundled packages to preconfiguration, but I'd be happy to take a pass in this PR of moving the logic down into installPackage instead. It might make a few things easier.

The current issue I'm running into is tracking down the many, many places we call functions like Registry.fetchInfo or getPackageInfo during preconfiguration and package installation. We frequently re-query these values since we can reliably pull them from the in-memory cache, so it seems like we never did any kind of memoization around the package info in favor of simply re-calling the fetcher functions to pull package info from Fleet's cache.

The other issue that's surfaced as I've tried to update all preconfiguration/installation paths to account for bundled packages is our policy compilation logic, which relies on a RegistryPackage object e.g.

public async _compilePackagePolicyInputs(
registryPkgInfo: RegistryPackage,
pkgInfo: PackageInfo,
vars: PackagePolicy['vars'],
inputs: PackagePolicyInput[]
): Promise<PackagePolicyInput[]> {
const inputsPromises = inputs.map(async (input) => {
const compiledInput = await _compilePackagePolicyInput(registryPkgInfo, pkgInfo, vars, input);
const compiledStreams = await _compilePackageStreams(registryPkgInfo, pkgInfo, vars, input);
return {
...input,
compiled_input: compiledInput,
streams: compiledStreams,
};
});
return Promise.all(inputsPromises);
}

Hopefully updating type definitions w/ a union and passing in a bundled package object in some way here will be enough to get this functional, but I am a bit concerned whether we're opening a can of worms here. I'll update here with progress.

@kpollich
Copy link
Member Author

Okay I think I've roughly stiched enough things together to make the creation of preconfigured agent policies succeed when EPR is unreachable in aba45c7. The code needs some clean up and I had to touch a lot of one-off function calls to add a useBundledPackages option, which feels like a substantial code smell.

I'm wondering if we should implement a rule that we'll always use bundled packages for a given package version to avoid passing this option around to the entire universe. Or, we could alternatively always attempt to fall back to bundled packages in the case of an unreachable registry.

@kpollich
Copy link
Member Author

Or, we could alternatively always attempt to fall back to bundled packages in the case of an unreachable registry.

After typing this out, this seems like a much better solution so I'm going to implement this.

@nchaulet
Copy link
Member

Okay I think I've roughly stiched enough things together to make the creation of preconfigured agent policies succeed when EPR is unreachable in aba45c7. The code needs some clean up and I had to touch a lot of one-off function calls to add a useBundledPackages option, which feels like a substantial code smell.

@kpollich Yes the useBundledPackages seems not the best solution here as it add knowledge about bundled package to a lot of part of the code where we should probably not and only use the installed package. Looking at the code most of problematic part where we call the registry we should have installed the package so we should be able to retrieve it from ES instead of the registry (all the registry fetchInfo in the package policy service)

It seems a non trivial change, but a better long term solution

@@ -162,16 +164,33 @@ class PackagePolicyService {
}
validatePackagePolicyOrThrow(packagePolicy, pkgInfo);

const registryPkgInfo = await Registry.fetchInfo(pkgInfo.name, pkgInfo.version);
const installablePackage: InstallablePackage = await Registry.fetchInfo(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the package should be installed right? can we directly fetch from ES the package instead of relying on the registry or the bundled package?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I believe you are correct. I'm not sure why, but all of the policy compilation logic expected a RegistryPackage object. I would've assumed we would be fine with PackageInfo here and so we could fetch from ES. I will take a look at this. You're right above that it's non-trivial but this is probably the "right" solution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it wound up being pretty easy to replace the InstallablePackage stuff with a PackageInfo object - just needed to make a few changes here and there to support that. We only use a few fields here, and getPackageInfo will pull back an Installation object for installed packages, so I think we've pretty easily eliminated the registry call here. See e2f2fea

@joshdover joshdover self-requested a review February 11, 2022 12:32
@botelastic botelastic bot added the Team:APM All issues that need APM UI Team support label Feb 15, 2022
@elasticmachine
Copy link
Contributor

Pinging @elastic/apm-ui (Team:apm)

@kpollich
Copy link
Member Author

Summarizing an issue with failing CI here.

The version of the docker registry in tests did not contain endpoint version 1.3.1-dev.1. We're bundling 1.4.1 with Kibana's source. Because we call Fleet's setup process before the registry container is available, our "fall back to bundled packages" logic introduced in this PR was resolving the latest version of endpoint to the bundled version of 1.4.1 Then, when the registry came online, we were resolving a latest version of 1.3.1-dev.1 during tests from the registry.

The solution here is to make sure the dockerized registry and the bundled versions are aligned.

This does bring to mind a potential issue with our fallback logic. Would it make sense to add a check so that we include installed packages in our fall back logic when we check for the latest version of a given package? We could have similar issues to this in cases where users are running a dockerized instance of the registry with outdated packages that goes down, resulting in Kibana resolving from bundled packages temporarily.

@@ -65,34 +66,82 @@ export async function fetchList(params?: SearchParams): Promise<RegistrySearchRe
return fetchUrl(url.toString()).then(JSON.parse);
}

// When `throwIfNotFound` is true or undefined, return type will never be undefined.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshdover I had to refactor these overrides you introduced in #125525 because I couldn't actually pass a boolean value to this function in fetchFindLatestPackageWithFallbackToBundled below. Navigating overrides and conditional types was proving very difficult (and the number of typescript docs/stack overflow browser tabs I was reaching was proving unsustainable...), so I opted to pull things out into separate methods w/ a shared _fetchFindLatestPackage underneath that contains the "meat" of implementation.

Copy link
Contributor

@cauemarcondes cauemarcondes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APM changes LGTM, but it will no longer needed once this PR is merged #125724

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
fleet 1147 1151 +4
Unknown metric groups

API count

id before after diff
fleet 1263 1267 +4

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @kpollich

@kpollich kpollich merged commit ba8c339 into elastic:main Feb 16, 2022
@kpollich kpollich deleted the 125097-prevent-registry-network-errors-blocking-fleet-bulk-install branch February 16, 2022 19:39
@kibanamachine
Copy link
Contributor

💔 Backport failed

The pull request could not be backported due to the following error:
There are no branches to backport to. Aborting.

How to fix

Re-run the backport manually:

node scripts/backport --pr 125127

Questions ?

Please refer to the Backport tool documentation

@kibanamachine
Copy link
Contributor

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 125127 or prevent reminders by adding the backport:skip label.

@kibanamachine kibanamachine added the backport missing Added to PRs automatically when the are determined to be missing a backport. label Feb 18, 2022
@kpollich kpollich added the backport:skip This commit does not require backporting label Feb 19, 2022
@kibanamachine kibanamachine removed the backport missing Added to PRs automatically when the are determined to be missing a backport. label Feb 19, 2022
kpollich added a commit to kpollich/kibana that referenced this pull request Feb 28, 2022
…lastic#125127)

* Allow bundled installs to occur even if EPR is unreachable

* Fix type errors in test

* Fix failing test

* fixup! Fix failing test

* Remove unused object in mock

* Make creation of preconfigured agent policy functional

* Always fall back to bundled packages if available

* Remove unused import

* Use packageInfo object instead of RegistryPackage where possible

* Fix type error in assets test

* Fix test timeouts

* Fix promise logic for registry fetch fallback

* Use archive package as default in create package policy

* Always install from bundled package if it exists - regardless of installation context

* Clean up + refactor a bit

* Default to cached package archive for policy updates

* Update mock in get.test.ts

* Add test for install from bundled package logic

* Delete timeout call in security solution tests

* Fix unused var in endpoint test

* Fix another unused var in endpoint test

* [Debug] Add some logging to test installation times in CI

* Revert "[Debug] Add some logging to test installation times in CI"

This reverts commit 513dcc1.

* Update docker images for registry

* Update docker image digest again

* Refactor latest package fetching to fix broken logic/tests

* Fix a bunch of type errors around renamed fetch latest package version methods

* Remove unused import

* Bump docker version to latest snapshot (again)

* Revert changes to endpoint tests

* Pass experimental flag in synthetics tests

* Fix endpoint version in fleet api integration test

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit ba8c339)

# Conflicts:
#	x-pack/plugins/fleet/server/integration_tests/docker_registry_helper.ts
#	x-pack/test/fleet_api_integration/apis/package_policy/create.ts
#	x-pack/test/fleet_api_integration/config.ts
#	x-pack/test/functional/config.js
#	x-pack/test/functional_synthetics/config.js
kpollich added a commit that referenced this pull request Mar 1, 2022
…125127) (#126523)

* Allow bundled installs to occur even if EPR is unreachable

* Fix type errors in test

* Fix failing test

* fixup! Fix failing test

* Remove unused object in mock

* Make creation of preconfigured agent policy functional

* Always fall back to bundled packages if available

* Remove unused import

* Use packageInfo object instead of RegistryPackage where possible

* Fix type error in assets test

* Fix test timeouts

* Fix promise logic for registry fetch fallback

* Use archive package as default in create package policy

* Always install from bundled package if it exists - regardless of installation context

* Clean up + refactor a bit

* Default to cached package archive for policy updates

* Update mock in get.test.ts

* Add test for install from bundled package logic

* Delete timeout call in security solution tests

* Fix unused var in endpoint test

* Fix another unused var in endpoint test

* [Debug] Add some logging to test installation times in CI

* Revert "[Debug] Add some logging to test installation times in CI"

This reverts commit 513dcc1.

* Update docker images for registry

* Update docker image digest again

* Refactor latest package fetching to fix broken logic/tests

* Fix a bunch of type errors around renamed fetch latest package version methods

* Remove unused import

* Bump docker version to latest snapshot (again)

* Revert changes to endpoint tests

* Pass experimental flag in synthetics tests

* Fix endpoint version in fleet api integration test

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit ba8c339)

# Conflicts:
#	x-pack/plugins/fleet/server/integration_tests/docker_registry_helper.ts
#	x-pack/test/fleet_api_integration/apis/package_policy/create.ts
#	x-pack/test/fleet_api_integration/config.ts
#	x-pack/test/functional/config.js
#	x-pack/test/functional_synthetics/config.js

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:APM All issues that need APM UI Team support Team:Fleet Team label for Observability Data Collection Fleet team v8.2.0
Projects
None yet
9 participants