V2: 8.6.0-SNAPSHOT not starting in ESS #1731
Adding Slack threads: https://elastic.slack.com/archives/C01G6A31JMD/p1668556010267939 and https://elastic.slack.com/archives/C0JFN9HJL/p1668594188294829. This should be considered a blocker for 8.6.0. |
The Slack thread mentions this is happening in the 8.5 snapshot as well, which does not contain agent V2. Have we confirmed this? If so, it will change what we need to investigate. |
Possibly this was caused by the packaged versions in the ESS release not being in sync, I am still trying to follow all of the context on this to confirm. |
@cmacknz I saw the issue as part of 8.6.0, but someone else said they had the issue on 8.5.1; however, I did not see that myself. Another thread follows a release sync fix for 8.6.0. At least from my side, I would have to wait for a new 8.6 snapshot to confirm whether the release sync indeed fixes it or if something else is also broken. |
I poked around on that cloud deployment today: https://admin.staging.foundit.no/deployments/54481a66f7994042aca321c07461871f I pulled down the same docker image that was used by that deployment locally and was able to enroll the agent with the fleet server against the 8.6.0-SNAPSHOT cloud deployment. All the beats processes seemed to be running ok. How is the agent with the fleet-server started/set up in the cloud? |
@AlexP-Elastic is there someone on the cloud side who can help us debug this? It seems to only be reproducible on cloud but not locally. |
Already shared internal docs with @aleksmaus about how to ssh into testing environments or spin up an ECE deployment for testing. |
I don't have SSH access to the cloud env yet, still waiting. So I tried to reproduce the issue with the ecetest environment (thanks to @simitt for all the help working through the caveats of getting it running). Kibana now has all kinds of restrictions/validations: you can't spin up the fleet server with an "http" URL, can't modify the existing fleet policy to add an APM configuration, for example, etc., making it difficult to adjust the policy settings and check the results. A few observations so far:
The fleet server agent is shown as healthy in Kibana; the components initially report a "Configuring" state.
After a few minutes everything is reported as healthy.
But fleet and APM are shown as unhealthy in the deployment admin web UI, and the docker container is killed after some period of this.
Since all the monitoring settings are missing, it falls back to the defaults, which are all set to "true" (a sketch of those defaults is below).
Changing the logging level is propagated to the agent and causes the agent to relaunch with enroll.
Will keep digging, but if anyone on the agent team involved in the fleet bootstrap, running with the "container" command, and monitoring has ideas, let me know. |
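For reference, a rough sketch of what those implicit monitoring defaults amount to in elastic-agent.yml (the field names follow the standard agent configuration; the exact defaults baked into the cloud image are an assumption here):

```yaml
# Hedged sketch: with no monitoring section in the policy/config, the agent
# behaves roughly as if this were set, i.e. self-monitoring fully enabled.
agent.monitoring:
  enabled: true   # assumed default
  logs: true      # collect the agent/beats logs
  metrics: true   # collect the agent/beats metrics
  use_output: default
```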
@simitt you mentioned before that the logs location could affect the reported health. |
Some more observations.
The fleet-server has a proper config it seems:
The debug log after restart is healthy:
To be continued on Monday. |
I noticed while updating APM Server to support V2 that the fleet-server |
Should that be fixed with #1745?
In the past, it was important that the Elastic Agent would not shut down if there was an issue with Fleet Server while a healthy APM Server was running. This was important so as not to destabilize GA versions of APM Server while EA and Fleet Server were in an early stage (see related elastic/beats#28209). AFAIR, the healthcheck used to be based on the response when calling it. Since then, the Elastic Agent team has worked on a new health status design for the EA. The Integrations Server on cloud was introduced, and EA & Fleet Server are mature. I am unfortunately not in the picture of the current state of the cloud healthcheck and what might have been changed.
I don't think it will impact the health, but it should be tested that monitoring still works (both the operational, cloud-regional metrics collection and the customer-facing monitoring when enabled). |
This is an interesting clue. Indeed the health status is now all lower case "healthy", whereas in 8.5 it was "HEALTHY". @simitt does it make a difference when the health is checked? |
I honestly don't know, but as mentioned above (our comments just overlapped) I am not certain |
Looks like the |
When |
It looks like we may have to update all of the cloud monitoring configurations for 8.6 to have it work with the I don't see the |
We definitely use it (I think this is the reason for it dying):

```scala
private def checkForManagedApmServerMode(): Future[ApmInstanceMode] = {
  val request = {
    val r = httpBundle.request
    r.withUri(r.uri.withPath(Uri.Path("/processes")))
  }
  val futureResult = httpBundle.client.httpRequest(request) map { result =>
    import result.response
    traceRequestResponse(response, request, response.status.isSuccess)
    if (response.status.isSuccess) {
      val responseBody = result.entity.asString
      val tryParse = Try(mapper.readRequiredObjectNode(responseBody))
      tryParse match {
        case Failure(exception) => {
          logger.warn(s"Invalid JSON returned from fleet processes API. Received [$responseBody]", exception)
          None
        }
        case Success(node) => {
          node
            .path("processes")
            .elements()
            .asScala
            .find(n => n.path("id").asText().startsWith("apm-server"))
            .map(_ => ApmInstanceMode.Managed)
        }
      }
    } else None
  }
  futureResult
    .recover {
      case e =>
        logger.warn(s"Managed APM check request to [${request.uri}] failed", e)
        None
    }
    .map(mode => mode.getOrElse(ApmInstanceMode.Standalone))
}
```

We also use it for metrics collection:

```scala
val fleetConfigNode = yamlMapper.readValue(
  s"""
     |hosts:
     | - "${getAgentMonitoringConnectionString}"
     |metricsets:
     | - json
     |module: http
     |namespace: $processName
     |path: /processes/$processName$pathSuffix
```
|
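For context, the check above only depends on the general shape of the agent's /processes response: a top-level "processes" array whose entries carry an "id" field (other per-process fields are omitted here). A hedged illustration; the concrete id values below are made up, not taken from the deployment:

```json
{
  "processes": [
    { "id": "fleet-server-default" },
    { "id": "apm-server-default" }
  ]
}
```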
Re-opening this until we can confirm we have actually solved the problem. One thing I am noticing is that the ID keys are different between 8.5 and 8.6, and cloud depends on them being stable. This is 8.5 with IDs:
This is 8.6 right now with IDs:
|
Including fleet server gives me |
Kicked off a new 8.6 snapshot build to confirm we have fixed this. I will request a new BC once I confirm this fixes the problem, ideally I want to avoid requesting multiple new BCs if we need to iterate on the fix. |
Snapshot complete; I can see fleet server started and the integrations server is healthy. |
@cmacknz did you observe the APM Server being able to receive requests? I've just created a deployment with 8.6.0-45476311-SNAPSHOT, and although Integrations Server claims to be healthy, the APM endpoint is greyed out in the console and the proxy is rejecting requests. The logs indicate that APM Server is running. Our (local, Docker-based) system tests are all passing, so it seems likely that this is related to orchestration - perhaps some residual monitoring/healthcheck issues? |
No, I selfishly only checked fleet-server. If this is related to the same problem, it is likely the key we are using for the processes endpoint differing from what the cloud monitoring expects. My day is over, but if you can get the output of the agent's processes endpoint that would help. Per #1773 (comment) it needs to have the ID cloud expects. The fix would be to adjust the name to be what cloud expects in the processes handler, which should be trivial. That code is here:
If you can confirm the key being different is the problem just create a PR, that will be the fastest way to fix this. If the problem is something else we have more investigating to do. It is possible the APM health is tied into the metrics reporting we haven't fixed yet. |
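A quick way to confirm which variant of the problem this is, assuming the agent's monitoring HTTP endpoint has been enabled and is reachable on the default local port (both host and port are assumptions, adjust as needed):

```console
# Print only the process IDs the agent reports, to compare against the IDs
# the cloud healthcheck and metrics config expect (port 6791 is assumed).
curl -s http://localhost:6791/processes | jq '.processes[].id'
```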
Running locally, I had to mount an elastic-agent.yml into the container with:

```yaml
agent.monitoring:
  http:
    enabled: true
```

to expose the monitoring HTTP endpoint. Anyway, the results are:
|
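For anyone trying to reproduce this locally, a rough sketch of the kind of invocation that mounts such a config, assuming the stock elastic-agent image and its usual config path (the image tag and paths are illustrative, not the exact command used here):

```console
# Mount a local elastic-agent.yml over the image's default config so the agent
# starts with agent.monitoring.http.enabled: true (paths and tag are assumptions).
docker run --rm \
  -v "$PWD/elastic-agent.yml:/usr/share/elastic-agent/elastic-agent.yml" \
  docker.elastic.co/beats/elastic-agent:8.6.0-SNAPSHOT
```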
I don't have enough context to know what the right fix is here in order to open a PR. I've opened https://github.com/elastic/ingest-dev/issues/1418 |
Version: `8.6.0-SNAPSHOT` (Nov 16th 2022)

Problem: The integrations server does not start and therefore neither Fleet Server nor APM Server are available. The logs indicate that Elastic Agent tries to connect to `localhost`.

Logs:

```
Failed to connect to backoff(elasticsearch(http://localhost:9200)): Get "http://localhost:9200": dial tcp [::1]:9200: connect: cannot assign requested address
```

Reproduce: Spin up a `8.6.0-SNAPSHOT` deployment in the CFT on ESS