[BUG] Fleet not coming back after reboot #23

Closed
peasead opened this issue Apr 3, 2023 · 10 comments · Fixed by #30

@peasead
Owner

peasead commented Apr 3, 2023

Describe the bug
When rebooting the VM that the Fleet container is running on, the Fleet server isn't coming back online. It is unclear what the actual problem is right now.

./elastic-container.sh status
NAME                IMAGE                                                 COMMAND                  SERVICE             CREATED             STATUS                    PORTS
ecp-elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:8.7.0   "/bin/tini -- /usr/l…"   elasticsearch       3 days ago          Up 17 minutes (healthy)   0.0.0.0:9200->9200/tcp, 9300/tcp
ecp-fleet-server    docker.elastic.co/beats/elastic-agent:8.7.0           "/usr/bin/tini -- /u…"   fleet-server        3 days ago          Up 17 minutes             0.0.0.0:8220->8220/tcp
ecp-kibana          docker.elastic.co/kibana/kibana:8.7.0                 "/bin/tini -- /usr/l…"   kibana              3 days ago          Up 17 minutes (healthy)   0.0.0.0:5601->5601/tcp
curl -vvv -k https://localhost:8220
*   Trying 127.0.0.1:8220...
* Connected to localhost (127.0.0.1) port 8220 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
* [CONN-0-0][CF-SSL] (304) (OUT), TLS handshake, Client hello (1):
* LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:8220
* Closing connection 0
curl: (35) LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:8220
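
The check above can be wrapped in a small helper so the failure mode is named rather than read off the verbose output. This is a minimal sketch, not part of the project: `FLEET_URL`, `describe_curl_exit`, and `probe_fleet` are hypothetical names, and the mapping relies only on curl's documented exit codes (35 is the SSL connect error seen above).

```shell
#!/bin/sh
# Sketch: classify the result of probing Fleet Server's HTTPS port.
# FLEET_URL is an illustrative variable; it defaults to the address used in this issue.
FLEET_URL="${FLEET_URL:-https://localhost:8220}"

# Map a curl exit code to a short description.
# The codes below are curl's documented exit statuses.
describe_curl_exit() {
    case "$1" in
        0)  echo "ok: the server answered over TLS" ;;
        7)  echo "down: could not connect to the port" ;;
        28) echo "timeout: no response within the time limit" ;;
        35) echo "tls-error: TCP connected but the TLS handshake failed" ;;
        *)  echo "unknown: curl exited with status $1" ;;
    esac
}

# Probe the URL once and print the classification.
# -k skips certificate verification, as in the original command.
probe_fleet() {
    curl -k -s -o /dev/null --max-time 10 "$FLEET_URL"
    describe_curl_exit "$?"
}
```

With the failure shown above (curl exit code 35), `probe_fleet` would print the `tls-error` line: the port accepts the TCP connection, but nothing completes the handshake.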

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the ECP with ./elastic-container.sh start
  2. Once completed, reboot the VM that the containers are running on
  3. Restart the ECP containers with ./elastic-container.sh restart
  4. Fleet is unavailable in Kibana and the error above is present

Expected behavior
Fleet should be operational after a reboot.

Desktop:

  • OS: macOS Ventura 13.3, also reported on Ubuntu
  • Browser: Chrome, but the error is also evident with cURL, so this is not a browser issue
  • Version: 8.7.0

Additional context
Going to try a downgrade to 8.6 and then 8.5 to see whether this is specific to 8.7 or something else.

@peasead peasead added the bug Something isn't working label Apr 3, 2023
@peasead peasead self-assigned this Apr 3, 2023
@peasead
Owner Author

peasead commented Apr 3, 2023

I think the issue occurs when you reboot the host VM without stopping the containers first; something then goes wrong with TLS.

Test:

  • Installed using version 8.6.2 (likely the same result for 8.7.0 based on reporting)
  • Ran ./elastic-container.sh start, verified everything worked
  • Ran ./elastic-container.sh stop, rebooted the host
  • When the host came back up, ran ./elastic-container.sh restart, verified everything worked
  • Rebooted the host without running ./elastic-container.sh stop first
  • Fleet server has the TLS errors

@peasead
Owner Author

peasead commented Apr 4, 2023

As a workaround, you can run ./elastic-container.sh stop before rebooting the host that is running the stack.

We're working on a better solution, but this workaround should help while we work this out.
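
The workaround can be scripted so the stop step is never skipped. This is a minimal sketch under stated assumptions: `ECP_SCRIPT`, `REBOOT_CMD`, and `safe_reboot` are hypothetical names, not part of the project, and the default `sudo reboot` may need adjusting for your host.

```shell
#!/bin/sh
# Sketch: stop the stack cleanly, then reboot the host.
# ECP_SCRIPT and REBOOT_CMD are illustrative variables, not part of the project.
ECP_SCRIPT="${ECP_SCRIPT:-./elastic-container.sh}"
REBOOT_CMD="${REBOOT_CMD:-sudo reboot}"

safe_reboot() {
    # Stop the containers first; skip the reboot if the stop fails
    if ! "$ECP_SCRIPT" stop; then
        echo "stack did not stop cleanly; not rebooting" >&2
        return 1
    fi
    # Unquoted on purpose so that "sudo reboot" splits into separate words
    $REBOOT_CMD
}
```

Run `safe_reboot` from the directory that contains elastic-container.sh, then run ./elastic-container.sh restart once the host is back up.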

@rasta-mouse

rasta-mouse commented Apr 5, 2023

I rolled back to d85df01 for 8.6.0 and got the same issue there as well.

ubuntu@elk:~$ docker ps
CONTAINER ID   IMAGE                                                 COMMAND                  CREATED        STATUS                   PORTS                                                 NAMES
42827144627a   docker.elastic.co/beats/elastic-agent:8.6.0           "/usr/bin/tini -- /u…"   17 hours ago   Up 7 minutes             0.0.0.0:8220->8220/tcp, :::8220->8220/tcp             ecp-fleet-server
2e6a2519746a   docker.elastic.co/kibana/kibana:8.6.0                 "/bin/tini -- /usr/l…"   17 hours ago   Up 6 minutes (healthy)   0.0.0.0:5601->5601/tcp, :::5601->5601/tcp             ecp-kibana
aadcb438ea15   docker.elastic.co/elasticsearch/elasticsearch:8.6.0   "/bin/tini -- /usr/l…"   17 hours ago   Up 7 minutes (healthy)   0.0.0.0:9200->9200/tcp, :::9200->9200/tcp, 9300/tcp   ecp-elasticsearch

ubuntu@elk:~$ curl -vvv -k https://localhost:8220
*   Trying 127.0.0.1:8220...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8220 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:8220 
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:8220

@rasta-mouse

Commit 7f61c3a for 8.5.0 seems to work fine.

@peasead
Owner Author

peasead commented Apr 5, 2023

Thanks for testing the other versions and raising this issue, Rasta.

Our suspicion is that there is a state file somewhere that isn't being gracefully managed when you reboot the host without stopping the stack first.

We're unsure if this is an ECP issue or an issue with the Fleet container.

Next steps will be to spin up an Elastic Stack using the default config from Elastic, test the scenarios, and either make an adjustment to how the state file is managed (if that's the case) or file a bug with the Fleet team.

We'll try to do these tests this weekend. Until then, while not ideal, if you run the stop command before rebooting, that seems to prevent this issue.

@peasead
Owner Author

peasead commented Jun 12, 2023

This is possibly an upstream issue. It has been reproduced and is being tracked.

elastic/fleet-server#2431

@peasead
Owner Author

peasead commented Jul 7, 2023

Looks like this will be fixed in 8.9.

elastic/fleet-server#2431 (comment)

@rasta-mouse

Awesome. I'm happy to close this issue if you are?

@peasead
Owner Author

peasead commented Jul 10, 2023

I'd like to keep it open so I remember to test and bump to 8.9 😅

@peasead
Owner Author

peasead commented Jul 26, 2023

Verified with 8.9.0 in the project that the Fleet server comes back online. Bumping version in main.
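
For anyone checking an existing deployment, the image tag can be compared against the release that carries the fix. This is a rough sketch: `FIXED_IN`, `image_tag`, `version_ge`, and `has_reboot_fix` are hypothetical names, the 8.9.0 threshold is taken from this thread, and the comparison assumes plain MAJOR.MINOR.PATCH tags (no suffixes such as -SNAPSHOT).

```shell
#!/bin/sh
# Sketch: decide whether an image tag is at or above the release reported to fix this issue.
# FIXED_IN comes from this thread; all function names here are illustrative.
FIXED_IN="8.9.0"

# Print the tag portion of an image reference, e.g. "repo/name:8.7.0" -> "8.7.0".
image_tag() {
    printf '%s\n' "${1##*:}"
}

# Return 0 when version $1 >= version $2. Both must be plain MAJOR.MINOR.PATCH.
version_ge() {
    a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%.*}; a_patch=${a_rest#*.}
    b_major=${2%%.*}; b_rest=${2#*.}; b_minor=${b_rest%%.*}; b_patch=${b_rest#*.}
    # Compare numerically, most significant field first
    if [ "$a_major" -gt "$b_major" ]; then return 0; fi
    if [ "$a_major" -lt "$b_major" ]; then return 1; fi
    if [ "$a_minor" -gt "$b_minor" ]; then return 0; fi
    if [ "$a_minor" -lt "$b_minor" ]; then return 1; fi
    if [ "$a_patch" -ge "$b_patch" ]; then return 0; fi
    return 1
}

# Return 0 when the image's tag is FIXED_IN or newer.
has_reboot_fix() {
    version_ge "$(image_tag "$1")" "$FIXED_IN"
}
```

For example, `has_reboot_fix docker.elastic.co/beats/elastic-agent:8.7.0` returns non-zero, while the same image at 8.9.0 returns zero. Note that 8.5.0 was reported to work in this thread but would still be flagged, since the check only looks for the release carrying the fix.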
