Workers disconnect after an unknown period of time #1530

Open
smacfarlane opened this issue Sep 28, 2020 · 7 comments

@smacfarlane
Contributor

After some amount of time, workers stop responding to new jobs. This has only been observed on Windows and Kernel2 workers.
The observed behavior is that a job remains in the Dispatching state until the cfg.job_timeout period elapses, and is then cancelled.
The worker is still connected and we see its heartbeats continue. It also remains present in the metrics dashboard. Our heartbeat channel is separate from our job dispatch channel.

Currently, the remediation is to restart the builder-worker service on affected build nodes.

It appears that the zmq::ROUTER socket is no longer transmitting messages to the client. It is a known zmq behavior that if a client connects to a ROUTER socket but does not send heartbeats, the connection may time out and the server won't be able to reconnect. We suspect we need to send keepalives as described in https://zguide.zeromq.org/docs/chapter4/#Heartbeating to keep the channel alive.

An alternate implementation would be to send jobs to workers to keep them alive. The downside is that we don't know how frequently we would need to dispatch to keep them alive.

@stale

stale bot commented Sep 28, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

@stale stale bot added the Stale label Sep 28, 2021
@mwrock
Contributor

mwrock commented Dec 2, 2021

Jobsrv reports:

Dec 02 21:19:29 ip-10-0-0-100 hab[596]: builder-jobsrv.acceptance(O): [2021-12-02T21:19:29Z WARN  habitat_builder_jobsrv::server::worker_manager] Failed to dispatch job to worker 3967@ip-10-0-0-192, err=Zmq(Host unreachable)

@pozsgaic
Contributor

pozsgaic commented Apr 11, 2022

This problem can be reproduced on a Linux target as well.

  • If you stop the builder-worker immediately after it sends its heartbeat (30s period by default) and then launch a job from builder, the problem reproduces consistently. The worker was shut down cleanly with hab svc stop.

  • The job status transitions to CancelComplete after the job timeout (60 minutes) is reached.

  • The project status is set to Canceled

  • The group status is set to Canceled. It remained at 'Dispatching' until the timeout was reached.

  • One minor difference is that in this case the heartbeats stopped when we shut down the service. Despite this,
    the handling of the job was the same.

@stale stale bot removed the Stale label Apr 11, 2022
@pozsgaic
Contributor

pozsgaic commented Apr 11, 2022

  • Forcing down the builder-worker with "sudo kill -9" exhibits different behavior because the hab supervisor will restart the service when it does not get shut down cleanly.
  • The job does get submitted successfully and is in a Pending state, assigned to the original worker that we forced down.
  • The job gets dispatched correctly to the new worker that was restarted by hab supervisor and builds successfully. The worker field gets changed to the new worker that has picked up the job.

Note also that the builder database has a table 'busy_workers' that shows the active builder workers. When a worker instance goes down while in the busy state, its failure to send a heartbeat will result in this worker being removed from jobsrv and from the busy_workers table. The job will transition to a pending state where it will remain until a new worker for the target (e.g. x86_64-linux) becomes available or the job timeout (60 minutes default) is reached.
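To make the transitions described above easier to follow, here is a minimal sketch of them in plain Rust. It is not the jobsrv code: the `Job` struct, the `JobState` variants, and the `on_worker_lost` / `on_tick` methods are hypothetical names chosen for illustration, and it assumes the job timeout is measured from submission.

```rust
use std::time::{Duration, Instant};

/// Hypothetical job states, named after the ones mentioned in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Pending,
    Dispatched,
    CancelComplete,
}

/// Hypothetical, stripped-down job record (not the real jobsrv type).
#[derive(Debug, Clone)]
pub struct Job {
    pub state: JobState,
    pub worker: Option<String>,
    pub submitted_at: Instant,
}

impl Job {
    /// Called when a worker is dropped for missing heartbeats.
    /// If this job was held by that worker, it goes back to Pending
    /// so that another worker for the same target can pick it up.
    pub fn on_worker_lost(&mut self, lost: &str) {
        if self.state == JobState::Dispatched && self.worker.as_deref() == Some(lost) {
            self.state = JobState::Pending;
            self.worker = None;
        }
    }

    /// Called periodically. Assumption: the timeout is measured from
    /// submission, and any job that has not been cancelled yet is cancelled
    /// once it elapses.
    pub fn on_tick(&mut self, now: Instant, job_timeout: Duration) {
        let age = now.saturating_duration_since(self.submitted_at);
        if self.state != JobState::CancelComplete && age >= job_timeout {
            self.state = JobState::CancelComplete;
        }
    }
}
```

With a 60 minute timeout, a job whose worker disappears after 10 minutes would sit in Pending for up to another 50 minutes before being cancelled, unless a new worker shows up first.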

@pozsgaic
Contributor

  • It is not clear how long it takes before a worker stops responding to a job request. It seems to be a long time, because this issue is fairly rare and was not reproducible with sample code of a ROUTER socket server with two DEALER sockets connected. These apps did not lose connection for the many hours they ran, and this was without keep alive configured on the ROUTER end.
  • According to the zmq online docs, we want to either maintain a heartbeat over the same channel we send data on, or set the socket to keep alive when we establish our listener ROUTER socket in builder-jobsrv. We currently maintain connections to the jobsrv instances in both the heartbeat manager and the main server in builder-worker.
  • If we want to go the heartbeat route, it would be best to remove the heartbeat socket and have the heartbeats go over the job dispatch channel. Then a heartbeat timeout is detected on the job dispatch channel itself and we remove the worker. It will not appear in the worker list again until jobsrv receives a new heartbeat. We would also want to ensure the heartbeats are of lower priority if we move them to the job dispatch channel, so as not to interfere with the jobs, and we would count a job status message as a successful heartbeat and advance the timeout on reception.
  • If we want to go the keep alive route, then we would enable it with the set_tcp_keepalive call when we create the ROUTER socket.
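As a rough illustration of the heartbeat route, the bookkeeping on the jobsrv side could look something like the sketch below. Everything here is an assumption made for the example: the `Liveness` type, the `Inbound` enum, and the 30 second timeout in the usage note are hypothetical, and the sketch leaves out the zmq socket handling entirely. The point is only that any inbound message on the dispatch channel, heartbeat or job status, advances the worker's deadline.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical kinds of message arriving on the single dispatch channel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Inbound {
    Heartbeat,
    JobStatus,
}

/// Hypothetical tracker of when each worker was last heard from.
pub struct Liveness {
    timeout: Duration,
    last_seen: HashMap<String, Instant>,
}

impl Liveness {
    pub fn new(timeout: Duration) -> Self {
        Liveness { timeout, last_seen: HashMap::new() }
    }

    /// Any inbound message proves the peer is alive, so heartbeats and
    /// job status messages are treated the same way here. A worker that
    /// was removed earlier is re-added by its next message.
    pub fn observe(&mut self, worker: &str, _msg: Inbound, now: Instant) {
        self.last_seen.insert(worker.to_string(), now);
    }

    /// Remove and return (sorted) the workers that have been silent for
    /// longer than the timeout.
    pub fn reap(&mut self, now: Instant) -> Vec<String> {
        let timeout = self.timeout;
        let mut expired: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) > timeout)
            .map(|(id, _)| id.clone())
            .collect();
        expired.sort();
        for id in &expired {
            self.last_seen.remove(id);
        }
        expired
    }

    /// True while the worker is still in the worker list.
    pub fn is_known(&self, worker: &str) -> bool {
        self.last_seen.contains_key(worker)
    }
}
```

For example, with `Liveness::new(Duration::from_secs(30))`, a worker that sent only a heartbeat at t=0 is reaped at t=31, while a worker that also sent a job status at t=20 stays in the list. The timeout value is a tuning choice and is not taken from the builder configuration.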

@pozsgaic
Contributor

pozsgaic commented Apr 18, 2022

Rust does support setting up keep alive, and this would result in the ROUTER socket in builder-jobsrv continuing to test for connectivity. What I am not clear on is how we will know when the client disconnects. We want to know immediately if a client is disconnected so we can ensure it is no longer in the builder-jobsrv worker list.

With our application-layer heartbeats, which the builder-worker instance sends to the builder-jobsrv instance, the absence of heartbeats results in a disconnected state and ultimately in the worker instance being removed from the worker list. This is desirable because we will know within 1 heartbeat if we've lost a client connection.

@stale

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

@stale stale bot added the Stale label Sep 17, 2023

3 participants