Failed to heartbeat controller #2046

Closed
matt2e opened this issue Jul 11, 2024 · 5 comments

@matt2e
Collaborator

matt2e commented Jul 11, 2024

failed to heartbeat controller: duplicate key value violates unique constraint "controller_endpoint_not_dead_idx": conflict

Might be related to this one: #2045

@github-actions github-actions bot added the triage Issue needs triaging label Jul 11, 2024
@ftl-robot ftl-robot mentioned this issue Jul 11, 2024
@matt2e matt2e self-assigned this Jul 11, 2024
@github-actions github-actions bot removed the triage Issue needs triaging label Jul 11, 2024
@matt2e
Collaborator Author

matt2e commented Jul 11, 2024

Assigned it to me because I think the next step is just to see whether other fixes (like the controller not panicking when it gets an error back from calling a verb) make this go away.

@matt2e
Collaborator Author

matt2e commented Jul 12, 2024

Haven't seen this error the past few times we restarted the controllers, closing for now.

@matt2e matt2e removed their assignment Jul 15, 2024
@matt2e matt2e added the triage Issue needs triaging label Jul 15, 2024
@matt2e
Collaborator Author

matt2e commented Jul 15, 2024

This might be back...

@wesbillman wesbillman added next Work that will be be picked up next P2 labels Jul 15, 2024
@github-actions github-actions bot removed the triage Issue needs triaging label Jul 15, 2024
@wesbillman
Member

Once we have a k8s env set up for local debugging, we should dig into this again.

@matt2e
Collaborator Author

matt2e commented Jul 16, 2024

Looks like this is the cause:

  • The controller is alive.
  • We take down the controller.
  • We bring up another controller, which ends up with the same endpoint as the original controller.
  • The new controller tries to heartbeat the db.
    • This fails because there is already a controller row with state = 'live', a matching endpoint, and a different controller key.
  • A controller runs the reapStaleControllers job, which finds live controllers in the db that haven't been seen in the past 10s. The first time this runs we may still be inside the 10s window, but it shouldn't take long for one of these jobs to find the old controller's row and mark it as dead.
  • The new controller keeps trying to heartbeat, and eventually succeeds in upserting its row into the db.

You can repro locally by doing this:

When I did this, the first heartbeat failed like in the issue description, then the stale controller was reaped, then the next controller heartbeat worked. Seems like everything recovered fine.
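
For anyone who wants to see the sequence in isolation, here is a minimal sketch of it. This is not the actual schema or controller code: the table layout, the column names, the definition of the index (assumed here to be a unique index on endpoint covering only rows that are not dead), and the function names `setup_db`, `heartbeat` and `reap_stale_controllers` are all assumptions made for illustration. It uses an in-memory SQLite database only so it is self-contained.

```python
import sqlite3


def setup_db():
    """Create a hypothetical controller table (schema is an assumption)."""
    db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
    db.execute(
        """CREATE TABLE controller (
               key       TEXT PRIMARY KEY,
               endpoint  TEXT NOT NULL,
               state     TEXT NOT NULL,
               last_seen REAL NOT NULL
           )"""
    )
    # Assumed shape of the constraint: at most one non-dead row per endpoint.
    db.execute(
        """CREATE UNIQUE INDEX controller_endpoint_not_dead_idx
               ON controller (endpoint) WHERE state <> 'dead'"""
    )
    return db


def heartbeat(db, key, endpoint, now):
    """Upsert this controller's row; return False on a constraint conflict."""
    try:
        cur = db.execute(
            "UPDATE controller SET endpoint = ?, state = 'live', last_seen = ? "
            "WHERE key = ?",
            (endpoint, now, key),
        )
        if cur.rowcount == 0:
            # No row for this key yet, so insert one.
            db.execute(
                "INSERT INTO controller (key, endpoint, state, last_seen) "
                "VALUES (?, ?, 'live', ?)",
                (key, endpoint, now),
            )
        return True
    except sqlite3.IntegrityError:
        # Another non-dead row already owns this endpoint.
        return False


def reap_stale_controllers(db, now, timeout=10.0):
    """Mark live controllers not seen within `timeout` seconds as dead."""
    cur = db.execute(
        "UPDATE controller SET state = 'dead' "
        "WHERE state = 'live' AND last_seen < ?",
        (now - timeout,),
    )
    return cur.rowcount


if __name__ == "__main__":
    db = setup_db()
    endpoint = "http://controller.example:8892"  # made-up endpoint
    heartbeat(db, "old-controller", endpoint, now=0.0)
    # The old controller is taken down; a new one starts on the same endpoint.
    print(heartbeat(db, "new-controller", endpoint, now=5.0))   # conflict
    print(reap_stale_controllers(db, now=5.0))    # still inside the 10s window
    print(reap_stale_controllers(db, now=11.0))   # old row is marked dead
    print(heartbeat(db, "new-controller", endpoint, now=12.0))  # recovers
```

Under these assumptions the failure is transient by construction: the new controller's row can only be inserted once the old row is no longer counted by the index, which happens as soon as a reap runs after the 10s window.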

@matt2e matt2e closed this as completed Jul 16, 2024
@matt2e matt2e self-assigned this Jul 16, 2024
@github-actions github-actions bot removed the next Work that will be be picked up next label Jul 16, 2024