Flush.agents_autodelete deletes active agents #3424
Comments
@srkoster: Thanks for opening an issue, it is currently awaiting triage. In the meantime, you can:
I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.
Hey, thank you for the detailed report and the pull request. I tagged it as accepted and added it to the next release, as its priority is high.
Fixed by: #3425. Note that the update will be published via
Sweet, thanks for the quick reaction.
What happened?
I'm running CrowdSec with the crowdsec Helm chart on Kubernetes, using PostgreSQL as my database. My agents use an auto-registration token to register themselves with the LAPI, so after a while the list of agents grows, and I would like to clean up the unused agent entries.
I can do it manually with the CLI:
cscli machines prune
which works fine. Things got strange when I activated the agents_autodelete setting in my db_config:
After about 1 hour, all my agent pods were in a crash loop with the message "unable to start crowdsec routines: authenticate watcher (name of pod): API error: ent: machine not found".
I found out that all my machines had been deleted from the LAPI (confirmed both through cscli machines list and by looking in the database).
What did you expect to happen?
Only agents that had a latest heartbeat <= now - 30 minutes should be deleted.
Active agents should be kept in the machines list of the LAPI
How can we reproduce it (as minimally and precisely as possible)?
On version 1.6.4, set the db_config as indicated and wait for 1 hour.
Anything else we need to know?
When analyzing my PostgreSQL database logs, I found the problem. Every now and then, CrowdSec issued a query similar to the one below. It updates the machines table for an agent and also updates the last_heartbeat column. However, it sets the last_heartbeat column to the timestamp at which the agent was first seen (retrieved from the metrics). If the flush operation runs after this query and before the next heartbeat update from the agent, the agent is wrongly deleted.
The logic for the last_heartbeat update originates in MachineUpdateBaseMetrics, which I believe shouldn't update the last_heartbeat column at all.
Crowdsec version
OS version
Enabled collections and parsers
No response
Acquisition config
No response
Config show
No response
Prometheus metrics
No response
Related custom configs versions (if applicable) : notification plugins, custom scenarios, parsers etc.
No response