Flush.agents_autodelete deletes active agents #3424

srkoster · 2025-01-23T16:53:59Z

What happened?

I'm running CrowdSec with the crowdsec Helm chart on Kubernetes, using PostgreSQL as my database. My agents use an auto-registration token to register themselves with the LAPI, so the list of agents grows over time and I would like to clean up the unused agent entries.

I can do this manually with the CLI command cscli machines prune, which works fine. Things got strange when I enabled agents_autodelete in my db_config:

db_config:
  type:     postgresql
  ...
  flush:
    agents_autodelete:
      login_password: 30m

After about 1 hour, all my agent pods were in a crash loop with the message "unable to start crowdsec routines: authenticate watcher (name of pod): API error: ent: machine not found"

I found out that all my machines had been deleted from the LAPI (confirmed both through cscli machines list and by looking in the database).

What did you expect to happen?

Only agents that had a latest heartbeat <= now - 30 minutes should be deleted.
Active agents should be kept in the machines list of the LAPI
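
The expected rule can be written down as a small predicate. This is a minimal Python sketch, not CrowdSec's implementation; the function name flush_stale_agents and the dict layout are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta, timezone


def flush_stale_agents(machines, now, max_age=timedelta(minutes=30)):
    """Keep only machines whose last heartbeat is newer than now - max_age.

    `machines` maps a machine name to its last heartbeat (hypothetical layout).
    """
    cutoff = now - max_age
    # A heartbeat at or before the cutoff counts as stale and is dropped.
    return {name: hb for name, hb in machines.items() if hb > cutoff}


now = datetime(2025, 1, 23, 17, 0, tzinfo=timezone.utc)
machines = {
    "active-agent": now - timedelta(minutes=1),  # heartbeat one minute ago
    "stale-agent": now - timedelta(hours=2),     # no heartbeat for two hours
}
remaining = flush_stale_agents(machines, now)
print(sorted(remaining))
```

Under this rule, an agent that keeps sending heartbeats should never be removed, however long ago it first registered.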

How can we reproduce it (as minimally and precisely as possible)?

On version 1.6.4, set the db_config as indicated and wait for 1 hour.

Anything else we need to know?

While analyzing my PostgreSQL database logs, I found the problem. Every now and then, CrowdSec issued a query similar to the one below. It updates the machines table for an agent and also updates the last_heartbeat column. However, it sets last_heartbeat to the timestamp at which the agent was first seen (retrieved from the metrics). If the flush operation runs after this query and before the agent's next heartbeat update, the agent is wrongly deleted.

The logic for this last_heartbeat update lives in MachineUpdateBaseMetrics, which I believe shouldn't update the last_heartbeat column at all.
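
The sequence described here can be modeled in a few lines. The sketch below is a simplified Python model under stated assumptions, not CrowdSec's code: the Machine class, its version field, and the functions update_base_metrics_buggy and update_base_metrics_fixed are hypothetical stand-ins for how MachineUpdateBaseMetrics behaves as reported and as proposed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=30)  # mirrors the 30m value from the config above


@dataclass
class Machine:
    name: str
    last_heartbeat: datetime
    version: str = ""  # hypothetical metrics field, for illustration only


def update_base_metrics_buggy(m, version, first_seen):
    # Models the reported behavior: the metrics update also overwrites
    # last_heartbeat with the timestamp at which the agent was first seen.
    m.version = version
    m.last_heartbeat = first_seen


def update_base_metrics_fixed(m, version, first_seen):
    # Models the proposed behavior: metrics leave last_heartbeat untouched.
    m.version = version


def would_be_flushed(m, now):
    # Same rule as the expected behavior: stale if heartbeat <= now - MAX_AGE.
    return m.last_heartbeat <= now - MAX_AGE


now = datetime(2025, 1, 23, 17, 0, tzinfo=timezone.utc)
first_seen = now - timedelta(hours=1)  # the agent was first seen an hour ago

buggy = Machine("agent-a", last_heartbeat=now - timedelta(seconds=10))
update_base_metrics_buggy(buggy, "v1.6.4", first_seen)

fixed = Machine("agent-b", last_heartbeat=now - timedelta(seconds=10))
update_base_metrics_fixed(fixed, "v1.6.4", first_seen)

print(would_be_flushed(buggy, now), would_be_flushed(fixed, now))
```

In this model both agents sent a heartbeat ten seconds ago, yet only the one whose metrics update rewound last_heartbeat falls behind the cutoff. This matches the observation that a flush running between the metrics update and the next heartbeat removes a live agent.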

Crowdsec version

$ cscli version
version: v1.6.4-523164f6
Codename: alphaga
BuildDate: 2024-11-26_13:16:01
GoVersion: 1.23.3
Platform: docker
libre2: C++
User-Agent: crowdsec/v1.6.4-523164f6-docker
Constraint_parser: >= 1.0, <= 3.0
Constraint_scenario: >= 1.0, <= 3.0
Constraint_api: v1
Constraint_acquis: >= 1.0, < 2.0
Built-in optional components: cscli_setup, datasource_appsec, datasource_cloudwatch, datasource_docker, datasource_file, datasource_http, datasource_journalctl, datasource_k8s-audit, datasource_kafka, datasource_kinesis, datasource_loki, datasource_s3, datasource_syslog, datasource_wineventlog

OS version

# On Linux:
$ cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.20.3
PRETTY_NAME="Alpine Linux v3.20"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"

$ uname -a
Linux crowdsec-lapi-59ddb58f96-dzxcr 6.6.58-talos #1 SMP Mon Oct 28 10:12:50 UTC 2024 x86_64 Linux

Enabled collections and parsers

No response

Acquisition config

No response

Config show

No response

Prometheus metrics

No response

Related custom configs versions (if applicable) : notification plugins, custom scenarios, parsers etc.

No response

github-actions · 2025-01-23T16:54:12Z

@srkoster: Thanks for opening an issue, it is currently awaiting triage.

In the meantime, you can:

Check Crowdsec Documentation to see if your issue can be self resolved.
You can also join our Discord.
Check Releases to make sure your agent is on the latest version.

Details

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.

LaurenceJJones · 2025-01-23T17:29:39Z

Hey, thank you for the detailed report and the pull request. I tagged it as accepted and added it to the next release, as its priority is high.

LaurenceJJones · 2025-01-23T17:56:51Z

fixed by: #3425

Note that the fix will be published in 1.6.5. For the time being, you can point your image tag to :dev, which will be built within the next hour and will include your fix.

srkoster · 2025-01-23T17:58:28Z

Sweet, thanks for the quick reaction

srkoster added the kind/bug Something isn't working label Jan 23, 2025

github-actions bot added version/1.6.4 needs/triage labels Jan 23, 2025

srkoster mentioned this issue Jan 23, 2025

Removed updating of machine last_heartbeat based on baseMetrics in MachineUpdateBaseMetrics #3425

Merged

LaurenceJJones added triage/accepted and removed needs/triage labels Jan 23, 2025

LaurenceJJones closed this as completed Jan 23, 2025

srkoster mentioned this issue Jan 23, 2025

agents re-register when restarting daemonset crowdsecurity/helm-charts#230

Closed
