Flush.agents_autodelete deletes activate agents #3424

Closed
srkoster opened this issue Jan 23, 2025 · 4 comments

@srkoster
Contributor

What happened?

I'm running CrowdSec with the crowdsec Helm chart on Kubernetes and am using PostgreSQL as my database. My agents use an auto-registration token to register themselves with the LAPI, so after a while the list of agents grows and I would like to clean up the unused agent entries.

I can do it manually with the CLI command cscli machines prune, which works fine. Things got strange when I activated agents_autodelete in my db_config:

db_config:
  type:     postgresql
  ...
  flush:
    agents_autodelete:
      login_password: 30m

After about 1 hour, all my agent pods were in a crash loop with the message "unable to start crowdsec routines: authenticate watcher (name of pod): API error: ent: machine not found"

I found out that all my machines were deleted from the LAPI (confirmed both through cscli machines list and by looking in the database).

What did you expect to happen?

Only agents that had a latest heartbeat <= now - 30 minutes should be deleted.
Active agents should be kept in the machines list of the LAPI
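
The expected rule amounts to a simple cutoff check. The following is a minimal Python sketch of that rule, not CrowdSec's actual code; the function name `stale_machines` and the dictionary layout are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_machines(last_heartbeats, now, max_age):
    """Return the names of machines whose last heartbeat is at or before now - max_age.

    last_heartbeats maps machine name -> datetime of its last heartbeat
    (a hypothetical layout, used only for this illustration).
    """
    cutoff = now - max_age
    return sorted(name for name, seen in last_heartbeats.items() if seen <= cutoff)


now = datetime(2025, 1, 23, 12, 0, tzinfo=timezone.utc)
heartbeats = {
    "agent-active": now - timedelta(minutes=1),  # still sending heartbeats
    "agent-gone": now - timedelta(hours=2),      # pod no longer exists
}

# Only the agent that has been silent for longer than 30 minutes is selected.
print(stale_machines(heartbeats, now, timedelta(minutes=30)))
```

With a 30-minute limit, only `agent-gone` is selected for deletion; `agent-active` stays in the list.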

How can we reproduce it (as minimally and precisely as possible)?

On version 1.6.4, set the db_config as indicated and wait for 1 hour.

Anything else we need to know?

While analyzing my PostgreSQL database logs, I found the problem. Every now and then, CrowdSec issues a query similar to the one below. It updates the machines table for an agent and also updates the last_heartbeat column. However, it sets last_heartbeat to the timestamp at which the agent was first seen (retrieved from the metrics). If the flush operation runs after this query and before the next heartbeat update from the agent, the agent is wrongly deleted.
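
The sequence of events can be reproduced in a small simulation. This is a hedged Python sketch of the race described in this report, not CrowdSec code; the `flush` function, the `machines` dictionary, and the timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=30)  # mirrors login_password: 30m


def flush(machines, now):
    """Delete every machine whose last_heartbeat is at or before now - MAX_AGE."""
    cutoff = now - MAX_AGE
    for name in [n for n, hb in machines.items() if hb <= cutoff]:
        del machines[name]


start = datetime(2025, 1, 23, 10, 0, tzinfo=timezone.utc)
first_seen = start                  # when the agent first registered
machines = {"agent-1": first_seen}  # machine name -> last_heartbeat

# One hour later the agent is still alive and has just sent a heartbeat.
now = start + timedelta(hours=1)
machines["agent-1"] = now

# The metrics update then overwrites last_heartbeat with the first-seen time
# (the behaviour described in this report).
machines["agent-1"] = first_seen

# The flush runs before the next real heartbeat arrives.
flush(machines, now)
print(machines)  # the active agent is gone
```

Although the agent sent a heartbeat moments earlier, the overwritten timestamp is older than the cutoff, so the flush removes it.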

Image

The last_heartbeat update originates in MachineUpdateBaseMetrics, which I believe shouldn't update the last_heartbeat column at all.
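
One way to express that suggestion is a metrics update that only touches metrics-derived fields. The sketch below is in Python and is not the actual CrowdSec implementation or the content of any fix; the `Machine` class, its fields, and `update_base_metrics` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class Machine:
    # Hypothetical, simplified stand-in for a row in the machines table.
    name: str
    version: str
    last_heartbeat: datetime


def update_base_metrics(machine, version):
    """Return a copy with metrics-derived fields updated.

    last_heartbeat is deliberately left untouched: only the heartbeat
    path should ever write it.
    """
    return replace(machine, version=version)


heartbeat = datetime(2025, 1, 23, 11, 0, tzinfo=timezone.utc)
before = Machine(name="agent-1", version="v1.6.4", last_heartbeat=heartbeat)
after = update_base_metrics(before, version="v1.6.5")

print(after.last_heartbeat == before.last_heartbeat)
```

Keeping the heartbeat write in a single code path means the flush cutoff always compares against a real heartbeat time.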

Crowdsec version

$ cscli version
version: v1.6.4-523164f6
Codename: alphaga
BuildDate: 2024-11-26_13:16:01
GoVersion: 1.23.3
Platform: docker
libre2: C++
User-Agent: crowdsec/v1.6.4-523164f6-docker
Constraint_parser: >= 1.0, <= 3.0
Constraint_scenario: >= 1.0, <= 3.0
Constraint_api: v1
Constraint_acquis: >= 1.0, < 2.0
Built-in optional components: cscli_setup, datasource_appsec, datasource_cloudwatch, datasource_docker, datasource_file, datasource_http, datasource_journalctl, datasource_k8s-audit, datasource_kafka, datasource_kinesis, datasource_loki, datasource_s3, datasource_syslog, datasource_wineventlog

OS version

# On Linux:
$ cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.20.3
PRETTY_NAME="Alpine Linux v3.20"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"

$ uname -a
Linux crowdsec-lapi-59ddb58f96-dzxcr 6.6.58-talos #1 SMP Mon Oct 28 10:12:50 UTC 2024 x86_64 Linux

srkoster added the kind/bug (Something isn't working) label on Jan 23, 2025

@LaurenceJJones
Contributor

Hey, thank you for the detailed report and the pull request. I tagged it as accepted and added it to the next release, as its priority is high.

@LaurenceJJones
Contributor

Fixed by #3425.

Note that the fix will be published in 1.6.5. For the time being, you can point your image tag to :dev, which will be built in the next hour and will include your fix.

@srkoster
Contributor Author

srkoster commented Jan 23, 2025

Sweet, thanks for the quick reaction
