add invalid captcha and messageChannel sync status health monitoring #10029

Open · wants to merge 6 commits into main
Conversation

@etiennejouan (Contributor) commented Feb 5, 2025

Context:
We want to implement counters to monitor server health. The first counters will track the messageChannel sync status during job execution and invalid captcha attempts.

How:
Counters are stored in the cache and grouped into one-minute windows. A controller is created for each metric, aggregating its counters over a five-minute window.
The endpoints are public and will be queried by Prometheus.
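
For illustration, here is a minimal sketch of how such cache-backed, time-windowed counters could look, assuming a simple async cache abstraction; the class and method names below are hypothetical, not the PR's actual implementation:

```typescript
// Hypothetical sketch: increment a per-minute counter key and aggregate the
// last five one-minute windows. Names (HealthCacheService, cacheStorage) are
// illustrative, not taken from this PR.
export class HealthCacheService {
  constructor(
    private readonly cacheStorage: {
      get<T>(key: string): Promise<T | undefined>;
      set<T>(key: string, value: T): Promise<void>;
    },
  ) {}

  private minuteKey(metric: string, date = new Date()): string {
    // Bucket by minute, e.g. "invalid-captcha:2025-02-05T10:42"
    return `${metric}:${date.toISOString().slice(0, 16)}`;
  }

  async increment(metric: string): Promise<void> {
    const key = this.minuteKey(metric);
    const current = (await this.cacheStorage.get<number>(key)) ?? 0;
    await this.cacheStorage.set(key, current + 1);
  }

  async sumLastFiveMinutes(metric: string): Promise<number> {
    const now = Date.now();
    const keys = Array.from({ length: 5 }, (_, i) =>
      this.minuteKey(metric, new Date(now - i * 60_000)),
    );
    const values = await Promise.all(
      keys.map((key) => this.cacheStorage.get<number>(key)),
    );
    return values.reduce((sum: number, value) => sum + (value ?? 0), 0);
  }
}
```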

closes twentyhq/core-team-issues#55

@greptile-apps (bot) left a comment

PR Summary

This PR implements health monitoring for message channel sync status and invalid captcha attempts, with metrics exposed for Prometheus integration.

  • Added HealthCacheService with time-windowed counters (5-minute aggregation window) for tracking message channel sync jobs and invalid captcha attempts
  • Added new endpoints /healthz/message-channel-sync-job-by-status and /healthz/invalid-captcha in HealthController for Prometheus metrics (a hedged sketch of one such endpoint follows this list)
  • Integrated health monitoring in MessageChannelSyncStatusService to track sync status transitions (ONGOING, ACTIVE, FAILED_UNKNOWN, FAILED_INSUFFICIENT_PERMISSIONS)
  • Added EngineHealth namespace to CacheStorageNamespace enum with 10-minute TTL for health metrics storage
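
As an illustration of the endpoint shape: the /healthz route below matches the PR description, but the controller body and the injected service interface are assumptions (NestJS-style controller plus the cache service sketched above), not the PR's actual code.

```typescript
import { Controller, Get } from '@nestjs/common';

// Hypothetical sketch of a Prometheus-scrapeable endpoint.
@Controller('healthz')
export class InvalidCaptchaHealthController {
  constructor(
    private readonly healthCacheService: {
      sumLastFiveMinutes(metric: string): Promise<number>;
    },
  ) {}

  @Get('invalid-captcha')
  async getInvalidCaptcha(): Promise<string> {
    const count = await this.healthCacheService.sumLastFiveMinutes(
      'invalid-captcha',
    );

    // Prometheus text exposition format: HELP/TYPE lines followed by the sample.
    return [
      '# HELP invalid_captcha_total Invalid captcha attempts over the last 5 minutes',
      '# TYPE invalid_captcha_total gauge',
      `invalid_captcha_total ${count}`,
    ].join('\n');
  }
}
```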

12 files reviewed, 6 comments

@ijreilly (Collaborator) left a comment

Overall this seems good to me, but I wonder whether a counter is the right strategy: is our traffic stable enough to judge absolute numbers (for instance, the absolute number of invalid captchas)?
What will Prometheus do with these numbers? Compare them?

@ijreilly (Collaborator) left a comment

Great!! Looks cool.

@FelixMalfait (Member) commented

This is great! We can merge it like this, but FYI we probably could have done something more efficient if it were more tightly coupled with Redis (we have an abstraction layer, but since we decided to get rid of the other implementations we might as well couple it now...).

We don't need to optimize for a high-throughput scenario, but at very high volume the numbers here could be wrong: you do a read and then a write, so with multiple Node workers running you could have two concurrent read/write operations on the same counter and lose an increment.

Redis reference for increments:
https://www.tutorialspoint.com/redis/hashes_hincrby.htm
(HINCRBY is atomic, so there's no concurrency issue: https://stackoverflow.com/questions/38954590/is-redis-hincrby-atomic)
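
For reference, a minimal sketch of that atomic increment, assuming ioredis (the key and field names are made up):

```typescript
import Redis from 'ioredis';

// HINCRBY performs the increment inside Redis in a single command, so
// concurrent workers cannot lose an update the way a cache read followed
// by a write can.
const redis = new Redis();

export async function incrementInvalidCaptcha(): Promise<void> {
  // One hash field per one-minute bucket, e.g. "2025-02-05T10:42" (illustrative key).
  const minuteBucket = new Date().toISOString().slice(0, 16);
  await redis.hincrby('health:invalid-captcha', minuteBucket, 1);
}
```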

And with a different implementation we could have used sorted sets with each timestamp as an entry:
https://www.w3resource.com/redis/redis-zrangebyscore-key-min-max.php
(and https://www.w3resource.com/redis/redis-zremrangebyscore-key-min-max.php to clean up automatically)
This would give more flexibility for charting but has other downsides.
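
A hedged sketch of that sorted-set variant (again assuming ioredis; key and member formats are illustrative):

```typescript
import Redis from 'ioredis';

// Each event is stored with its timestamp as the score, so any time range
// can be counted, charted, or trimmed.
const redis = new Redis();
const KEY = 'health:invalid-captcha:events';

export async function recordInvalidCaptcha(): Promise<void> {
  const now = Date.now();
  // Unique member per event; the score is the timestamp used for range queries.
  await redis.zadd(KEY, now, `${now}:${Math.random().toString(36).slice(2)}`);
}

export async function countLastFiveMinutes(): Promise<number> {
  const now = Date.now();
  // ZREMRANGEBYSCORE trims entries older than ten minutes to keep the set bounded.
  await redis.zremrangebyscore(KEY, 0, now - 10 * 60_000);
  // ZCOUNT returns how many events fall inside the five-minute window.
  return redis.zcount(KEY, now - 5 * 60_000, now);
}
```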

In any case, it doesn't really matter, because we'd still get the right order of magnitude even if there were a concurrency issue! Great work!!! We should definitely display all of these metrics in the Server Admin Panel.

Successfully merging this pull request may close this issue: Update monitoring for messaging
4 participants