-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: health-checks for Elasticsearch #13322
Conversation
Codecov ReportPatch coverage has no change and project coverage change:
Additional details and impacted files@@ Coverage Diff @@
## master #13322 +/- ##
==========================================
- Coverage 49.01% 49.01% -0.01%
==========================================
Files 247 247
Lines 42685 42685
==========================================
- Hits 20924 20921 -3
- Misses 19645 19647 +2
- Partials 2116 2117 +1 see 2 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
Signed-off-by: Thomas Decaux <ebuildy@gmail.com> Signed-off-by: ebuildy <ebuildy@gmail.com>
I agree, the yellow state in an Elasticsearch cluster should not be considered an error state. In Elasticsearch, cluster health is represented by three states: green, yellow, and red. A green state means that all primary and replica shards are active and allocated in the cluster. A yellow state means that all primary shards are active and allocated, but some replica shards are either missing or initializing. A red state means that one or more primary shards are missing or failed to allocate, indicating a serious problem. While a yellow state indicates that there may be some issues with replica shard allocation, it does not necessarily mean that there is a critical error in the cluster. Depending on your specific use case and requirements, a yellow state may be acceptable and normal. However, if the yellow state persists for an extended period of time or is accompanied by other issues, it may be worth investigating further to ensure the stability and reliability of the Elasticsearch cluster. That said I am fine with this PR. /lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @ebuildy!
* fix: health-checks for Elasticsearch Signed-off-by: Thomas Decaux <ebuildy@gmail.com> Signed-off-by: ebuildy <ebuildy@gmail.com> * fix tests Signed-off-by: ebuildy <ebuildy@gmail.com> --------- Signed-off-by: Thomas Decaux <ebuildy@gmail.com> Signed-off-by: ebuildy <ebuildy@gmail.com>
* fix: health-checks for Elasticsearch Signed-off-by: Thomas Decaux <ebuildy@gmail.com> Signed-off-by: ebuildy <ebuildy@gmail.com> * fix tests Signed-off-by: ebuildy <ebuildy@gmail.com> --------- Signed-off-by: Thomas Decaux <ebuildy@gmail.com> Signed-off-by: ebuildy <ebuildy@gmail.com>
argocd application go to "degraged" state as soon as elasticsearch cluster health go to yellow state. Which is wrong, yellow state should not be considered as a broken state:
"when a new index is created, the cluster will go to yellow state, because replicas are not ready"
This can flood notifications with wrong alerts, because elasticsearch can create / move indices all the time (we got more than 10 wrong alerts per day).
To me, this is a bug on elasticsearch side (I mean cluster health check should wait a little time before changing color), but according discussion with elastic team at https://discuss.elastic.co/t/cluster-health-wrong-yellow-spikes-because-new-index/330124/6, this is normal, and yellow "spike" should not be considered as error.
The current argocd health check implementation is reporting error as soon as elasticsearch cluster health changes to yellow, I edit it to consider yellow state as "progressing" app state.
A better solution will be to add a kind a buffer, in order to give some time before the argocd application degrades.
Relate: