Timeout for sync operations #6055
Comments
I would like to work on this issue. |
Is there any update on that? |
@RaviHari is there any update on this? We've run into this issue because a pre-hook failed and the app is now locked in a permanent syncing state. |
@prima101112 and @hanzala1234 sorry for the delay. I will get started on this and keep you posted in this thread. |
@RaviHari Did you get round to starting on this? |
+1 |
+1 |
I've also seen "Terminate" simply cause the sync operation to get stuck in "Terminating." This was in an app with ~1k resources. If Ravi or anyone else puts up a PR, I'd be happy to review. |
+1 |
Looking forward to this feature too. I have a lot of applications getting stuck, and a timeout would be great so that they don't block other, unrelated resources. |
It seems like @RaviHari has lost interest in this, at least he stopped responding. We'd still appreciate that feature very much (we're using the app-of-apps pattern and sometimes it just gets stuck, and a timeout would really help). Any chance someone else can implement this? |
To work around sync being stuck due to hooks or operations taking too long, I've implemented the following: It's equivalent to clicking |
Another way to work around it is to run the following as a CronJob.

from datetime import datetime, timedelta
import logging
import os
import sys

from kubernetes import client, config
import requests

logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'info').upper(), format='[%(levelname)s] %(message)s')

# Timeout in minutes is taken from the first CLI argument (default: 60).
try:
    timeout_minutes = int(sys.argv[1])
except IndexError:
    timeout_minutes = 60

argocd_server = os.environ['ARGOCD_SERVER']
argocd_token = os.environ['ARGOCD_TOKEN']

# List the Application resources in the 'argocd' namespace.
config.load_incluster_config()
api = client.CustomObjectsApi()
apps = api.list_namespaced_custom_object(
    group='argoproj.io',
    version='v1alpha1',
    namespace='argocd',
    plural='applications'
)['items']

# Keep only the apps whose current operation is still running.
syncing_apps = [app for app in apps if app.get('status', {}).get('operationState', {}).get('phase') == 'Running']


def apps_to_timeout():
    """Yield the names of apps that have been syncing longer than the timeout."""
    now = datetime.utcnow()
    logging.debug(f"Time now {now.isoformat()}")
    for app in syncing_apps:
        app_name = app['metadata']['name']
        # startedAt ends in 'Z'; strip it so the result is naive UTC, like `now`.
        sync_started = datetime.fromisoformat(app['status']['operationState']['startedAt'].removesuffix('Z'))
        logging.debug(f"App '{app_name}' started syncing at {sync_started.isoformat()}")
        if now - sync_started > timedelta(minutes=timeout_minutes):
            yield app_name


apps = list(apps_to_timeout())
logging.info(f"Number of apps syncing longer than timeout of {timeout_minutes} minutes: {len(apps)}")

# Terminate each stuck operation through the Argo CD API.
session = requests.session()
session.cookies.set('argocd.token', argocd_token)
for app_name in apps:
    session.delete(f"https://{argocd_server}/api/v1/applications/{app_name}/operation")
    logging.info(f"Terminated sync operation for '{app_name}'")
|
Has there been any progress on implementing this? We have this issue daily. |
Same here. We are using an external cron job to detect and terminate "stuck" syncs, which is very inconvenient and painful to maintain. We've been waiting for this for years already 🙏🏽 |
@riuvshyn could you share the cronjob? 🙏 we could really use that while waiting for this feature to get implemented. |
We currently have this as a CronJob.

from datetime import datetime, timedelta, UTC
import logging
import os
import sys

from kubernetes import client, config
import requests

logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'info').upper(), format='[%(levelname)s] %(message)s')
logging.getLogger("kubernetes.client.rest").setLevel(os.environ.get("KUBE_LOG_LEVEL", "info").upper())

# Timeout in minutes is taken from the first CLI argument (default: 60).
try:
    timeout_minutes = int(sys.argv[1])
except IndexError:
    timeout_minutes = 60

argocd_server = os.environ['ARGOCD_SERVER']
argocd_token = os.environ['ARGOCD_TOKEN']

# Prefer in-cluster credentials; fall back to the local kubeconfig.
try:
    config.load_incluster_config()
except config.config_exception.ConfigException:
    try:
        config.load_kube_config()
    except config.config_exception.ConfigException:
        raise Exception("Could not configure kubernetes client.")

# List Application resources across all namespaces.
api = client.CustomObjectsApi()
apps = api.list_cluster_custom_object(
    group='argoproj.io',
    version='v1alpha1',
    plural='applications'
)['items']

# Keep only the apps whose current operation is still running.
syncing_apps = [app for app in apps if app.get('status', {}).get('operationState', {}).get('phase') == 'Running']


def apps_to_timeout():
    """Yield (namespace, name) for apps that have been syncing longer than the timeout."""
    now = datetime.now(UTC)
    logging.debug(f"Time now {now.isoformat()}")
    for app in syncing_apps:
        app_name = app['metadata']['name']
        app_namespace = app['metadata']['namespace']
        sync_started = datetime.fromisoformat(app['status']['operationState']['startedAt'])
        logging.debug(f"App '{app_namespace}/{app_name}' started syncing at {sync_started.isoformat()}")
        if now - sync_started > timedelta(minutes=timeout_minutes):
            yield app_namespace, app_name


apps = list(apps_to_timeout())
logging.info(f"Number of apps syncing longer than timeout of {timeout_minutes} minutes: {len(apps)}")

session = requests.session()
session.cookies.set('argocd.token', argocd_token)
session.headers.update({'Content-Type': 'application/json'})

responses = []
for app_namespace, app_name in apps:
    logging.debug(f"Terminating sync operation for '{app_namespace}/{app_name}'")
    response = session.delete(
        f"https://{argocd_server}/api/v1/applications/{app_name}/operation",
        params={'appNamespace': app_namespace}
    )
    logging.debug(f"[{response.status_code}] {response.text}")
    # Only report success when the API call actually succeeded.
    if response.ok:
        logging.info(f"Terminated sync operation for '{app_namespace}/{app_name}'")
    else:
        logging.warning(f"Failed to terminate sync operation for '{app_namespace}/{app_name}'")
    responses.append(response)

# Exit non-zero so the CronJob run is marked as failed.
if not all(response.ok for response in responses):
    logging.error("Some sync operations failed to terminate")
    sys.exit(1)
|
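For anyone adapting the script above, the timeout check can be pulled out into a pure function and tested without a cluster or an Argo CD server. This is only a sketch: `parse_started_at`, `find_stuck_apps`, and the sample Application dictionaries are made up for illustration, and it assumes `startedAt` is a UTC timestamp ending in `Z`, as the scripts in this thread do.

```python
from datetime import datetime, timedelta, timezone


def parse_started_at(value):
    """Parse a timestamp such as '2024-01-01T10:00:00Z' into an aware datetime."""
    # Swapping the trailing 'Z' for an explicit offset keeps this working on
    # Python versions older than 3.11, where fromisoformat() rejects 'Z'.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def find_stuck_apps(apps, now, timeout):
    """Return (namespace, name) for apps whose operation has been Running longer than timeout."""
    stuck = []
    for app in apps:
        state = app.get("status", {}).get("operationState") or {}
        if state.get("phase") != "Running":
            continue
        if now - parse_started_at(state["startedAt"]) > timeout:
            meta = app["metadata"]
            stuck.append((meta["namespace"], meta["name"]))
    return stuck


# Hypothetical sample data, shaped like the Application objects listed above.
sample_apps = [
    {"metadata": {"namespace": "argocd", "name": "slow"},
     "status": {"operationState": {"phase": "Running", "startedAt": "2024-01-01T10:00:00Z"}}},
    {"metadata": {"namespace": "argocd", "name": "recent"},
     "status": {"operationState": {"phase": "Running", "startedAt": "2024-01-01T11:30:00Z"}}},
    {"metadata": {"namespace": "argocd", "name": "done"},
     "status": {"operationState": {"phase": "Succeeded", "startedAt": "2024-01-01T09:00:00Z"}}},
    {"metadata": {"namespace": "argocd", "name": "idle"}},
]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stuck = find_stuck_apps(sample_apps, now, timedelta(minutes=60))
print(stuck)  # [('argocd', 'slow')]
```

Passing `now` in as an argument, rather than reading the clock inside the function, is what makes the check reproducible in a test.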
80+ people waiting on this... would be really great if ArgoCD doesn't get stuck in Sync status if it immediately encounters errors from any of the resource application commands... |
Helps with argoproj#6055 Introduces a controller-level configuration for terminating sync after timeout. Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
I've taken a stab at this with support for a controller-level configured timeout. If this lands well, I can also work on application-specific overrides. |
* feat: Sync timeouts for applications (#6055)
  Helps with #6055
  Introduces a controller-level configuration for terminating sync after timeout.
  Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
* Fix env variable name
  Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
---------
Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
Summary
At the moment, if for whatever reason the sync process gets stuck (e.g. because some resource fails to start up properly and keeps on retrying), the sync will never complete and will keep on "Syncing".
There should be an option to add a timeout, after which the sync process would terminate. Depending on selfHeal rules, etc., there may be a need to automatically retry, or alternatively, the application should just stay in the failed state until manually resolved.
Did my best to search for similar requests, aside from a brief note in #1886, couldn't find anything - sorry if I missed it.
Motivation
At the moment, we've set up alerting for sync operations that are taking too long, which at least notifies someone to look at things and usually means a manual intervention.
When an application is in a "Syncing" state, manual intervention becomes rather tricky - one cannot delete resources to get them recreated (esp. when things are stuck in some sync wave), or perform a partial sync, etc.
Moreover, simply hitting "Terminate" is not always sufficient if the application has autosync enabled, as it would just retry, putting it into a forever "Syncing" state. Disabling autosync in some cases might also be problematic and require multiple steps, because it might be set from a parent application - which means that the parent application autosync also needs to be disabled (so that it does not just resync and re-enable the autosync).
Proposal
Some of the things that might need consideration:
selfHeal just retry? Or should that be configurable? The previous sync might not have completed in full, so hooks/postsync actions might not have executed.
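One way to frame the open question above is as a small policy function: once an operation exceeds the timeout it gets terminated, and a setting decides whether the application is retried or left in the failed state for manual resolution. The sketch below is purely illustrative and does not describe how Argo CD implements anything; `SyncTimeoutPolicy`, `retry_on_timeout`, and the `Action` names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Action(Enum):
    WAIT = "wait"                                # still within the timeout
    TERMINATE_AND_RETRY = "terminate-and-retry"  # stop the sync, then sync again
    TERMINATE_AND_HOLD = "terminate-and-hold"    # stop the sync, stay failed


@dataclass(frozen=True)
class SyncTimeoutPolicy:
    # None means "no timeout", i.e. the behaviour described in the Summary.
    timeout: Optional[timedelta]
    # Hypothetical switch for the "should that be configurable?" question.
    retry_on_timeout: bool = False

    def decide(self, started_at: datetime, now: datetime) -> Action:
        """Decide what to do with a sync operation that started at started_at."""
        if self.timeout is None or now - started_at <= self.timeout:
            return Action.WAIT
        if self.retry_on_timeout:
            return Action.TERMINATE_AND_RETRY
        return Action.TERMINATE_AND_HOLD


started = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
hold_policy = SyncTimeoutPolicy(timeout=timedelta(minutes=30))
retry_policy = SyncTimeoutPolicy(timeout=timedelta(minutes=30), retry_on_timeout=True)

print(hold_policy.decide(started, started + timedelta(minutes=10)))
print(hold_policy.decide(started, started + timedelta(minutes=45)))
print(retry_policy.decide(started, started + timedelta(minutes=45)))
```

A retry after a timed-out sync would still have to account for hooks or postsync actions that never ran, which this sketch does not attempt to model.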