Timeout for sync operations #6055

dominykas · 2021-04-19T08:56:58Z

Summary

At the moment, if for whatever reason the sync process gets stuck (e.g. because some resource fails to start up properly and keeps on retrying), the sync will never complete and will keep on "Syncing".

There should be an option to add a timeout, after which the sync process would terminate. Depending on selfHeal rules, etc., there may be a need to automatically retry, or alternatively, the application should just stay in the failed state until manually resolved.

I did my best to search for similar requests; aside from a brief note in #1886, I couldn't find anything. Sorry if I missed it.

Motivation

At the moment, we've set up alerting for sync operations that are taking too long, which at least notifies someone to look at things and usually means a manual intervention.

When an application is in a "Syncing" state, manual intervention becomes rather tricky: one cannot delete resources to get them recreated (especially when things are stuck in some sync wave), or perform a partial sync, etc.

Moreover, simply hitting "Terminate" is not always sufficient if the application has autosync enabled, as it would just retry, leaving it in a perpetual "Syncing" state. Disabling autosync in some cases might also be problematic and require multiple steps, because it might be set from a parent application, which means that the parent application's autosync also needs to be disabled (so that it does not just resync and re-enable the autosync).

Proposal

syncPolicy:
    syncTimeout: 600 # seconds, default: unlimited
    onSyncTimeout: "fail" # or "retry" (?), or "waitForUpdate" (?)

Some of the things that might need consideration:

Should selfHeal just retry? Or should that be configurable? The previous sync might not have completed in full, so hooks/postsync actions might not have executed.
Should new commits result in a new sync operation? Same as above, essentially. Arguably, new commits could be the fix.
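
The open questions above can be made concrete by writing the timeout decision as a small pure function. This is only an illustrative sketch: `syncTimeout` and `onSyncTimeout` are the fields proposed in this issue, not existing Argo CD settings, and the `decide` helper, its parameters, and the returned action names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical policy values, taken from the proposal above.
FAIL = "fail"
RETRY = "retry"
WAIT_FOR_UPDATE = "waitForUpdate"


def decide(started_at: datetime, now: datetime,
           sync_timeout: Optional[int],
           on_sync_timeout: str = FAIL,
           new_revision_available: bool = False) -> str:
    """Return the action a controller might take for a running sync.

    The result is one of "continue", "fail", "retry" or "wait".
    """
    # No timeout configured: the proposed default is unlimited.
    if sync_timeout is None:
        return "continue"
    # Still within the allowed window.
    if now - started_at <= timedelta(seconds=sync_timeout):
        return "continue"
    if on_sync_timeout == RETRY:
        return "retry"
    if on_sync_timeout == WAIT_FOR_UPDATE:
        # Start again only once a new commit shows up, since it could be the fix.
        return "retry" if new_revision_available else "wait"
    # Default: stay failed until someone resolves it manually.
    return "fail"


started = datetime(2021, 4, 19, 9, 0, 0, tzinfo=timezone.utc)
print(decide(started, started + timedelta(seconds=601), 600))  # fail
```

Keeping the decision separate from the termination itself makes it easier to discuss each policy on its own, including whether hooks that never ran should be taken into account before a retry.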

RaviHari · 2021-10-20T17:43:12Z

I would like to work on this issue.

hanzala1234 · 2022-04-13T10:39:21Z

Is there any update on that?

prima101112 · 2022-04-20T09:36:00Z

@RaviHari is there any update on this? I've been hit by this issue because a pre-hook failed, and the app is now locked in a permanently syncing state.

RaviHari · 2022-04-20T10:05:14Z

@prima101112 and @hanzala1234, sorry for the delay. I will get started on this and keep you posted in this thread.

LS80 · 2022-06-09T14:44:44Z

@RaviHari Did you get round to starting on this?

grezar · 2022-07-29T08:51:42Z

+1

yabeenico · 2022-08-12T04:57:41Z

+1

crenshaw-dev · 2022-08-12T14:22:47Z

> Moreover, simply hitting "Terminate" is not always sufficient

I've also seen "Terminate" simply cause the sync operation to get stuck in "Terminating." This was in an app with ~1k resources.

If Ravi or anyone else puts up a PR, I'd be happy to review.

pritam-acquia · 2022-10-28T11:59:49Z

+1

mhonorio · 2022-11-29T15:33:28Z

Looking forward to this feature too. I have a lot of applications getting stuck, and a timeout would be great so that they don't block other, unrelated resources.

neiser · 2023-01-03T10:04:50Z

It seems like @RaviHari has lost interest in this, at least he stopped responding. We'd still appreciate that feature very much (we're using the app-of-apps pattern and sometimes it just gets stuck, and a timeout would really help). Any chance someone else can implement this?

Sayrus · 2023-02-21T15:47:44Z

To work around sync being stuck due to hooks or operations taking too long, I've implemented the following:

Sayrus@817bc34

It's equivalent to clicking Terminate after reaching the timeout. The operation ends up as Sync Failed, which blocks self-healing from auto-syncing the application (Skipping auto-sync: failed previous sync attempt to xxxx). This is probably not the best way to do it, but it works.

LS80 · 2023-10-09T18:44:35Z

Another way to work around it is to run the following as a CronJob.

from datetime import datetime, timedelta
import logging
import os
import sys

from kubernetes import client, config
import requests

logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'info').upper(), format='[%(levelname)s] %(message)s')

try:
    timeout_minutes = int(sys.argv[1])
except IndexError:
    timeout_minutes = 60

argocd_server = os.environ['ARGOCD_SERVER']
argocd_token = os.environ['ARGOCD_TOKEN']

config.load_incluster_config()

api = client.CustomObjectsApi()

apps = api.list_namespaced_custom_object(
    group='argoproj.io',
    version='v1alpha1',
    namespace='argocd',
    plural='applications'
)['items']

syncing_apps = [app for app in apps if app.get('status', {}).get('operationState', {}).get('phase') == 'Running']

def apps_to_timeout():
    now = datetime.utcnow()
    logging.debug(f"Time now {now.isoformat()}")

    for app in syncing_apps:
        app_name = app['metadata']['name']
        sync_started = datetime.fromisoformat(app['status']['operationState']['startedAt'].removesuffix('Z'))
        logging.debug(f"App '{app_name}' started syncing at {sync_started.isoformat()}")

        if now - sync_started > timedelta(minutes=timeout_minutes):
            yield app_name

apps = list(apps_to_timeout())
logging.info(f"Number of apps syncing longer than timeout of {timeout_minutes} minutes: {len(apps)}")

session = requests.session()
session.cookies.set('argocd.token', argocd_token)

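for app_name in apps:
    response = session.delete(f"https://{argocd_server}/api/v1/applications/{app_name}/operation")
    # Only report success if the API accepted the termination request.
    if response.ok:
        logging.info(f"Terminated sync operation for '{app_name}'")
    else:
        logging.error(f"[{response.status_code}] Failed to terminate sync operation for '{app_name}'")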

aslafy-z · 2023-10-09T21:07:34Z

@alexec would you mind taking a look at #15603?

travis-jorge · 2024-07-19T13:33:28Z

Has there been any progress on implementing this? We have this issue daily.

riuvshyn · 2024-10-04T16:32:35Z

Same here. We are using an external cron job to detect and terminate "stuck" syncs, which is very inconvenient and painful to maintain. We've been waiting for this for years already 🙏🏽

jessebye · 2024-10-07T19:52:14Z

@riuvshyn could you share the cronjob? 🙏 we could really use that while waiting for this feature to get implemented.

LS80 · 2024-10-08T11:56:07Z

We currently have this as a CronJob.

from datetime import datetime, timedelta, UTC
import logging
import os
import sys

from kubernetes import client, config
import requests

logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'info').upper(), format='[%(levelname)s] %(message)s')
logging.getLogger("kubernetes.client.rest").setLevel(os.environ.get("KUBE_LOG_LEVEL", "info").upper())

try:
    timeout_minutes = int(sys.argv[1])
except IndexError:
    timeout_minutes = 60

argocd_server = os.environ['ARGOCD_SERVER']
argocd_token = os.environ['ARGOCD_TOKEN']

try:
    config.load_incluster_config()
except config.config_exception.ConfigException:
    try:
        config.load_kube_config()
    except config.config_exception.ConfigException:
        raise Exception("Could not configure kubernetes client.")

api = client.CustomObjectsApi()

apps = api.list_cluster_custom_object(
    group='argoproj.io',
    version='v1alpha1',
    plural='applications'
)['items']

syncing_apps = [app for app in apps if app.get('status', {}).get('operationState', {}).get('phase') == 'Running']

def apps_to_timeout():
    now = datetime.now(UTC)
    logging.debug(f"Time now {now.isoformat()}")

    for app in syncing_apps:
        app_name = app['metadata']['name']
        app_namespace = app['metadata']['namespace']
        sync_started = datetime.fromisoformat(app['status']['operationState']['startedAt'])
        logging.debug(f"App '{app_namespace}/{app_name}' started syncing at {sync_started.isoformat()}")

        if now - sync_started > timedelta(minutes=timeout_minutes):
            yield app_namespace, app_name

apps = list(apps_to_timeout())
logging.info(f"Number of apps syncing longer than timeout of {timeout_minutes} minutes: {len(apps)}")

session = requests.session()
session.cookies.set('argocd.token', argocd_token)
session.headers.update({'Content-Type': 'application/json'})

responses = []
for app_namespace, app_name in apps:
    logging.debug(f"Terminating sync operation for '{app_namespace}/{app_name}'")
    response = session.delete(
        f"https://{argocd_server}/api/v1/applications/{app_name}/operation",
        params={'appNamespace': app_namespace}
    )
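    logging.debug(f"[{response.status_code}] {response.text}")
    # Only report success if the API accepted the termination request.
    if response.ok:
        logging.info(f"Terminated sync operation for '{app_namespace}/{app_name}'")
    else:
        logging.error(f"Failed to terminate sync operation for '{app_namespace}/{app_name}'")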
    responses.append(response)

if not all(response.ok for response in responses):
    logging.error("Some sync operations failed to terminate")
    sys.exit(1)
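
The selection step of the script above can also be pulled out into a pure function, which makes it testable without a cluster. This is a minimal sketch and not part of the posted script: `parse_started_at` and `select_timed_out` are hypothetical names, and the sample dictionaries are made up to mimic only the `metadata` and `status.operationState` fields the script reads. The `Z` suffix is rewritten to `+00:00` because `datetime.fromisoformat` accepts `Z` only from Python 3.11 onwards.

```python
from datetime import datetime, timedelta, timezone


def parse_started_at(value: str) -> datetime:
    """Parse a timestamp such as '2024-10-08T10:00:00Z' into an aware datetime."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def select_timed_out(apps, now, timeout):
    """Yield (namespace, name) for apps whose running operation exceeds timeout."""
    for app in apps:
        state = app.get("status", {}).get("operationState", {})
        if state.get("phase") != "Running":
            continue
        if now - parse_started_at(state["startedAt"]) > timeout:
            yield app["metadata"]["namespace"], app["metadata"]["name"]


def make_app(name, phase=None, started_at=None):
    """Build a made-up Application dict for illustration only."""
    app = {"metadata": {"namespace": "argocd", "name": name}}
    if phase is not None:
        app["status"] = {"operationState": {"phase": phase, "startedAt": started_at}}
    return app


sample_apps = [
    make_app("stuck", "Running", "2024-10-08T10:00:00Z"),   # running for 2 hours
    make_app("fresh", "Running", "2024-10-08T11:30:00Z"),   # running for 30 minutes
    make_app("done", "Succeeded", "2024-10-08T09:00:00Z"),  # not running any more
    make_app("idle"),                                        # never synced
]

now = datetime(2024, 10, 8, 12, 0, 0, tzinfo=timezone.utc)
timed_out = list(select_timed_out(sample_apps, now, timedelta(minutes=60)))
print(timed_out)  # [('argocd', 'stuck')]
```

Passing `now` and `timeout` in as arguments, rather than reading the clock and `sys.argv` inside the function, is what keeps the check deterministic.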

philipp-durrer-jarowa · 2024-11-11T15:03:25Z

80+ people are waiting on this... It would be really great if ArgoCD didn't get stuck in Sync status when it immediately encounters errors from any of the resource application commands...

Helps with argoproj#6055 Introduces a controller-level configuration for terminating sync after timeout. Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

andrii-korotkov-verkada · 2024-11-17T13:47:30Z

I've taken a stab at this, with support for a controller-level configured timeout. If this lands well, I can also work on application-specific overrides.

* feat: Sync timeouts for applications (#6055) Helps with #6055 Introduces a controller-level configuration for terminating sync after timeout. Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com> * Fix env variable name Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com> --------- Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

dominykas added the enhancement New feature or request label Apr 19, 2021

alexmt added the hacktoberfest label Oct 1, 2021

alexmt assigned RaviHari Oct 20, 2021

crenshaw-dev removed the hacktoberfest label Sep 20, 2023

crenshaw-dev unassigned RaviHari Sep 20, 2023

aslafy-z mentioned this issue Oct 9, 2023

feat: auto-refresh on new revisions during sync retries (Alpha) (#11494) #15603

Open

13 tasks

alexmt mentioned this issue Dec 17, 2023

docs: proposal to implement sync timeout and termination settings #16630

Merged

sherifabdlnaby mentioned this issue Apr 5, 2024

Retrying failed sync's block newer commits; how to achieve declarative, level based gitops semantics? #11494

Open

3 tasks

alexmt added component:argo-cd type:enhancement labels Jul 19, 2024

crenshaw-dev added the sync-waves label Oct 6, 2024

jkleinlercher mentioned this issue Oct 25, 2024

install-platform should get more resilient (retry failed syncs) suxess-it/kubriX#766

Closed

andrii-korotkov-verkada self-assigned this Nov 17, 2024

andrii-korotkov-verkada mentioned this issue Nov 17, 2024

feat: Sync timeouts for applications (#6055) #20816

Merged

14 tasks
