workflows list page delay in showing retry status from details page #12868

amfage · 2024-04-02T03:04:05Z

Pre-requisites

I have double-checked my configuration
I can confirm the issue exists when I tested with :latest
I have searched existing issues and could not find a match for this bug
I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

If a workflow fails, when viewing the workflow and clicking "retry", when returning to the main workflows listing page, the workflow still shows failed, although the workflow is running. Sometimes the workflow listing page will, after a second or maybe 5 seconds, show the correct status, but sometimes it stays displaying the failed status.

The expected behaviour is that the icon shows the blue circle running icon, and the started, finished, duration and message areas show that the workflow is running, as it would if the workflow was running for the first time. This is how it behaved in the previous version we were running which was 3.4.11.

Version

v3.5.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

# This example sleeps for 30 seconds then fails, so it can be retried
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-simple-fail-for-retry-
spec:
  entrypoint: template-fail
  templates:
  - name: template-fail
    script:
      image: python:alpine3.6
      command: ["python"]
      # fail
      source: |
        import sys;
        import time;
        time.sleep(30); 
        sys.exit(1)

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

The text was updated successfully, but these errors were encountered:

agilgur5 · 2024-04-03T01:39:54Z

after a second or maybe 5 seconds, show the correct status

That sounds correct. The Workflows List page runs a ListWatch on the visible k8s resources, so there can be a time delay between when the retry occurs and when it shows up in the ListWatch.

The Workflows Details page can have similar issues as it relies on a RetryWatch.

Albeit there is a difference, due to a copy+paste issue, when a retry is started, the page refreshes, same as with resubmit (the code is copied from resubmit, but a retry does not create a new Workflow, so there isn't actually a need to change pages. and a page change should happen via the internal router as well instead of the browser URL). The retry does not necessarily happen immediately though (especially not after #12538), so it might not be caught during the refresh either.

The main way to improve that would be for the UI to keep a cache of Workflows and reconcile them with the Server, that way the Details page and the List page could re-use the data for a faster update, but that would require a sizeable refactor.

amfage · 2024-04-03T02:29:36Z

It is definitely the case that sometimes the status does not change to running at all - that happened the first time I submitted a retry on version 3.5.5 and I have seen it several times since.

agilgur5 · 2024-04-04T06:31:11Z

but sometimes it stays displaying the failed status.

It is definitely the case that sometimes the status does not change to running at all - that happened the first time I submitted a retry on version 3.5.5 and I have seen it several times since.

You mentioned that before, thanks for confirming it's happened multiple times. That's odd, that really shouldn't happen that an update is missed... there are some race conditions that can happen, but not frequently 🤔 Do you have a rough estimate of how often that happens? Like "1/20 retries" for example

I also checked the UI merge logic which seems to be correct afaict

This is how it behaved in the previous version we were running which was 3.4.11.

This part is interesting, suggesting a regression. A few things changed in 3.5 here. The options panel was added for retries in #11632, which also added the hard refresh after a retry. And most of the UI was refactored to use React hooks; in particular, I did a massive refactor of the Workflows List page in #11891 which made it substantially more efficient as there were many duplicate network calls. That PR has had a few (less than one-liner) bugs in it though.
This sounds like one of the duplicate network calls may have been catching this in 3.4.x 🤔 IIRC one of them was a full re-list, which should refresh the state... However the hard refresh after a retry should also end up refreshing the state

technically speaking, the back-end k8s client Informer libraries rebuild their cache on a certain timeframe as well (the "resync period" is every 20min by default), specifically because they might miss an update (cache invalidation is hard)... I wonder if we should perhaps do some similar logic here in the UI 🤔

amfage · 2024-04-16T04:08:01Z

You mentioned that before, thanks for confirming it's happened multiple times. That's odd, that really shouldn't happen that an update is missed... there are some race conditions that can happen, but not frequently 🤔 Do you have a rough estimate of how often that happens? Like "1/20 retries" for example

I am not sure how often it happens but it did happen last week and was consistent for at least 10 minutes over several workflows, but then reverted to the behaviour of updating the status after about 5 seconds just as I was about to troubleshoot it. I do know that some of those workflow retries were stuck in a "Pending" state (InvalidImageName error, but the workflow didn't fail) but I haven't been able to reproduce it since.

amfage added the type/bug label Apr 2, 2024

shuangkun added the area/ui label Apr 2, 2024

agilgur5 changed the title ~~Argo Workflows main workflows list page does not show workflow retry is running~~ workflows list page does not show workflow retry is running Apr 2, 2024

agilgur5 changed the title ~~workflows list page does not show workflow retry is running~~ workflows list page delay in showing retry status from details page Apr 3, 2024

agilgur5 added the P3 Low priority label Apr 3, 2024

agilgur5 added the type/regression Regression from previous behavior (a specific type of bug) label Apr 4, 2024

agilgur5 mentioned this issue Apr 18, 2024

fix(ui): use router navigation instead of page load after submit #12950

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workflows list page delay in showing retry status from details page #12868

workflows list page delay in showing retry status from details page #12868

amfage commented Apr 2, 2024 •

edited

Loading

agilgur5 commented Apr 3, 2024 •

edited

Loading

amfage commented Apr 3, 2024

agilgur5 commented Apr 4, 2024 •

edited

Loading

amfage commented Apr 16, 2024 •

edited

Loading

workflows list page delay in showing retry status from details page #12868

workflows list page delay in showing retry status from details page #12868

Comments

amfage commented Apr 2, 2024 • edited Loading

Pre-requisites

What happened/what did you expect to happen?

Version

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

Logs from the workflow controller

Logs from in your workflow's wait container

agilgur5 commented Apr 3, 2024 • edited Loading

amfage commented Apr 3, 2024

agilgur5 commented Apr 4, 2024 • edited Loading

amfage commented Apr 16, 2024 • edited Loading

amfage commented Apr 2, 2024 •

edited

Loading

agilgur5 commented Apr 3, 2024 •

edited

Loading

agilgur5 commented Apr 4, 2024 •

edited

Loading

amfage commented Apr 16, 2024 •

edited

Loading