Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Template parameter workflow.creationTimestamp is incorrect when a workflow is retried #11488

Closed
2 of 3 tasks
Gerrit-K opened this issue Jul 31, 2023 · 10 comments
Closed
2 of 3 tasks
Labels
area/templating Templating with `{{...}}` P3 Low priority solution/workaround There's a workaround, might not be great, but exists type/bug

Comments

@Gerrit-K
Copy link

Gerrit-K commented Jul 31, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

When using workflow.creationTimestamp inside a template, I would expect it to match the metadata.creationTimestamp field of the argoproj.io/v1alpha1.Workflow object. This is the case for the first run, but if the workflow fails and is retried, the template variable is updated with the current datetime, while the kubernetes metadata field still shows the old timestamp.

To reproduce:

  1. apply the MRE below
  2. let it fail
  3. compare workflow metadata with container logs -> the timestamps match
  4. wait a bit
  5. retry the workflow via the UI
  6. compare the two timestamps again -> the workflow metadata hasn't changed, but the logs show a newer timestamp

Version

latest (v3.4.9)

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-template-whalesay-template
spec:
  entrypoint: whalesay-template
  templates:
  - name: whalesay-template
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [sh, -c]
      args: ["cowsay '{{inputs.parameters.message}}' && exit 1"]
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: testing
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    steps:
      - - name: call-whalesay-template
          templateRef:
            name: workflow-template-whalesay-template
            template: whalesay-template
          arguments:
            parameters:
            - name: message
              value: "{{=sprig.date('2006-01-02 15:04:05', workflow.creationTimestamp)}}"
---

Logs from the workflow controller

time="2023-07-31T14:02:13.429Z" level=info msg="Processing workflow" namespace=testing workflow=testing
time="2023-07-31T14:02:13.443Z" level=info msg="Updated phase  -> Running" namespace=testing workflow=testing
time="2023-07-31T14:02:13.443Z" level=info msg="Steps node testing initialized Running" namespace=testing workflow=testing
time="2023-07-31T14:02:13.443Z" level=info msg="StepGroup node testing-896689527 initialized Running" namespace=testing workflow=testing
time="2023-07-31T14:02:13.443Z" level=info msg="Pod node testing-1931908451 initialized Pending" namespace=testing workflow=testing
time="2023-07-31T14:02:13.451Z" level=info msg="Created pod: testing[0].call-whalesay-template (testing-whalesay-template-1931908451)" namespace=testing workflow=testing
time="2023-07-31T14:02:13.451Z" level=info msg="Workflow step group node testing-896689527 not yet completed" namespace=testing workflow=testing
time="2023-07-31T14:02:13.451Z" level=info msg="TaskSet Reconciliation" namespace=testing workflow=testing
time="2023-07-31T14:02:13.451Z" level=info msg=reconcileAgentPod namespace=testing workflow=testing
time="2023-07-31T14:02:13.456Z" level=info msg="Workflow update successful" namespace=testing phase=Running resourceVersion=3653332 workflow=testing
time="2023-07-31T14:02:23.454Z" level=info msg="Processing workflow" namespace=testing workflow=testing
time="2023-07-31T14:02:23.454Z" level=info msg="Task-result reconciliation" namespace=testing numObjs=0 workflow=testing
time="2023-07-31T14:02:23.455Z" level=info msg="Pod failed: Error (exit code 1)" displayName=call-whalesay-template namespace=testing pod=testing-whalesay-template-1931908451 templateName= workflow=testing
time="2023-07-31T14:02:23.455Z" level=info msg="node changed" namespace=testing new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=testing-1931908451 old.message= old.phase=Pending old.progress=0/1 workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="Step group node testing-896689527 deemed failed: child 'testing-1931908451' failed" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="node testing-896689527 phase Running -> Failed" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="node testing-896689527 message: child 'testing-1931908451' failed" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="node testing-896689527 finished: 2023-07-31 14:02:23.456108297 +0000 UTC" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="step group testing-896689527 was unsuccessful: child 'testing-1931908451' failed" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="Outbound nodes of testing-1931908451 is [testing-1931908451]" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="Outbound nodes of testing is [testing-1931908451]" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="node testing phase Running -> Failed" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="node testing message: child 'testing-1931908451' failed" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="node testing finished: 2023-07-31 14:02:23.456457422 +0000 UTC" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="Checking daemoned children of testing" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="TaskSet Reconciliation" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg=reconcileAgentPod namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="Updated phase Running -> Failed" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="Updated message  -> child 'testing-1931908451' failed" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="Marking workflow completed" namespace=testing workflow=testing
time="2023-07-31T14:02:23.456Z" level=info msg="Checking daemoned children of " namespace=testing workflow=testing
time="2023-07-31T14:02:23.462Z" level=info msg="cleaning up pod" action=deletePod key=testing/testing-1340600742-agent/deletePod
time="2023-07-31T14:02:23.476Z" level=info msg="Workflow update successful" namespace=testing phase=Failed resourceVersion=3653368 workflow=testing
time="2023-07-31T14:02:23.488Z" level=info msg="cleaning up pod" action=labelPodCompleted key=testing/testing-whalesay-template-1931908451/labelPodCompleted
time="2023-07-31T14:02:50.939Z" level=info msg="Processing workflow" namespace=testing workflow=testing
time="2023-07-31T14:02:50.940Z" level=info msg="Task-result reconciliation" namespace=testing numObjs=0 workflow=testing
time="2023-07-31T14:02:50.941Z" level=info msg="Pod node testing-1931908451 initialized Pending" namespace=testing workflow=testing
time="2023-07-31T14:02:50.951Z" level=info msg="Created pod: testing[0].call-whalesay-template (testing-whalesay-template-1931908451)" namespace=testing workflow=testing
time="2023-07-31T14:02:50.951Z" level=info msg="Workflow step group node testing-896689527 not yet completed" namespace=testing workflow=testing
time="2023-07-31T14:02:50.951Z" level=info msg="TaskSet Reconciliation" namespace=testing workflow=testing
time="2023-07-31T14:02:50.951Z" level=info msg=reconcileAgentPod namespace=testing workflow=testing
time="2023-07-31T14:02:50.958Z" level=info msg="Workflow update successful" namespace=testing phase=Running resourceVersion=3653418 workflow=testing
time="2023-07-31T14:03:00.953Z" level=info msg="Processing workflow" namespace=testing workflow=testing
time="2023-07-31T14:03:00.953Z" level=info msg="Task-result reconciliation" namespace=testing numObjs=0 workflow=testing
time="2023-07-31T14:03:00.954Z" level=info msg="Pod failed: Error (exit code 1)" displayName=call-whalesay-template namespace=testing pod=testing-whalesay-template-1931908451 templateName= workflow=testing
time="2023-07-31T14:03:00.954Z" level=info msg="node changed" namespace=testing new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=testing-1931908451 old.message= old.phase=Pending old.progress=0/1 workflow=testing
time="2023-07-31T14:03:00.955Z" level=info msg="Step group node testing-896689527 deemed failed: child 'testing-1931908451' failed" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="node testing-896689527 phase Running -> Failed" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="node testing-896689527 message: child 'testing-1931908451' failed" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="node testing-896689527 finished: 2023-07-31 14:03:00.960306509 +0000 UTC" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="step group testing-896689527 was unsuccessful: child 'testing-1931908451' failed" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="Outbound nodes of testing-1931908451 is [testing-1931908451]" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="Outbound nodes of testing is [testing-1931908451]" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="node testing phase Running -> Failed" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="node testing message: child 'testing-1931908451' failed" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="node testing finished: 2023-07-31 14:03:00.96065455 +0000 UTC" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="Checking daemoned children of testing" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="TaskSet Reconciliation" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg=reconcileAgentPod namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="Updated phase Running -> Failed" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="Updated message  -> child 'testing-1931908451' failed" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="Marking workflow completed" namespace=testing workflow=testing
time="2023-07-31T14:03:00.960Z" level=info msg="Checking daemoned children of " namespace=testing workflow=testing
time="2023-07-31T14:03:00.971Z" level=info msg="cleaning up pod" action=deletePod key=testing/testing-1340600742-agent/deletePod
time="2023-07-31T14:03:00.985Z" level=info msg="Workflow update successful" namespace=testing phase=Failed resourceVersion=3653450 workflow=testing
time="2023-07-31T14:03:00.994Z" level=info msg="cleaning up pod" action=labelPodCompleted key=testing/testing-whalesay-template-1931908451/labelPodCompleted

Logs from in your workflow's wait container

-
@Gerrit-K
Copy link
Author

I just tried to see if I could perhaps use workflow.scheduledTime to temporarily circumvent this (since my use-case actually uses a CronWorkflow), but it seems that this field is also affected by this bug(?).

@terrytangyuan
Copy link
Member

I think that's expected since the workflow was created previously. It never got deleted. The behavior you are expecting is for resubmitted workflows.

@Gerrit-K
Copy link
Author

@terrytangyuan I think there is a misunderstanding here. To make it clear, when retrying (not resubmitting) a workflow:

My expectation is: that the workflow.creationTimestamp does not change between retry attempts. It should always match the metadata.creationTimestamp value.

My observation is: the workflow.creationTimestamp changes with each retry, even though the workflow doesn't get recreated.

@sarabala1979
Copy link
Member

@Gerrit-K your expectation is correct it should not change for retry. Do you like to fix this issue? If yes, please let us know if you need any help

@terrytangyuan
Copy link
Member

@Gerrit-K Thanks for the clarification!

@sarabala1979
Copy link
Member

@Gerrit-K I can see the code is using metadata.creationTimestamp. Can you check WF metadata.creationTimestamp getting updated?

woc.globalParams[common.GlobalVarWorkflowCreationTimestamp] = woc.wf.ObjectMeta.CreationTimestamp.Format(time.RFC3339)

@Gerrit-K
Copy link
Author

Gerrit-K commented Aug 1, 2023

@terrytangyuan @sarabala1979 Thanks for getting back to this so quickly!

I'm not averse to contributing. However, right now I only had the capacity to report, but not to further investigate. I'll try to find some time to get back to this, setup a dev environment and investigate during the next days/weeks. But if someone, who already has everything set up, is able to quickly test this, I would definitely appreciate it! 🙏

@juliev0 juliev0 added the P3 Low priority label Aug 3, 2023
@Gerrit-K
Copy link
Author

Gerrit-K commented Aug 5, 2023

I found a bit of time today to look into this. It appears that this is a combination of issues with argo-workflows and the sprig.date() function, which I've used in my example.

In argo-workflows, the variable workflow.creationTimestamp is a string, but in the template.Replace function, it gets parsed into a complex interface/map, due to the subkeys, like workflow.creationTimestamp.RFC3339. You can see that when you compare replaceMap with the resulting env there.

When this is evaluated in expr/sprig, it means that workflow.creationTimestamp is not of type time.Time, which causes sprig to revert to a default case to just print time.now() instead.

In my particular case, there are two rather easy workarounds:

  1. Ignore the time format and simply print the timestamp as is: {{workflow.creationTimestamp}}
  2. Convert the timestamp to a date first before formatting: {{=sprig.date('2006-01-02 15:04:05', sprig.toDate('2006-01-02T15:04:05Z', workflow.creationTimestamp.RFC3339))}} (passing just workflow.creationTimestamp here without .RFC3339 doesn't work, as it's not a string)

IMO, sprig.date() should not silently fall back to now() when you pass an invalid type but return an error instead, but that's another story. However, workflow.creationTimestamp should also be a plain string instead of an interface/map. I'm not sure if this is something that should (and could) be fixed, so I'd leave it up to you. Since I now know that I have an easy workaround at hand, it's not as urgent for me, so feel free to close it if there's nothing to fix from your point of view.

@agilgur5 agilgur5 added area/templating Templating with `{{...}}` solution/workaround There's a workaround, might not be great, but exists labels Aug 10, 2023
@stale
Copy link

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the problem/stale This has not had a response in some time label Sep 17, 2023
@terrytangyuan terrytangyuan removed the problem/stale This has not had a response in some time label Sep 20, 2023
@agilgur5
Copy link
Member

agilgur5 commented Oct 4, 2023

Thanks for digging into this @Gerrit-K !

In argo-workflows, the variable workflow.creationTimestamp is a string, but in the template.Replace function, it gets parsed into a complex interface/map, due to the subkeys, like workflow.creationTimestamp.RFC3339. You can see that when you compare replaceMap with the resulting env there.

When this is evaluated in expr/sprig, it means that workflow.creationTimestamp is not of type time.Time, which causes sprig to revert to a default case to just print time.now() instead.

That is a bit confusing, no doubt, but since it does have subkeys I suppose that kind of makes sense. They are mentioned in the variables page too.

However, workflow.creationTimestamp should also be a plain string instead of an interface/map. I'm not sure if this is something that should (and could) be fixed, so I'd leave it up to you. Since I now know that I have an easy workaround at hand, it's not as urgent for me, so feel free to close it if there's nothing to fix from your point of view.

Thanks for the consideration. This would a breaking change to modify, so I think we should leave it as is, at least for now. Your workaround makes sense and it's not a common issue that I've seen, so changing it may create more confusion.

Will close this out for now as such.

IMO, sprig.date() should not silently fall back to now() when you pass an invalid type but return an error instead, but that's another story.

Yea sprig's default behavior is definitely a bit confusing. Enough so that there is an explicit warning about sprig's error handling in the Argo docs: https://argoproj.github.io/argo-workflows/variables/#expression.

See also #11608 where expr is adding some new features to support things like dates better.
Argo 3.5 has expr's new 1.14 version (#11605) which supports a few date functions like now, duration, and date for formatting.

@agilgur5 agilgur5 closed this as completed Oct 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/templating Templating with `{{...}}` P3 Low priority solution/workaround There's a workaround, might not be great, but exists type/bug
Projects
None yet
Development

No branches or pull requests

5 participants