Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Problem with long running phase #1622

Open
blafry opened this issue Sep 5, 2022 · 18 comments · May be fixed by #2972
Open

[BUG] Problem with long running phase #1622

blafry opened this issue Sep 5, 2022 · 18 comments · May be fixed by #2972

Comments

@blafry
Copy link

blafry commented Sep 5, 2022

Describe the bug

When phase takes about 6 hours, the operator (0.81.0) does not set the snapshot_id in the ActionSet despite using kando for this purpose.

To Reproduce
Steps to reproduce the behavior:

Blueprint:

  backup:
    kind: Deployment
    outputArtifacts:
      mysql_backup:
        keyValue:
          snapshot_id: '{{ .Phases.backupMySQL.Output.snapshot_id }}'
    phases:
      - args:
          command:
            - bash
            - '-o'
            - errexit
            - '-o'
            - pipefail
            - '-o'
            - errtrace
            - '-c'
            - |
              sleep 20000
              kando output snapshot_id 6c1b0940
          namespace: mysql
        func: KubeTask
        name: backupMySQL

ActionSet describe:

Name:         backup-mysql-2022-09-04-00-00
Namespace:    mysql
Labels:       <none>
Annotations:  <none>
API Version:  cr.kanister.io/v1alpha1
Kind:         ActionSet

(...)

Spec:
  Actions:
    Blueprint:  service-backups-mysql
    Name:       backup
    Object:
      API Version:      
      Group:            
      Kind:             Deployment
      Name:             platform-operator-mysql
      Namespace:        mysql
      Resource:         
    Preferred Version:  
Status:
  Actions:
    Artifacts:
      mysql_backup:
        Key Value:
          snapshot_id:  {{ .Phases.backupMySQL.Output.snapshot_id }}
    Blueprint:          service-backups-mysql
    Name:               backup
    Object:
      API Version:  
      Group:        
      Kind:         Deployment
      Name:         platform-operator-mysql
      Namespace:    mysql
      Resource:     
    Phases:
      Name:   backupMySQL
      State:  complete
  Error:
    Message:  Failed to render template: "snapshot_id" not found
  State:      failed
Events:       <none>

Operator logs:

2022-09-04T00:00:03.815136760Z {"ActionSet":"backup-mysql-2022-09-04-00-00","File":"pkg/controller/controller.go","Function":"github.com/kanisterio/kanister/pkg/controller.(*Controller).logAndSuccessEvent","Line":643,"hostname":"service-operator-kanister-764b48bbb-2m6cx","level":"info","msg":"Executing action backup","time":"2022-09-04T00:00:03.813944563Z"}
2022-09-04T00:00:03.977164547Z {"ActionSet":"backup-mysql-2022-09-04-00-00","File":"pkg/controller/controller.go","Function":"github.com/kanisterio/kanister/pkg/controller.(*Controller).handleActionSet","Line":396,"NewActionSetName":"backup-mysql-2022-09-04-00-00","hostname":"service-operator-kanister-764b48bbb-2m6cx","level":"info","msg":"Created actionset and started executing actions","time":"2022-09-04T00:00:03.974906861Z"}
2022-09-04T00:00:03.977216450Z {"ActionSet":"backup-mysql-2022-09-04-00-00","File":"pkg/controller/controller.go","Function":"github.com/kanisterio/kanister/pkg/controller.(*Controller).logAndSuccessEvent","Line":643,"Phase":"backupMySQL","hostname":"service-operator-kanister-764b48bbb-2m6cx","level":"info","msg":"Executing phase backupMySQL","time":"2022-09-04T00:00:03.975206032Z"}
2022-09-04T00:00:03.981810552Z {"ActionSet":"backup-mysql-2022-09-04-00-00","File":"pkg/controller/controller.go","Function":"github.com/kanisterio/kanister/pkg/controller.(*Controller).onUpdateActionSet","Line":238,"Status":"pending","hostname":"service-operator-kanister-764b48bbb-2m6cx","level":"info","msg":"Updated ActionSet","time":"2022-09-04T00:00:03.97779321Z"}
2022-09-04T00:00:03.981852784Z {"ActionSet":"backup-mysql-2022-09-04-00-00","File":"pkg/controller/controller.go","Function":"github.com/kanisterio/kanister/pkg/controller.(*Controller).onUpdateActionSet","Line":245,"Phase":"backupMySQL-\u003epending","Status":"running","hostname":"service-operator-kanister-764b48bbb-2m6cx","level":"info","msg":"Updated ActionSet","time":"2022-09-04T00:00:03.977920245Z"}
2022-09-04T05:33:30.055810444Z {"ActionSet":"backup-mysql-2022-09-04-00-00","File":"pkg/controller/controller.go","Function":"github.com/kanisterio/kanister/pkg/controller.(*Controller).logAndSuccessEvent","Line":643,"Phase":"backupMySQL","hostname":"service-operator-kanister-764b48bbb-2m6cx","level":"info","msg":"Completed phase backupMySQL","time":"2022-09-04T05:33:30.055486706Z"}
2022-09-04T05:33:30.072749945Z {"ActionSet":"backup-mysql-2022-09-04-00-00","File":"pkg/controller/controller.go","Function":"github.com/kanisterio/kanister/pkg/controller.(*Controller).onUpdateActionSet","Line":238,"Status":"failed","hostname":"service-operator-kanister-764b48bbb-2m6cx","level":"info","msg":"Updated ActionSet","time":"2022-09-04T05:33:30.072434534Z"}
2022-09-04T05:33:30.072820757Z {"ActionSet":"backup-mysql-2022-09-04-00-00","File":"pkg/controller/controller.go","Function":"github.com/kanisterio/kanister/pkg/controller.(*Controller).logAndErrorEvent","Line":626,"Phase":"backupMySQL","error":"Failed to render template: \"snapshot_id\" not found","hostname":"service-operator-kanister-764b48bbb-2m6cx","level":"info","msg":"Failed to render output artifacts","time":"2022-09-04T05:33:30.072552115Z"}

Expected behavior
An artifact has been set, key: snapshot_id value: 6c1b0940

Environment
Kubernetes Version/Provider: 1.22
Cluster Size (#nodes): 1

Additional context
Instead of sleep 20000 in my blueprint there is a mysqldump which output is streamed to the restic repository. Due to the size of the database and limited bandwidth, this is how long it takes. In the case of a shorter time, I was unable to reproduce the problem.

@blafry blafry added the bug label Sep 5, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Sep 5, 2022

Thanks for opening this issue 👍. The team will review it shortly.

If this is a bug report, make sure to include clear instructions how on to reproduce the problem with minimal reproducible examples, where possible. If this is a security report, please review our security policy as outlined in SECURITY.md.

If you haven't already, please take a moment to review our project's Code of Conduct document.

@github-actions github-actions bot added the triage label Sep 5, 2022
@infraq infraq added this to To Be Triaged in Kanister Sep 5, 2022
@ihcsim
Copy link
Contributor

ihcsim commented Sep 29, 2022

@blafry thanks for reporting this issue. We are still trying to reproduce this; will keep you updated once we make progress. Meanwhile, feel free to join Slack, if you have other questions.

@blafry
Copy link
Author

blafry commented Nov 7, 2022

@ihcsim hi, did you reproduce the problem?

@viveksinghggits
Copy link
Contributor

Hi @blafry,
We were able to reproduce the problem, and also decided who is going to look into this soon. I will update the progress here.

@pavannd1 pavannd1 removed the triage label Nov 7, 2022
@pavannd1
Copy link
Contributor

pavannd1 commented Nov 9, 2022

@akankshakumari393 Please take this one

@akankshakumari393
Copy link
Contributor

We have started working on the issue, but unfortunately we don't have the resolution yet. I will keep updating the progress here.

@github-actions
Copy link
Contributor

This issue is marked as stale due to inactivity. Add a new comment to reactivate it.

@github-actions github-actions bot added the stale label Jan 28, 2023
@blafry
Copy link
Author

blafry commented Jan 30, 2023

@akankshakumari393 hello, any progress?

@akankshakumari393
Copy link
Contributor

We were able to reproduce the issue, it happens only in cases where the action keeps running for more than 5 hrs. We are still looking into it.

@github-actions
Copy link
Contributor

github-actions bot commented May 1, 2023

This issue is marked as stale due to inactivity. Add a new comment to reactivate it.

@github-actions github-actions bot added the stale label May 1, 2023
@pavannd1 pavannd1 added frozen and removed stale labels May 24, 2023
@maksym-herbert
Copy link

Any updates?

1 similar comment
@DCkQ6
Copy link

DCkQ6 commented Mar 21, 2024

Any updates?

@viveksinghggits
Copy link
Contributor

viveksinghggits commented Mar 21, 2024

Hi @DCkQ6 ,
We will have to re-prioritise this issue. Can you please share how long your phase is expected to take to get completed successfully?

@DCkQ6
Copy link

DCkQ6 commented Mar 21, 2024

In the worst case backup of our problematic huge database takes about 7h

@viveksinghggits
Copy link
Contributor

In the worst case backup of our problematic huge database takes about 7h

Thank you, I will make sure that I bring this up in coming community call.

@blafry
Copy link
Author

blafry commented May 14, 2024

Hi @viveksinghggits! Any updates after community call?

@viveksinghggits
Copy link
Contributor

Hi @blafry ,
Yes, we did discuss this and decided that we should accept it and work on this, that's why the accepted label was added. The problem is we have not prioritised this right now. Do you have some familiarity with Go, would you like to take a stab at it? I would be more than happy to help you with any questions you have.

@mlavi
Copy link
Contributor

mlavi commented May 23, 2024

As discussed in today's community meeting, this also came up recently in our Slack channel with a proposed solution. @e-sumin can you take a look?

@DCkQ6 DCkQ6 linked a pull request Jul 3, 2024 that will close this issue
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Kanister
To Be Triaged
Development

Successfully merging a pull request may close this issue.

9 participants