Distinguish between different VDDK validation errors #969

Merged on Oct 8, 2024 (6 commits)

Commits on Sep 19, 2024

  1. Don't pass around the vddk image url unless necessary

    Several functions accept a vddk image argument even though the vddk
    image can be retrieved directly from the plan.
    
    Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
    jonner committed Sep 19, 2024
    8962235
  2. factor out the code for validating the vddk image validation Job

    Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
    jonner committed Sep 19, 2024
    e0d014d
  3. vddk validation: return errors if providers aren't set

    This code previously returned nil if the source and destination
    providers were not set for the plan when validating the vddk image, but
    it seems to make more sense to return an error instead.
    
    Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
    jonner committed Sep 19, 2024
    4800b77
  4. Don't pass labels to createVddkCheckJob()

    Rather than passing the labels to the function, just query them using
    the utility function.
    
    Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
    jonner committed Sep 19, 2024
    9dff5ca
  5. vddk validation: don't restart validator pod on failure

    If the vddk validator pod fails, we don't need to keep retrying. The
    container simply checks for the existence of a file, so restarting
    the pod is unlikely to change anything.
    
    In addition, by specifying `Never` for the restart policy, the completed
    pod should be retained for examination after the job fails, which can be
    helpful for determining the cause of failure.
    
    Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
    jonner committed Sep 19, 2024
    f2a5843

Commits on Sep 24, 2024

  1. Distinguish between different VDDK validation errors

    There are multiple cases that can lead to a "VDDK Init image is invalid"
    error message for a migration plan. They are currently handled with a
    single VDDKInvalid condition. One of the most common is when the vddk
    image cannot be pulled (either due to network issues or due to the user
    typing an incorrect image URL). Categorizing this type of error as an
    "invalid VDDK image" is confusing to the user.
    
    When the initContainer cannot pull the VDDK init image, the
    vddk-validator-* pod has something like the following status:
      initContainerStatuses:
        - name: vddk-side-car
          state:
            waiting:
              reason: ErrImagePull
              message: 'reading manifest 8.0.3.14 in default-route-openshift-image-registry.apps-crc.testing/openshift/vddk: manifest unknown'
          lastState: {}
          ready: false
          restartCount: 0
          image: 'default-route-openshift-image-registry.apps-crc.testing/openshift/vddk:8.0.3.14'
          imageID: ''
          started: false
    
    We can use the existence of the 'waiting' state in the init container
    status of the pod to indicate that the image cannot be pulled.
    
    Unfortunately, the validation job's pods are deleted when the job
    fails due to a failure to pull the image. Because of this, once the
    deadline has passed there's no way to examine the pod status to see
    why the failure occurred.
    
    So this patch removes the deadline from the validation job, which
    requires overhauling the validation logic slightly. We add a new
    advisory condition `VDDKInitImageNotReady` to indicate that we are
    still waiting to pull the VDDK init image, and a new critical condition
    `VDDKInitImageUnavailable` to indicate that the condition has persisted
    for longer than the active deadline setting.
    
    Since the job will now retry pulling the vddk image indefinitely (due
    to the removal of the job deadline), we need to make sure that orphaned
    jobs don't run forever. So when the vddk image for a plan changes, we
    need to cancel all active validation jobs that are still running for the
    old vddk image.
    
    This overall approach has several advantages:
     - The user gets an early indication (via `VDDKInitImageNotReady`) that
       the image can't be pulled
     - The validation will automatically complete when any network
       interruption is resolved, without needing to delete and re-create the
       plan to start a new validation
     - The validation will no longer report a VDDKInvalid error when the
       image pull is very slow due to network issues because there is no
       longer a deadline for the job.
    
    Resolves: https://issues.redhat.com/browse/MTV-1150
    
    Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
    jonner committed Sep 24, 2024
    56de205