Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement caching workflow #325

Closed
wants to merge 22 commits into from
Closed

Conversation

PhilippeMoussalli
Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli commented Jul 31, 2023

Related to #317
First Review PRs #318 #320
Design discussions #292

PR that implements the general component caching mechanism described in
255908735-84c4af6b-678e-4d3e-8078-3455bb9bf114

Few notes:

  • There is no need to specify previous run-id when caching since we're checking for existing executions of similar definition based on the manifest and the cache key
  • Created a Metadata class since the metadata of the manifest had different schema between both the local and remote runner
  • If a component is cached, it skips the execution entirely and just fetches the existing output manifest (based on the matching cache key). For kubeflow, we still have to write that manifest as an output artifact that will be read by the next component (artifact) but we don't write anything to the base oath (custom_artifact)
  • Currently implemented only for the remote runner (kfp) and enabled by default, can be disabled by setting Pipeline(cache_disabled = True). Disabled for remote runner since there is no straightforward method of estimating digests in that workflow (see Add method to estimate caching key #318)
    *If one component is not cached, then all subsequent components must be re-run/executed

Things to improve in later PRs:

  1. Decide whether we should still estimating the cache during compilation or execution (right now it's done at compilation time to enable defining new arguments but we might miss the fact that some image digests are updated):

A) Compilation

  • Right now, the component has to be run and produce the output manifest but we already know beforehand if this component should be executed or not. Ideally we would omit execution of component altogether to make pipelines run faster and avoid pod initialization waiting time. This can be done by either:
    • Implementing conditional statement in KFP link.
    • Assigning the default node pool to the cached component in case specified otherwise (avoid initialization of expensive nodes since the component just has to write the manifest)

B) Runtime

  • Find a method to estimate the digest from within the component itself (see discussion at Add method to estimate caching key #318), this solution will require all component to be run since the component itself has to decide whether it can be cached or not when it runs
  • Alternatively we can have one custom component at the beginning of each pipeline runner (initializer component), that decides which components should be run. The advantage here is that we only include all the required packages for estimating digests in one docker images and not all of them. See this for more details on how image digests can be retrieved.
  1. Implement the caching for the local runner

@PhilippeMoussalli PhilippeMoussalli changed the base branch from main to skip-component-run-argument August 18, 2023 08:56
RobbeSneyders pushed a commit that referenced this pull request Aug 22, 2023
PR that creates a metadata class, this will make it easier to implement
#368 (was originally part of #325 but decided to break it down to make
it easier to review).

Few other notable changes: 

- The `run_id` between both runners has now an identical format
(name_timestamp), we no longer need the uid of kfp since it's just used
to store the native output artifacts
- The `safe_component_name` has been moved from the local runner to the
component spec to avoid having to plug it everywhere

---------

Co-authored-by: Georges Lorré <35808396+GeorgesLorre@users.noreply.github.com>
Base automatically changed from skip-component-run-argument to main August 24, 2023 08:05
@PhilippeMoussalli
Copy link
Contributor Author

Closed in favor of #387

Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
PR that creates a metadata class, this will make it easier to implement
#368 (was originally part of #325 but decided to break it down to make
it easier to review).

Few other notable changes: 

- The `run_id` between both runners has now an identical format
(name_timestamp), we no longer need the uid of kfp since it's just used
to store the native output artifacts
- The `safe_component_name` has been moved from the local runner to the
component spec to avoid having to plug it everywhere

---------

Co-authored-by: Georges Lorré <35808396+GeorgesLorre@users.noreply.github.com>
@RobbeSneyders RobbeSneyders deleted the implement-caching-workflow branch January 11, 2024 09:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integrate cache key with remote runner workflow
1 participant