Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix caching writes #469

Merged
merged 11 commits into from
Oct 2, 2023
Merged

Fix caching writes #469

merged 11 commits into from
Oct 2, 2023

Conversation

PhilippeMoussalli
Copy link
Contributor

PR that fixes issues related to caching subsequent components.

Since cached manifests contain the same run id as the pipeline where they originated from, they can sometimes override existing data if the component needs to run again (e.g. cached first component loaded into second component which has different arguments and thus needs to execute). This PR makes sure that:

  • The run-id of the loaded input manifest matches the one from the current pipeline run, this ensures that all new subsets from the component being executed are written under the corresponding path of that pipeline run
  • The run-id of the output manifest is that of the cached manifest (if it applies). This ensures that subsequent components are able to tell whether the previous component execution was cached or not link to code

@RobbeSneyders
Copy link
Member

So If I'm understanding correctly, the logic should be as follows:

  • If the component is cached, it outputs the cached manifest with the run_id of the previous run
  • If the component is not cached, it outputs a new manifest with the run_id of the current run
  • A component checks if the previous component was cached by comparing the run_id in the manifest it receives with the run_id of the current run (from the metadata).

But the problem is that the component propagates the run_id from the input manifest to the output manifest.

If that is correct, can we change it where it's propagated instead? So in either:

I believe a change there would be simpler and easier to understand.

@RobbeSneyders
Copy link
Member

I think it can still be further simplified. Just pass the run_id to .evolve() and set it in the metadata at the start. That run_id is then used for the paths automatically as well, and then we don't need to update the metadata separately.

@PhilippeMoussalli
Copy link
Contributor Author

I think it can still be further simplified. Just pass the run_id to .evolve() and set it in the metadata at the start. That run_id is then used for the paths automatically as well, and then we don't need to update the metadata separately.

Thanks for the feedback Robbe! I implemented the suggestion. Please double check to see if that's what you were implying.

Also added other small fixes mainly related to propogating the cache arguments from the op to the pipeline spec

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PhilippeMoussalli!

Almost there 🙂 I would make the argument optional and rename it to just run_id. See my suggestions.


Args:
component_spec: the component spec
write_run_id: the run id used to define the write subset path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
write_run_id: the run id used to define the write subset path.
run_id: the run id to include in the evolved manifest. If no run id is provided, the run id from the original manifest is propagated.

@@ -257,21 +257,27 @@ def remove_subset(self, name: str) -> None:
def evolve( # noqa : PLR0912 (too many branches)
self,
component_spec: ComponentSpec,
write_run_id: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
write_run_id: str,
*,
run_id: t.Optional[str],

"""
evolved_manifest = self.copy()

# Update `component_id` of the metadata
component_id = component_spec.name.lower().replace(" ", "_")
evolved_manifest.update_metadata(key="run_id", value=write_run_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
evolved_manifest.update_metadata(key="run_id", value=write_run_id)
if run_id is not None:
evolved_manifest.update_metadata(key="run_id", value=write_run_id)

evolved_manifest.update_metadata(key="component_id", value=component_id)

# Update index location as this is currently always rewritten
evolved_manifest.index._specification[
"location"
] = f"/{self.pipeline_name}/{self.run_id}/{component_id}/index"
] = f"/{self.pipeline_name}/{write_run_id}/{component_id}/index"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
] = f"/{self.pipeline_name}/{write_run_id}/{component_id}/index"
] = f"/{self.pipeline_name}/{evolved_manifest.run_id}/{component_id}/index"

@@ -322,7 +328,7 @@ def evolve( # noqa : PLR0912 (too many branches)
# Update subset location as this is currently always rewritten
evolved_manifest.subsets[subset_name]._specification[
"location"
] = f"/{self.pipeline_name}/{self.run_id}/{component_id}/{subset_name}"
] = f"/{self.pipeline_name}/{write_run_id}/{component_id}/{subset_name}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
] = f"/{self.pipeline_name}/{write_run_id}/{component_id}/{subset_name}"
] = f"/{self.pipeline_name}/{evolved_manifest.run_id}/{component_id}/{subset_name}"

@@ -444,7 +444,7 @@ def _validate_pipeline_definition(self, run_id: str):
raise InvalidPipelineDefinition(
msg,
)
manifest = manifest.evolve(component_spec)
manifest = manifest.evolve(component_spec, write_run_id=run_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
manifest = manifest.evolve(component_spec, write_run_id=run_id)
manifest = manifest.evolve(component_spec, run_id=run_id)

evolved_manifest = manifest.evolve(component_spec=component_spec)
evolved_manifest = manifest.evolve(
component_spec=component_spec,
write_run_id=manifest.run_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update this test to provide a different run_id and validate that it's set correctly?

evolved_manifest = manifest.evolve(component_spec=component_spec)
evolved_manifest = manifest.evolve(
component_spec=component_spec,
write_run_id=manifest.run_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change can be reverted.

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RobbeSneyders RobbeSneyders merged commit 645ab7a into main Oct 2, 2023
5 checks passed
@RobbeSneyders RobbeSneyders deleted the improve-caching-writing branch October 2, 2023 13:25
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
PR that fixes issues related to caching subsequent components. 

Since cached manifests contain the same run id as the pipeline where
they originated from, they can sometimes override existing data if the
component needs to run again (e.g. cached first component loaded into
second component which has different arguments and thus needs to
execute). This PR makes sure that:

* The run-id of the loaded **input manifest** matches the one from the
**current pipeline run**, this ensures that all new subsets from the
component being executed are written under the corresponding path of
that pipeline run
* The run-id of the **output manifest** is that of the **cached
manifest** (if it applies). This ensures that subsequent components are
able to tell whether the previous component execution was cached or not
[link to
code](https://github.com/ml6team/fondant/blob/39436e7faaa3e5acdd5eb347ce965ad33d669f73/src/fondant/executor.py#L294)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants