Move to datasets & apply interface #685

Merged: 10 commits, Dec 7, 2023

RobbeSneyders
Member

This is a feature branch to collect all changes related to the datasets & apply interface until we have a state we can merge.

RobbeSneyders and others added 9 commits December 7, 2023 01:40
This PR is the first one of multiple PRs to replace #665. This PR only
focuses on implementing the new pipeline interface, without adding any
new functionality.

The new interface applies operations to intermediate datasets instead of
adding operations to a pipeline, as shown below. It's a superficial
change, since only the interface is changed. All underlying behavior is
still the same.

The new interface fits nicely with our data format design, and we'll be
able to leverage it for interactive development in the future. We can
calculate the schema for each intermediate dataset so the user can
inspect it. With eager execution, we could also execute a single
operation and let the user explore the resulting data through the dataset.
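
To illustrate why the change is only superficial, here is a minimal toy sketch of the idea. It is not Fondant's implementation: `ToyPipeline`, `ToyDataset`, and the `produces` arguments are hypothetical names made up for this example. Each call on a dataset registers an operation with a dependency on the previous one, which is the same graph the explicit `add_op(..., dependencies=[...])` calls build, and every intermediate dataset carries a schema that can be inspected.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyPipeline:
    """Hypothetical stand-in that only records operations and dependencies."""

    name: str
    graph: dict[str, list[str]] = field(default_factory=dict)

    def read(self, op_name: str, produces: dict[str, str]) -> ToyDataset:
        # A read operation has no dependencies and defines the initial schema.
        self.graph[op_name] = []
        return ToyDataset(self, op_name, dict(produces))


@dataclass
class ToyDataset:
    """An intermediate dataset: the operation that produced it and its schema."""

    pipeline: ToyPipeline
    last_op: str
    schema: dict[str, str]

    def apply(self, op_name: str, produces: dict[str, str] | None = None) -> ToyDataset:
        # Register the operation with a dependency on the previous operation.
        self.pipeline.graph[op_name] = [self.last_op]
        # The new schema is the old schema extended with the produced fields.
        return ToyDataset(self.pipeline, op_name, {**self.schema, **(produces or {})})

    def write(self, op_name: str) -> None:
        # A write operation ends the chain and returns no dataset.
        self.pipeline.graph[op_name] = [self.last_op]


pipeline = ToyPipeline("my_pipeline")
images = pipeline.read("load_data", produces={"image": "binary"})
captions = images.apply("caption_images", produces={"caption": "string"})
embeddings = captions.apply("embed_text", produces={"embedding": "array"})
embeddings.write("write_to_hf_hub")

print(captions.schema)  # schema of an intermediate dataset
print(pipeline.graph)   # same dependency chain as the explicit add_op calls
```

Keeping a reference to each intermediate dataset (`images`, `captions`, `embeddings`) is what would make schema inspection or eager execution of a single step possible.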

I still need to update the README generation, but I'll do that as a
separate PR. It becomes a bit more complex since we now need to
discriminate between read, transform, and write components to generate
the example code.
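
As a rough sketch of what discriminating on the component type could look like, the snippet below picks a usage template per type. The `usage_example` helper and the `"read"` / `"transform"` / `"write"` type labels are assumptions made for this illustration; how the actual README generation determines the type is not shown here.

```python
# Hypothetical templates, one per component type. The `{{...}}` renders as the
# literal `{...}` placeholder for the component arguments.
USAGE_TEMPLATES = {
    "read": 'dataset = pipeline.read(\n    "{name}",\n    arguments={{...}},\n)',
    "transform": 'dataset = dataset.apply(\n    "{name}",\n    arguments={{...}},\n)',
    "write": 'dataset.write(\n    "{name}",\n    arguments={{...}},\n)',
}


def usage_example(name: str, component_type: str) -> str:
    """Return example code for a component, based on its (assumed) type label."""
    try:
        template = USAGE_TEMPLATES[component_type]
    except KeyError:
        raise ValueError(f"Unknown component type: {component_type!r}") from None
    return template.format(name=name)


print(usage_example("load_data", "read"))
print(usage_example("caption_images", "transform"))
print(usage_example("write_to_hf_hub", "write"))
```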

**Old interface**
```Python
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

load_op = ComponentOp(
    component_dir="load_data",
    arguments={...},
)

caption_op = ComponentOp.from_registry(
    name="caption_images",
    arguments={...},
)

embed_op = ComponentOp(
    component_dir="embed_text",
    arguments={...},
)

write_op = ComponentOp.from_registry(
    name="write_to_hf_hub",
    arguments={...},
)

pipeline.add_op(load_op)
pipeline.add_op(caption_op, dependencies=[load_op])
pipeline.add_op(embed_op, dependencies=[caption_op])
pipeline.add_op(write_op, dependencies=[embed_op])
```

**New interface**
```Python
from fondant.pipeline import Pipeline

pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

dataset = pipeline.read(
    "load_data",
    arguments={...},
)
dataset = dataset.apply(
    "caption_images",
    arguments={...},
)
dataset = dataset.apply(
    "embed_text",
    arguments={...},
)
dataset.write(
    "write_to_hf_hub",
    arguments={...},
)
```

This PR updates the component generation to take into account the
component type and generate the appropriate usage example.

This PR introduces functionality for the new pipeline interface, as
discussed [here](#567 (comment)).

* The component spec now accepts **OneOf** additionalFields or Fields in
its consumes and produces sections.
* The new `consumes` and `produces` are defined at the Op level,
similarly to the ones in the component spec. If they are present, they
override the default `consumes` and `produces` defined in the
component spec (manifest, dataIO).
* Some changes were made to `DataIO` just to resolve test issues, but
the new functionality of the custom consumes and produces is not
implemented yet (it will be tackled in a separate PR).
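
The override rule in the second bullet can be sketched as follows. This is only an illustration under an assumption: it treats an Op-level section as replacing the spec-level section wholesale, which may differ from the actual semantics. `resolve_fields` and the field names are hypothetical.

```python
from typing import Optional


def resolve_fields(
    spec_fields: dict[str, str],
    op_fields: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Return the Op-level fields when given, else the component spec defaults.

    Assumption: an Op-level section replaces the spec section as a whole.
    """
    if op_fields is None:
        return dict(spec_fields)
    return dict(op_fields)


spec_consumes = {"image": "binary"}  # default from the component spec

default_consumes = resolve_fields(spec_consumes)
custom_consumes = resolve_fields(spec_consumes, {"picture": "binary"})

print(default_consumes)
print(custom_consumes)
```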

---------

Co-authored-by: Robbe Sneyders <robbe.sneyders@ml6.eu>
Follows #695
This PR adds the mapping functionality to the `dataIO` module.

---------

Co-authored-by: Robbe Sneyders <robbe.sneyders@ml6.eu>
@RobbeSneyders RobbeSneyders marked this pull request as ready for review December 7, 2023 01:01
@RobbeSneyders RobbeSneyders merged commit 6cd987e into main Dec 7, 2023
5 of 6 checks passed
@RobbeSneyders RobbeSneyders deleted the feature/dataset-apply-interface branch December 7, 2023 01:05
Successfully merging this pull request may close these issues.

Implement dataset & apply pipeline interface