Passing all columns to the write component when no consumes is specified #836

Closed
Tracked by #846
mrchtr opened this issue Feb 2, 2024 · 0 comments · Fixed by #859

mrchtr commented Feb 2, 2024

Background

When the consumes section in the component specification contains additionalProperties=true, the component accepts additional columns alongside the ones defined in the specification.

This is especially needed for reusable write components.

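import pyarrow as pa

# Declare the column to pass to the generic write component via its pyarrow type.
dataset.write(
    "write_to_file",
    consumes={
        "text": pa.string()
    }
)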

However, name-to-name mapping cannot be used with such a component. For example:

dataset.write(
    "write_to_file",
    consumes={
        "text_data": "text_data"
    }
)

This raises an InvalidPipelineDefinition, because no columns are specified in the component specification:

fondant.core.exceptions.InvalidPipelineDefinition: Received a string value for key text_data in the consumes argument passed to the operation, but text_data is not defined in the consumes section of the component spec.

Furthermore, if the component is used without defining consumes in the pipeline, no columns will be passed to the component.

dataset.write(
    "write_to_file"
)

On the other hand, we have implemented custom logic for lightweight Python components. This allows us to use these components without specifying consumes in the pipeline.

@lightweight_component
class WriteToAnywhere(DaskWriteComponent):
    def write(self, dataframe: dd.DataFrame) -> None:
        print(dataframe.columns)

dataset.write(
    WriteToAnywhere
)

Mapping names from the global dataset to the component's dataset does not work here either.

Goal

Ideally, we should permit all three of the following options when additionalProperties=true:

(1) Do not define any consumes in the pipeline interface. This will pass all available columns to the component.

(2) Define column mapping (e.g. "text_data": "text"), which maps from the global dataset to the component dataset.

(3) Define which columns to read from the global dataset using pyarrow types, such as "text_data": pa.string().

Approach

For lightweight components, we already complete the component specification in pipeline.py using the global schema of the dataset.

A similar approach could be adopted for containerized components, as all the necessary information can be obtained from the component yaml and the schema of the global dataset.

We would have consistent implementation for both lightweight and containerized components at the same location in the code, and the ability to validate and catch potential errors during pipeline initialization.

This enables (1) for containerized components. We can achieve (2) for both component types by using the global dataset schema to validate the mapping and complete the ComponentSpec.

(3) already works, and we could add extra validation.
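The three options could be resolved against the global dataset schema roughly as follows. This is a minimal sketch, not Fondant's implementation: `ColumnType` is a hypothetical stand-in for a pyarrow type such as `pa.string()`, `resolve_consumes` is a hypothetical helper, and the mapping direction (keys are dataset column names, values are component column names) is an assumption based on the wording of option (2) above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class ColumnType:
    """Hypothetical stand-in for a pyarrow data type such as pa.string()."""

    name: str


def resolve_consumes(
    consumes: Optional[Dict[str, Union[str, ColumnType]]],
    dataset_schema: Dict[str, ColumnType],
) -> Dict[str, str]:
    """Return {dataset_column: component_column} for a generic component.

    Assumes the component spec sets additionalProperties to true.
    """
    # Option (1): no consumes given, so pass every dataset column unchanged.
    if consumes is None:
        return {column: column for column in dataset_schema}

    resolved: Dict[str, str] = {}
    for dataset_column, value in consumes.items():
        if dataset_column not in dataset_schema:
            raise ValueError(
                f"Column {dataset_column!r} is not in the dataset schema"
            )
        if isinstance(value, str):
            # Option (2): a string value renames the column for the component.
            resolved[dataset_column] = value
        else:
            # Option (3): a type value selects the column; check it matches.
            expected = dataset_schema[dataset_column]
            if value != expected:
                raise ValueError(
                    f"Column {dataset_column!r} has type {expected.name}, "
                    f"not {value.name}"
                )
            resolved[dataset_column] = dataset_column
    return resolved
```

Running this check during pipeline initialization would surface an unknown column or a type mismatch before any component executes, which is the validation benefit described above.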

Alternatively, if the consumes is None and additionalProperties is set to true, we could directly load all columns of the dataset in the data_io.

@mrchtr mrchtr converted this from a draft issue Feb 2, 2024
@mrchtr mrchtr moved this from Backlog to In Progress in Fondant development Feb 15, 2024
@mrchtr mrchtr moved this from In Progress to Validation in Fondant development Feb 20, 2024
RobbeSneyders pushed a commit that referenced this issue Feb 20, 2024
Basic implementation, I still have to add tests. I wanted to get some
feedback first.

- Added a method to the lightweight components to generate a
`ComponentSpec` based on the attributes.
- Added a method in the pipeline to infer the consumption based on the
`ComponentSpec`.
In cases where a user hasn't specified a `consume` in the pipeline
operations, we now infer this. If a component spec contains a `consumes`
section and `additionalProperties` are set to true, we load all columns.
If `additionalProperties` is set to false, we limit the columns to those defined
in the component spec.
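
The inference rule described in this commit message can be sketched as below. This is an illustration under stated assumptions rather than the code from the commit: `infer_consumes` and its parameters are hypothetical names, and column types are represented as plain strings instead of pyarrow types.

```python
from typing import Dict, Optional


def infer_consumes(
    spec_consumes: Dict[str, str],
    additional_properties: bool,
    dataset_schema: Dict[str, str],
    user_consumes: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Infer which dataset columns an operation should consume."""
    # An explicit consumes argument from the pipeline always takes precedence.
    if user_consumes is not None:
        return dict(user_consumes)

    # additionalProperties is true: load all columns of the dataset.
    if additional_properties:
        return dict(dataset_schema)

    # additionalProperties is false: limit to the columns in the component spec.
    missing = [name for name in spec_consumes if name not in dataset_schema]
    if missing:
        raise ValueError(f"Columns missing from the dataset: {missing}")
    return {name: dataset_schema[name] for name in spec_consumes}
```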

Fix #836
@github-project-automation github-project-automation bot moved this from Validation to Done in Fondant development Feb 20, 2024