Passing all columns to the write component when no consumes is specified #836

mrchtr · 2024-02-02T08:37:46Z

Background

When the consumes section in the component specification contains additionalProperties=true, additional columns can be passed to the component alongside the ones it defines.

This is especially needed for reusable write components.

dataset.write(
    "write_to_file",
    consumes={
        "text": pa.string()
    }
)

However, name-to-name mapping cannot be used in this case. For example:

dataset.write(
    "write_to_file",
    consumes={
        "text_data": "text_data"
    }
)

This raises an InvalidPipelineDefinition, because no columns are specified in the component specification:

fondant.core.exceptions.InvalidPipelineDefinition: Received a string value for key text_data in the consumes argument passed to the operation, but text_data is not defined in the consumes section of the component spec.

Furthermore, if the component is used without defining consumes in the pipeline, no columns will be passed to the component.

dataset.write(
    "write_to_file"
)

On the other hand, we have implemented custom logic for lightweight Python components. This allows us to use these components without specifying consumes in the pipeline.

@lightweight_component
class WriteToAnywhere(DaskWriteComponent):
    def write(self, dataframe: dd.DataFrame) -> None:
        print(dataframe.columns)

dataset.write(
    WriteToAnywhere
)

Mapping the names from the global dataset to the component's dataset does not work either.

Goal

Ideally, we should permit all three of the following options when additionalProperties=true:

(1) Do not define any consumes in the pipeline interface. This will pass all available columns to the component.

(2) Define column mapping (e.g. "text_data": "text"), which maps from the global dataset to the component dataset.

(3) Define which columns to read from the global dataset using pyarrow types, such as "text_data": pa.string().
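The three options can be sketched as a single resolution step. This is only an illustration, not Fondant's implementation: `ColumnType` is a hypothetical stand-in for a pyarrow type such as pa.string(), `resolve_consumes` is a made-up helper name, and the key/value direction (dataset column as key, component column or type as value) is assumed from the examples above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class ColumnType:
    """Hypothetical stand-in for a pyarrow data type such as pa.string()."""

    name: str


def resolve_consumes(
    dataset_schema: Dict[str, ColumnType],
    consumes: Optional[Dict[str, Union[str, ColumnType]]] = None,
) -> Dict[str, Tuple[str, ColumnType]]:
    """Map each component column to (dataset column, type).

    Assumes the component spec sets additionalProperties=true.
    """
    # Option (1): nothing specified, so every dataset column is passed through.
    if consumes is None:
        return {name: (name, type_) for name, type_ in dataset_schema.items()}

    resolved: Dict[str, Tuple[str, ColumnType]] = {}
    for dataset_column, value in consumes.items():
        if dataset_column not in dataset_schema:
            raise ValueError(f"Column '{dataset_column}' is not in the dataset schema")
        if isinstance(value, str):
            # Option (2): rename; the type is looked up in the global dataset schema.
            resolved[value] = (dataset_column, dataset_schema[dataset_column])
        else:
            # Option (3): explicit type; the column keeps its dataset name.
            resolved[dataset_column] = (dataset_column, value)
    return resolved


schema = {"text_data": ColumnType("string"), "score": ColumnType("float32")}
all_columns = resolve_consumes(schema)
renamed = resolve_consumes(schema, {"text_data": "text"})
typed = resolve_consumes(schema, {"text_data": ColumnType("string")})
```

Resolving against the global schema up front means an unknown column name fails at pipeline definition time rather than when the component runs.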

Approach

For lightweight components, we already complete the component specification in pipeline.py, using the global schema of the dataset.

A similar approach could be adopted for containerized components, as all the necessary information can be obtained from the component yaml and the schema of the global dataset.

This would give us a consistent implementation for both lightweight and containerized components at the same location in the code, and the ability to validate and catch potential errors during pipeline initialization.

This enables (1) for containerized components. We can achieve (2) for both component types by using the global dataset schema to validate the mapping and complete the ComponentSpec.

(3) already works, and we could add extra validation.

Alternatively, if consumes is None and additionalProperties is set to true, we could directly load all columns of the dataset in the data_io.


Basic implementation. I still have to add tests, but I wanted to get some feedback first.

- Added a method to the lightweight components to generate a `ComponentSpec` based on the attributes.
- Added a method in the pipeline to infer `consumes` based on the `ComponentSpec`.

In cases where a user hasn't specified `consumes` in the pipeline operations, we now infer it. If a component spec contains a `consumes` section and `additionalProperties` is set to true, we load all columns. If `additionalProperties` is set to false, we limit the columns to those defined in the component spec.

Fix #836
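The inference rule described here might look roughly like the following. This is a minimal sketch under stated assumptions, not the code from the linked pull request: `infer_consumes` and its parameters are hypothetical names, and column types are plain strings rather than pyarrow types.

```python
from typing import Dict, Optional


def infer_consumes(
    dataset_columns: Dict[str, str],
    spec_consumes: Dict[str, str],
    additional_properties: bool,
    consumes: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Infer which columns to load when the user did not pass consumes."""
    # A consumes argument provided by the user is kept as-is; only the gap is filled.
    if consumes is not None:
        return consumes
    # additionalProperties=true: load every column of the dataset.
    if additional_properties:
        return dict(dataset_columns)
    # additionalProperties=false: only the columns the component spec defines.
    return dict(spec_consumes)


dataset = {"text": "string", "score": "float32"}
spec = {"text": "string"}
open_spec_columns = infer_consumes(dataset, spec, additional_properties=True)
closed_spec_columns = infer_consumes(dataset, spec, additional_properties=False)
```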

mrchtr added this to Fondant development Feb 2, 2024

mrchtr converted this from a draft issue Feb 2, 2024

mrchtr mentioned this issue Feb 2, 2024

Add focus on writing output dataset #823

Closed

RobbeSneyders assigned RobbeSneyders and mrchtr and unassigned RobbeSneyders Feb 5, 2024

RobbeSneyders mentioned this issue Feb 5, 2024

Add user-friendly defaults for consumes / produces #846

Closed

mrchtr mentioned this issue Feb 6, 2024

Update readme index weaviate component #843

Merged

mrchtr moved this from Backlog to In Progress in Fondant development Feb 15, 2024

mrchtr mentioned this issue Feb 16, 2024

Infer consume operation if not present in dataset interface #859

Merged

mrchtr moved this from In Progress to Validation in Fondant development Feb 20, 2024

RobbeSneyders closed this as completed in #859 Feb 20, 2024

github-project-automation bot moved this from Validation to Done in Fondant development Feb 20, 2024
