Background
When the consumes section in the component specification contains additionalProperties=true, it allows passing additional columns alongside the defined ones to the component. This is especially needed for reusable write components.
Name-to-name mapping cannot be used. For example:
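A sketch of such a call, reusing the "write_to_file" component and the "text_data": "text" mapping that appear elsewhere in this issue (the exact snippet is illustrative):

dataset.write(
    "write_to_file",
    consumes={"text_data": "text"},
)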
This will raise an InvalidPipelineDefinition, because no columns are specified in the consumes section of the component specification:
fondant.core.exceptions.InvalidPipelineDefinition: Received a string value for key text_data in the consumes argument passed to the operation, but text_data is not defined in the consumes section of the component spec.
Furthermore, if the component is used without defining consumes in the pipeline, no columns will be passed to the component:
dataset.write(
"write_to_file"
)
On the other hand, we have implemented custom logic for lightweight Python components, which allows us to use them without specifying consumes in the pipeline. Mapping names from the global dataset to the component's dataset does not work either.
Goal
Ideally, we should permit all three of the following options when additionalProperties=true (see the sketch after this list):
(1) Do not define any consumes in the pipeline interface. This passes all available columns to the component.
(2) Define a column mapping (e.g. "text_data": "text"), which maps from the global dataset to the component dataset.
(3) Define which columns to read from the global dataset using pyarrow types, such as "text_data": pa.string().
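A sketch of the three call styles, assuming the "write_to_file" component from above:

import pyarrow as pa

# (1) No consumes: all available columns are passed to the component.
dataset.write("write_to_file")

# (2) Column mapping: map the global dataset column "text" to the
# component column "text_data".
dataset.write("write_to_file", consumes={"text_data": "text"})

# (3) pyarrow types: read the "text_data" column from the global dataset.
dataset.write("write_to_file", consumes={"text_data": pa.string()})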
Approach
We already complete the component specifications of lightweight components in pipeline.py, using the global schema of the dataset.
A similar approach could be adopted for containerized components, as all the necessary information can be obtained from the component yaml and the schema of the global dataset.
We would then have a consistent implementation for both lightweight and containerized components at the same location in the code, and the ability to validate and catch potential errors during pipeline initialization.
This enables (1) for containerized components. We can achieve (2) for both component types by using the global dataset schema to validate the mapping and complete the ComponentSpec. (3) still works, and we could add extra validation.
Alternatively, if consumes is None and additionalProperties is set to true, we could load all columns of the dataset directly in the data_io, as sketched below.
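A minimal sketch of that fallback inside the data loader; the names operation_consumes, additional_properties, and dataset_schema are assumptions, not the actual data_io internals:

# If the operation defines no consumes and the spec allows additional
# properties, fall back to loading the full global dataset schema.
if operation_consumes is None and additional_properties:
    columns_to_load = list(dataset_schema)
else:
    columns_to_load = list(operation_consumes)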
Basic implementation; I still have to add tests. I wanted to get some feedback first.
- Added a method to the lightweight components to generate a `ComponentSpec` based on the attributes.
- Added a method in the pipeline to infer the consumes based on the `ComponentSpec`.
In cases where a user hasn't specified a `consumes` in the pipeline operations, we now infer it. If a component spec contains a `consumes` section and `additionalProperties` is set to true, we load all columns. If `additionalProperties` is set to false, we limit the columns to those defined in the component spec.
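A minimal sketch of that inference, assuming hypothetical names (infer_consumes, dataset_schema); the actual Fondant implementation may differ:

def infer_consumes(component_spec, dataset_schema):
    """Infer the consumes mapping when the user did not specify one."""
    spec_consumes = component_spec.consumes or {}
    if spec_consumes.get("additionalProperties", False):
        # additionalProperties=true: load all columns of the global dataset.
        return dict(dataset_schema)
    # additionalProperties=false: limit to the columns defined in the spec.
    return {
        name: dtype
        for name, dtype in dataset_schema.items()
        if name in spec_consumes
    }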
Fix #836