Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataComp pipeline] Add first 2 components #223

Merged
merged 14 commits into from
Jun 22, 2023
Merged

Conversation

NielsRogge
Copy link
Contributor

@NielsRogge NielsRogge commented Jun 21, 2023

This PR adds the first two components of the DataComp pipeline, where it reuses the load_from_hf_hub component but with a different fondant_component.yaml file.

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NielsRogge!

The pipeline is failing on pre-commit. If you haven't yet, you can run pre-commit install so it automatically runs on every commit.

components/image_resolution_filtering/src/main.py Outdated Show resolved Hide resolved
examples/pipelines/datacomp/pipeline.py Outdated Show resolved Hide resolved
Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments. Can you address them and then merge the PR so you can open new PRs for additional components?

@NielsRogge
Copy link
Contributor Author

NielsRogge commented Jun 22, 2023

Local pipeline is running fine until writing of the data of the second component, where it gives:

ValueError: The columns in the computed data do not match the columns in the provided metadata
datacomp-filter_text_complexity-1  |   Extra:   ['text_data']
datacomp-filter_text_complexity-1  |   Missing: []

=> the filter_text_complexity component is just a filtering component, so it doesn't add any new columns, hence its produces section in the spec is empty. This is correct right?

@NielsRogge NielsRogge changed the title [DataComp pipeline] Add load from hub component, pipeline [DataComp pipeline] Add first 2 components Jun 22, 2023
@RobbeSneyders
Copy link
Member

Yes that's correct.

The PandasTransformComponent does a strict check on the columns in the dataframe though. So it doesn't allow any additional columns. If you drop the column before returning the dataframe, so returning an empty dataframe with the correct index, it should work.

I think I know a way around this though. will open a PR.

@NielsRogge
Copy link
Contributor Author

NielsRogge commented Jun 22, 2023

Ok, I guess we should wait to merge this PR before that is resolved? Else we can merge with the column being dropped

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge now and I'll update it in my PR.

@RobbeSneyders RobbeSneyders merged commit 50a2796 into main Jun 22, 2023
@RobbeSneyders RobbeSneyders deleted the add_datacomp_components branch June 22, 2023 10:20
RobbeSneyders added a commit that referenced this pull request Jun 23, 2023
This PR filters out data from the pandas dataframe returned by the user
that is not defined in the component spec. Previously, returning
additional columns would raise an error. (see
#223 (comment))
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This PR adds the first two components of the DataComp pipeline, where it
reuses the `load_from_hf_hub` component but with a different
`fondant_component.yaml` file.

---------

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This PR filters out data from the pandas dataframe returned by the user
that is not defined in the component spec. Previously, returning
additional columns would raise an error. (see
#223 (comment))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants