Image retrieval sample pipeline #441

Merged: mrchtr merged 6 commits into main from add-sample-pipeline-cc-25m on Sep 25, 2023

Conversation

mrchtr (Contributor) commented on Sep 20, 2023

This PR adds a sample pipeline as a starting point for the Creative Commons image dataset.

The sample pipeline demonstrates how to use the Creative Commons image dataset within a Fondant
pipeline. The dataset comprises images from diverse sources in various file formats. In this
example, the goal is to filter the dataset so that it contains only PNG files.
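
The filtering idea described above can be sketched in plain Python. This is a minimal illustration, not the component added in this PR: the `image_url` field name, the helper functions, and the sample rows are all assumptions made for the example, and it judges the format by the URL's file extension only. Printing the count of surviving rows is one simple way to see how aggressively a filter like this cuts the data.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def has_png_extension(url: str) -> bool:
    """Return True when the URL path ends in .png (case-insensitive)."""
    # urlparse drops the query string, so "?size=large" does not affect the suffix.
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lower() == ".png"


def filter_png_rows(rows):
    """Keep only rows whose (hypothetical) `image_url` field points at a PNG."""
    return [row for row in rows if has_png_extension(row["image_url"])]


# Made-up sample rows, for illustration only.
sample_rows = [
    {"image_url": "https://example.com/a/cat.png"},
    {"image_url": "https://example.com/b/dog.JPG"},
    {"image_url": "https://example.com/c/bird.PNG?size=large"},
    {"image_url": "https://example.com/d/fish.webp"},
]

kept = filter_png_rows(sample_rows)
print(f"{len(kept)} of {len(sample_rows)} rows remain")
```

An extension check is cheap but can misclassify files whose URL does not reflect the real format; inspecting the downloaded bytes would be stricter.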
Collaborator

Do we know how much this filters? How many rows remain?

examples/pipelines/filter-cc-25m/README.md (resolved)
examples/pipelines/filter-cc-25m/README.md (outdated, resolved)
examples/pipelines/filter-cc-25m/README.md (outdated, resolved)
examples/pipelines/filter-cc-25m/pipeline.py (outdated, resolved)
examples/pipelines/filter-cc-25m/README.md (outdated, resolved)
arguments={
    "dataset_name": "fondant-ai/fondant-cc-25m",
    "column_name_mapping": load_component_column_mapping,
    "n_rows_to_load": 1000,
Contributor

Do we want to keep this as the default for the final dataset?
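
One possible way to handle the question above is to make the row limit an explicit, optional setting rather than a hard-coded value. The sketch below is only an illustration: `build_load_arguments` is a hypothetical helper, the mapping passed in is a placeholder, and it assumes the load component reads the whole dataset when `n_rows_to_load` is absent, which should be checked against the component's specification.

```python
from typing import Optional


def build_load_arguments(
    column_name_mapping: dict,
    n_rows_to_load: Optional[int] = None,
) -> dict:
    """Build the load component arguments, adding a row limit only when one is given."""
    arguments = {
        "dataset_name": "fondant-ai/fondant-cc-25m",
        "column_name_mapping": column_name_mapping,
    }
    # Assumption: omitting the key means "no limit" for the load component.
    if n_rows_to_load is not None:
        arguments["n_rows_to_load"] = n_rows_to_load
    return arguments


# Placeholder mapping, for illustration only.
example_mapping = {"source_column": "target_column"}

quick_test_arguments = build_load_arguments(example_mapping, n_rows_to_load=1000)
full_run_arguments = build_load_arguments(example_mapping)
```

With this shape, a small limit stays convenient for local test runs while a full run does not silently inherit it.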

mrchtr merged commit ec84a49 into main on Sep 25, 2023
5 checks passed
mrchtr deleted the add-sample-pipeline-cc-25m branch on September 25, 2023, 12:07
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This PR adds a sample pipeline as starting point for the creative
commons image dataset.
4 participants