Bugfix/sample pipeline cc 25m (#461)
This PR:

- adds the missing fields `surt_url` and `top_level_domain` needed to
run the sample CC pipeline
- raises `n_rows_to_load` from 1000 to 10000 to match the HF dataset card

---------

Co-authored-by: Matthias Richter <matthias.r1092@gmail.com>
shayorshay and mrchtr authored Sep 25, 2023
1 parent f13da17 commit 9e9da3f
Showing 3 changed files with 11 additions and 8 deletions.
@@ -1,6 +1,6 @@
 name: Load from hub
 description: Component that loads a dataset from the hub
-image: ghcr.io/ml6team/load_from_hf_hub:dev
+image: ghcr.io/ml6team/load_from_hf_hub:latest
 
 produces:
   images:
@@ -15,6 +15,10 @@ produces:
         type: string
       webpage+url:
         type: string
+      surt+url:
+        type: string
+      top+level+domain:
+        type: string
 
 args:
   dataset_name:
4 changes: 3 additions & 1 deletion examples/pipelines/filter-cc-25m/filter_pipeline.py
@@ -30,14 +30,16 @@ def create_directory_if_not_exists(path):
     "license_location": "images_license+location",
     "license_type": "images_license+type",
     "webpage_url": "images_webpage+url",
+    "surt_url": "images_surt+url",
+    "top_level_domain": "images_top+level+domain",
 }
 
 load_from_hf_hub = ComponentOp(
     component_dir="components/load_from_hf_hub",
     arguments={
         "dataset_name": "fondant-ai/fondant-cc-25m",
         "column_name_mapping": load_component_column_mapping,
-        "n_rows_to_load": 1000, # Here you can modify the number of images you want to download.
+        "n_rows_to_load": 10000, # Here you can modify the number of images you want to download.
     },
 )

9 changes: 3 additions & 6 deletions examples/pipelines/filter-cc-25m/pipeline.py
@@ -30,22 +30,19 @@ def create_directory_if_not_exists(path):
     "license_location": "images_license+location",
     "license_type": "images_license+type",
     "webpage_url": "images_webpage+url",
+    "surt_url": "images_surt+url",
+    "top_level_domain": "images_top+level+domain",
 }
 
 load_from_hf_hub = ComponentOp(
     component_dir="components/load_from_hf_hub",
     arguments={
         "dataset_name": "fondant-ai/fondant-cc-25m",
         "column_name_mapping": load_component_column_mapping,
-        "n_rows_to_load": 1000, # Here you can modify the number of images you want to download.
+        "n_rows_to_load": 10000, # Here you can modify the number of images you want to download.
     },
 )
 
-# Filter mime type component
-filter_mime_type = ComponentOp(
-    component_dir="components/filter_file_type", arguments={"mime_type": "image/png"}
-)
-
 # Download images component
 download_images = ComponentOp.from_registry(name="download_images", arguments={})
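
Reviewer note: a minimal sketch of why the two added mapping entries matter, assuming that each field declared under the `images` subset of the component spec must be the target of some entry in `column_name_mapping`, named `<subset>_<field>`. That rule is an assumption made for illustration; the helpers `unmapped_fields` and `rename_row` are hypothetical and are not part of Fondant. Only the mapping entries and spec fields visible in this diff are used.

```python
# Illustrative sketch only: plain Python, not Fondant's actual loading code.

# Column mapping after this commit (only the entries visible in the diff).
load_component_column_mapping = {
    "license_location": "images_license+location",
    "license_type": "images_license+type",
    "webpage_url": "images_webpage+url",
    "surt_url": "images_surt+url",
    "top_level_domain": "images_top+level+domain",
}

# Fields of the `images` subset that are visible in the component spec diff.
spec_fields = ["webpage+url", "surt+url", "top+level+domain"]


def unmapped_fields(subset, fields, mapping):
    """Return spec fields that no dataset column is mapped onto (hypothetical helper)."""
    targets = set(mapping.values())
    return [field for field in fields if f"{subset}_{field}" not in targets]


def rename_row(row, mapping):
    """Rename the keys of one record; keys without a mapping are kept as-is."""
    return {mapping.get(key, key): value for key, value in row.items()}


# Mapping as it was before the commit: the two new entries are absent.
old_mapping = {
    key: value
    for key, value in load_component_column_mapping.items()
    if key not in ("surt_url", "top_level_domain")
}

print(unmapped_fields("images", spec_fields, old_mapping))
# → ['surt+url', 'top+level+domain']
print(unmapped_fields("images", spec_fields, load_component_column_mapping))
# → []
```

Under that assumption, a check like `unmapped_fields` run before loading would have flagged the two fields this commit adds.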
