[DataComp] Update pipeline name, remove DockerCompiler (#340)
This PR updates the name of the DataComp pipeline and makes sure it runs
both locally and on GCP.

Switching between a local run and a GCP run is currently manual:
- comment out the base path that should not be used
- run either `fondant run pipeline:pipeline --local` or `python pipeline.py`
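
One way to avoid commenting out a base path by hand would be to pick it from an environment variable. The sketch below is only an illustration of that idea and is not part of this commit: the function name `resolve_base_path`, the variable `FONDANT_RUN_LOCAL`, and both example paths are hypothetical.

```python
import os


def resolve_base_path(
    local_path: str,
    remote_path: str,
    env_var: str = "FONDANT_RUN_LOCAL",  # hypothetical variable name
) -> str:
    """Return local_path if env_var holds a truthy value, else remote_path."""
    flag = os.environ.get(env_var, "").strip().lower()
    return local_path if flag in {"1", "true", "yes"} else remote_path


if __name__ == "__main__":
    # Both paths are placeholders chosen for this example.
    base_path = resolve_base_path(
        local_path="./fondant_artifacts_datacomp",
        remote_path="gs://example-bucket/fondant-artifacts",
    )
    print(base_path)
```

The returned value could then be passed as `base_path` when constructing the `Pipeline`, assuming the rest of the configuration stays as in `pipeline.py`.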

<img width="391" alt="Screenshot 2023-08-08 at 13 03 05"
src="https://github.com/ml6team/fondant/assets/48327001/3fc14f9a-23be-4d0b-bb39-022ae42e69a3">
NielsRogge authored Aug 8, 2023
1 parent edf1039 commit 9fcc994
Showing 5 changed files with 11 additions and 22 deletions.
2 changes: 1 addition & 1 deletion components/filter_image_resolution/fondant_component.yaml
@@ -1,6 +1,6 @@
 name: Filter image resolution
 description: Component that filters images based on minimum size and max aspect ratio
-image: ghcr.io/ml6team/filter_image_resolution:latest
+image: ghcr.io/ml6team/filter_image_resolution:dev

 consumes:
   image:
@@ -1,6 +1,6 @@
 name: Filter text complexity
 description: Component that filters text based on their dependency parse complexity and number of actions
-image: ghcr.io/ml6team/filter_text_complexity:latest
+image: ghcr.io/ml6team/filter_text_complexity:dev

 consumes:
   text:
@@ -19,10 +19,6 @@ produces:
         type: float32
       sha256:
         type: utf8
-      embedding:
-        type: array
-        items:
-          type: float32

   text:
     fields:
23 changes: 8 additions & 15 deletions examples/pipelines/datacomp/pipeline.py
@@ -13,10 +13,9 @@

 # Initialize pipeline and client
 pipeline = Pipeline(
-    pipeline_name="datacomp-filtering",
+    pipeline_name="datacomp-filtering-pipeline",
     pipeline_description="A pipeline for filtering the Datacomp dataset",
-    # base_path=PipelineConfigs.BASE_PATH,
-    base_path="/Users/nielsrogge/Documents/fondant_artifacts_datacomp",
+    base_path=PipelineConfigs.BASE_PATH,
 )
 client = Client(host=PipelineConfigs.HOST)

@@ -27,7 +26,6 @@
     "original_height": "image_height",
     "face_bboxes": "image_face_bboxes",
     "sha256": "image_sha256",
-    "clip_l14_embedding": "image_embedding",
     "text": "text_data",
     "clip_b32_similarity_score": "image_text_clip_b32_similarity_score",
     "clip_l14_similarity_score": "image_text_clip_l14_similarity_score",
@@ -36,9 +34,8 @@
 load_from_hub_op = ComponentOp(
     component_dir="components/load_from_hf_hub",
     arguments={
-        "dataset_name": "nielsr/datacomp-small-with-embeddings",
+        "dataset_name": "mlfoundations/datacomp_small",
         "column_name_mapping": load_component_column_mapping,
-        "n_rows_to_load": 100,
     },
 )
 filter_image_resolution_op = ComponentOp.from_registry(
@@ -51,20 +48,16 @@
         "spacy_pipeline": "en_core_web_sm",
         "batch_size": 1000,
         "min_complexity": 1,
-        "min_num_actions": 1,
-    },
-)
-cluster_image_embeddings_op = ComponentOp(
-    component_dir="components/cluster_image_embeddings",
-    arguments={
-        "sample_ratio": 0.3,
-        "num_clusters": 3,
+        "min_num_actions": 0,
     },
 )

 # add ops to pipeline
 pipeline.add_op(load_from_hub_op)
 pipeline.add_op(filter_image_resolution_op, dependencies=load_from_hub_op)
 pipeline.add_op(filter_complexity_op, dependencies=filter_image_resolution_op)
-pipeline.add_op(cluster_image_embeddings_op, dependencies=filter_complexity_op)
# TODO add more ops


if __name__ == "__main__":
    client.compile_and_run(pipeline=pipeline)
2 changes: 1 addition & 1 deletion scripts/build_components.sh
@@ -7,7 +7,7 @@ function usage {
 echo " -t, --tag <value> Tag to add to image, repeatable
 The first tag is set in the component specifications"
 echo " -c, --cache <value> Use registry caching when building the components (default:false)"
-echo " -d, --component-dirs <value> Directory containing components to build as subdirectories.
+echo " -d, --components-dir <value> Directory containing components to build as subdirectories.
 The path should be relative to the root directory (default:components)"
 echo " -n, --namespace <value> The namespace for the built images, should match the github organization (default: ml6team)"
 echo " -co, --component <value> Specific component to build. Pass the component subdirectory name(s) to build