Don't use from_registry for generic components #285

Merged 4 commits on Jul 18, 2023
2 changes: 1 addition & 1 deletion README.md
@@ -174,7 +174,7 @@ def build_pipeline():
     pipeline.add_op(load_from_hub_op)

     custom_op = ComponentOp(
-        component_spec_path="components/custom_component/fondant_component.yaml",
+        component_dir="components/custom_component",
         arguments={
             "min_width": 600,
             "min_height": 600,
2 changes: 1 addition & 1 deletion docs/component_spec.md
@@ -143,7 +143,7 @@ If an argument is not explicitly provided, the default value will be used instea
 from fondant.pipeline import ComponentOp

 custom_op = ComponentOp(
-    component_spec_path="components/custom_component/fondant_component.yaml",
+    component_dir="components/custom_component",
     arguments={
         "custom_argument": "foo"
     },
148 changes: 148 additions & 0 deletions docs/components.md
@@ -0,0 +1,148 @@
# Components

Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
provides a lot of components out of the box
([overview](https://github.com/ml6team/fondant/tree/main/components)), but you can also define your
own custom components.

## The anatomy of a component

A component is completely defined by its [component specification](component_spec.md) and a
docker image. The specification defines the docker image fondant should run to execute the
component, which data it consumes and produces, and which arguments it takes.

## Component types

Review comment (Contributor): I like the descriptions, quite clear

We can distinguish three different types of components:

- **Reusable components** can be used out of the box and can be loaded from the fondant
component registry
- **Custom components** are completely defined and implemented by the user
- **Generic components** leverage a reusable implementation, but require a custom component
specification

### Reusable components

Reusable components are completely defined and implemented by fondant. You can easily add them
to your pipeline by creating an operation using `ComponentOp.from_registry()`.

```python
from fondant.pipeline import ComponentOp

component_op = ComponentOp.from_registry(
    name="reusable_component",
    arguments={
        "arg": "value"
    }
)
```

??? "fondant.pipeline.ComponentOp.from_registry"

    ::: fondant.pipeline.ComponentOp.from_registry
        handler: python
        options:
            show_source: false

Review comment (Contributor): are those meant to end up here? there are quite a few of them

Reply (Member Author): Yes, you can see the result in the built documentation:
https://fondant--285.org.readthedocs.build/en/285/components/

Reply (Contributor): looks neat ✨

You can find an overview of the reusable components offered by fondant
[here](https://github.com/ml6team/fondant/tree/main/components). Check their
`fondant_component.yaml` file for information on which arguments they accept and which data they
consume and produce.

### Custom components

To define your own custom component, you can build your code into a docker image and write an
accompanying component specification that refers to it.

A typical file structure for a custom component looks like this:
```
|- components
|  |- custom_component
|     |- src
|     |  |- main.py
|     |- Dockerfile
|     |- fondant_component.yaml
|- pipeline.py
```

The `Dockerfile` is used to build the code into a docker image, which is then referred to in the
`fondant_component.yaml`.

```yaml title="components/custom_component/fondant_component.yaml"
name: Custom component
description: This is a custom component
image: custom_component:latest
```

You can add a custom component to your pipeline by creating a `ComponentOp` and passing in the path
to the directory containing your `fondant_component.yaml`.

```python title="pipeline.py"
from fondant.pipeline import ComponentOp

component_op = ComponentOp(
    component_dir="components/custom_component",
    arguments={
        "arg": "value"
    }
)
```

??? "fondant.pipeline.ComponentOp"

    ::: fondant.pipeline.ComponentOp
        handler: python
        options:
            members: []
            show_source: false

See our [best practices on creating a custom component](custom_component.md).

### Generic components

A generic component is a component leveraging a reusable docker image, but requiring a custom
`fondant_component.yaml` specification.

Since a generic component only requires a custom `fondant_component.yaml`, its file structure
looks like this:
```
|- components
|  |- generic_component
|     |- fondant_component.yaml
|- pipeline.py
```

The `fondant_component.yaml` refers to the reusable image it leverages:

```yaml title="components/generic_component/fondant_component.yaml"
name: Generic component
description: This is a generic component
image: reusable_component:latest
```

You can add a generic component to your pipeline by creating a `ComponentOp` and passing in the path
to the directory containing your custom `fondant_component.yaml`.

```python title="pipeline.py"
from fondant.pipeline import ComponentOp

component_op = ComponentOp(
    component_dir="components/generic_component",
    arguments={
        "arg": "value"
    }
)
```

??? "fondant.pipeline.ComponentOp"

    ::: fondant.pipeline.ComponentOp
        handler: python
        options:
            members: []
            show_source: false

An example of a generic component is the
[`load_from_hf_hub`](https://github.com/ml6team/fondant/tree/main/components/load_from_hf_hub)
component. It can read any dataset from the Hugging Face Hub, but it requires the user to define
the schema of the produced dataset in a custom `fondant_component.yaml` specification.
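
The custom and generic layouts described above differ only in whether a `Dockerfile` sits next to `fondant_component.yaml`. The sketch below illustrates that distinction using nothing but the file system. It assumes the presence of a `Dockerfile` is what separates the two cases; `find_dockerfile` and `describe_component_dir` are hypothetical helpers written for this illustration and are not part of the fondant API.

```python
from pathlib import Path
from typing import Optional

# File name used for component specifications throughout the docs above.
SPEC_FILENAME = "fondant_component.yaml"


def find_dockerfile(component_dir: Path) -> Optional[Path]:
    """Return the Dockerfile inside `component_dir`, or None if there is none."""
    dockerfile = component_dir / "Dockerfile"
    return dockerfile if dockerfile.is_file() else None


def describe_component_dir(component_dir: Path) -> str:
    """Classify a component directory by the files it contains (hypothetical helper)."""
    if not (component_dir / SPEC_FILENAME).is_file():
        raise FileNotFoundError(f"No {SPEC_FILENAME} found in {component_dir}")
    if find_dockerfile(component_dir) is not None:
        # Own Dockerfile: the image is built from this directory.
        return "custom"
    # Specification only: the image named in the specification is reused.
    return "generic"
```

Under these assumptions, `describe_component_dir(Path("components/custom_component"))` would report a custom component for the first layout and a generic one for the second.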
7 changes: 3 additions & 4 deletions docs/getting_started.md
@@ -48,9 +48,8 @@ Now that we have a pipeline, we can add components to it. Components are the bui
 Let's add a reusable component to our pipeline. We will use the `load_from_hf_hub` component to read data from huggingface. Add the following code to your `pipeline.py` file:

 ```Python
-load_from_hf_hub = ComponentOp.from_registry(
-    name='load_from_hf_hub',
-    component_spec_path='components/load_from_hf_hub/fondant_component.yml',
+load_from_hf_hub = ComponentOp(
+    component_dir='components/load_from_hf_hub',
     arguments={
         'dataset_name': 'huggan/pokemon',
         'n_rows_to_load': 100,
@@ -278,7 +277,7 @@ With our component complete we can now add it to our pipeline definition (`pipel

 ```python
 extract_resolution = ComponentOp(
-    component_spec_path='components/extract_resolution/fondant_component.yml',
+    component_dir='components/extract_resolution',
 )

 my_pipeline.add_op(load_from_hf_hub) # this line was already there
2 changes: 1 addition & 1 deletion docs/pipeline.md
@@ -24,7 +24,7 @@ def build_pipeline():
     pipeline.add_op(load_from_hub_op)

     caption_images_op = ComponentOp(
-        component_spec_path="components/captioning_component/fondant_component.yaml",
+        component_dir="components/captioning_component",
         arguments={
             "model_id": "Salesforce/blip-image-captioning-base",
             "batch_size": 2,
7 changes: 3 additions & 4 deletions examples/pipelines/controlnet-interior-design/pipeline.py
@@ -15,7 +15,7 @@

 # Define component ops
 generate_prompts_op = ComponentOp(
-    component_spec_path="components/generate_prompts/fondant_component.yaml",
+    component_dir="components/generate_prompts",
     arguments={"n_rows_to_load": None},
 )
 laion_retrieval_op = ComponentOp.from_registry(
@@ -59,9 +59,8 @@
     node_pool_name="model-inference-pool",
 )

-write_to_hub_controlnet = ComponentOp.from_registry(
-    name="write_to_hf_hub",
-    component_spec_path="components/write_to_hub_controlnet/fondant_component.yaml",
+write_to_hub_controlnet = ComponentOp(
+    component_dir="components/write_to_hub_controlnet",
     arguments={
         "username": "test-user",
         "dataset_name": "segmentation_kfp",
12 changes: 5 additions & 7 deletions examples/pipelines/datacomp/pipeline.py
@@ -34,9 +34,8 @@
     "clip_l14_similarity_score": "image_text_clip_l14_similarity_score",
 }

-load_from_hub_op = ComponentOp.from_registry(
-    name="load_from_hf_hub",
-    component_spec_path="components/load_from_hf_hub/fondant_component.yaml",
+load_from_hub_op = ComponentOp(
+    component_dir="components/load_from_hf_hub",
     arguments={
         "dataset_name": "nielsr/datacomp-small-with-embeddings",
         "column_name_mapping": load_component_column_mapping,
@@ -48,17 +47,16 @@
     arguments={"min_image_dim": 200, "max_aspect_ratio": 3},
 )
 filter_complexity_op = ComponentOp(
-    component_spec_path="components/filter_text_complexity/fondant_component.yaml",
+    component_dir="components/filter_text_complexity",
     arguments={
         "spacy_pipeline": "en_core_web_sm",
         "batch_size": 1000,
         "min_complexity": 1,
         "min_num_actions": 1,
     },
 )
-cluster_image_embeddings_op = ComponentOp.from_registry(
-    name="cluster_image_embeddings",
-    component_spec_path="components/cluster_image_embeddings/fondant_component.yaml",
+cluster_image_embeddings_op = ComponentOp(
+    component_dir="components/cluster_image_embeddings",
     arguments={
         "sample_ratio": 0.3,
         "num_clusters": 3,
10 changes: 4 additions & 6 deletions examples/pipelines/finetune_stable_diffusion/pipeline.py
@@ -19,9 +19,8 @@
     value: key for key, value in load_component_column_mapping.items()
 }
 # Define component ops
-load_from_hub_op = ComponentOp.from_registry(
-    name="load_from_hf_hub",
-    component_spec_path="components/load_from_hf_hub/fondant_component.yaml",
+load_from_hub_op = ComponentOp(
+    component_dir="components/load_from_hf_hub",
     arguments={
         "dataset_name": "logo-wizard/modern-logo-dataset",
         "column_name_mapping": load_component_column_mapping,
@@ -72,9 +71,8 @@
     node_pool_name="model-inference-pool",
 )

-write_to_hub = ComponentOp.from_registry(
-    name="write_to_hf_hub",
-    component_spec_path="components/write_to_hf_hub/fondant_component.yaml",
+write_to_hub = ComponentOp(
+    component_dir="components/write_to_hf_hub",
     arguments={
         "username": "test-user",
         "dataset_name": "stable_diffusion_processed",
5 changes: 2 additions & 3 deletions examples/pipelines/starcoder/pipeline.py
@@ -36,9 +36,8 @@
 )

 # define ops
-load_from_hub_op = ComponentOp.from_registry(
-    name="load_from_hub",
-    component_spec_path="components/load_from_hub/fondant_component.yaml",
+load_from_hub_op = ComponentOp(
+    component_dir="components/load_from_hub",
     arguments={
         "dataset_name": "ml6team/the-stack-smol-python",
         "column_name_mapping": load_component_column_mapping,
11 changes: 8 additions & 3 deletions mkdocs.yml
@@ -30,13 +30,18 @@ nav:
   - Home: index.md
   - Getting Started: getting_started.md
   - Building a pipeline: pipeline.md
-  - Creating custom components: custom_component.md
-  - Read / write components: generic_component.md
-  - Component spec: component_spec.md
+  - Components:
+    - Components: components.md
+    - Creating custom components: custom_component.md
+    - Read / write components: generic_component.md
+    - Component spec: component_spec.md
   - Data explorer: data_explorer.md
   - Infrastructure: infrastructure.md
   - Manifest: manifest.md

+plugins:
+  - mkdocstrings
+
 markdown_extensions:
   - pymdownx.snippets:
       check_paths: true
1 change: 1 addition & 0 deletions pyproject.toml
@@ -68,6 +68,7 @@ coveralls = "^3.3.1"

 [tool.poetry.group.docs.dependencies]
 mkdocs-material = "^9.1.8"
+mkdocstrings = { version = "^0.20", extras = ["python"]}

 [tool.poetry.scripts]
 fondant = "fondant.cli:entrypoint"
2 changes: 1 addition & 1 deletion scripts/pre-build.sh
@@ -9,5 +9,5 @@ root_path=$(dirname "$scripts_path")

 pushd "$root_path"
 rm -rf src/fondant/components
-cp -r components src/fondant/
+find components/ -type f | grep -i yaml$ | xargs -i cp --parents {} src/fondant/
 popd

Review comment (Member Author): Only copy the component specifications and keep the same structure.

Review comment (Member Author): Can you validate this on mac @GeorgesLorre?
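
The copy step in `scripts/pre-build.sh` above relies on `xargs -i` and `cp --parents`, which are GNU options; that is presumably why the review asks for validation on macOS. As an illustration of what the step is meant to do (copy only the files whose path ends in `yaml`, keeping the directory structure), here is a rough Python equivalent. It is a sketch, not part of the PR; `copy_component_specs` is a hypothetical name.

```python
import shutil
from pathlib import Path
from typing import List


def copy_component_specs(source_root: Path, target_root: Path) -> List[Path]:
    """Copy files under `source_root` whose name ends in 'yaml' (case-insensitive)
    into `target_root`, recreating their path starting at `source_root`'s own name."""
    copied = []
    for path in sorted(source_root.rglob("*")):
        if path.is_file() and path.name.lower().endswith("yaml"):
            # Keep "components/<name>/..." below the target, like `cp --parents`.
            destination = target_root / path.relative_to(source_root.parent)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy(path, destination)
            copied.append(destination)
    return copied
```

Run from the repository root, `copy_component_specs(Path("components"), Path("src/fondant"))` would mirror the shell pipeline: specifications end up under `src/fondant/components/...`, while Dockerfiles and source code are left behind.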
9 changes: 5 additions & 4 deletions src/fondant/compiler.py
@@ -170,10 +170,11 @@ def _generate_spec(self, pipeline: Pipeline, extra_volumes: list) -> dict:
                 "volumes": volumes,
             }

-            if component_op.local_component:
-                services[safe_component_name][
-                    "build"
-                ] = f"./{Path(component_op.component_spec_path).parent}"
+            if component_op.dockerfile_path is not None:
+                logger.info(
+                    f"Found Dockerfile for {component_name}, adding build step.",
+                )
+                services[safe_component_name]["build"] = str(component_op.component_dir)
             else:
                 services[safe_component_name][
                     "image"

Review comment (Member Author, on the added `"build"` line): The previous implementation failed for absolute paths, this works for both absolute and relative paths.
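
The absolute-path problem mentioned in the review of `src/fondant/compiler.py` can be reproduced in isolation. The sketch below mirrors the removed and the added expression as two standalone functions; `old_build_context` and `new_build_context` are hypothetical names, the example paths are made up, and `PurePosixPath` stands in for `Path` so the result is the same on every operating system.

```python
from pathlib import PurePosixPath


def old_build_context(component_spec_path: str) -> str:
    """Mirror of the removed expression: './' + the spec file's parent directory."""
    return f"./{PurePosixPath(component_spec_path).parent}"


def new_build_context(component_dir: str) -> str:
    """Mirror of the added expression: the component directory as given."""
    return str(PurePosixPath(component_dir))


# Hypothetical paths, for illustration only.
relative_spec = "components/custom_component/fondant_component.yaml"
absolute_spec = "/home/user/project/components/custom_component/fondant_component.yaml"

print(old_build_context(relative_spec))  # ./components/custom_component
print(old_build_context(absolute_spec))  # .//home/user/project/components/custom_component
print(new_build_context("components/custom_component"))
print(new_build_context("/home/user/project/components/custom_component"))
```

With the old expression, an absolute path ends up prefixed with `./`, so it no longer points at the intended directory; passing the directory through unchanged avoids that for both kinds of path.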