Update documentation dataset first interface #921

Merged
merged 3 commits into from
Apr 5, 2024
24 changes: 12 additions & 12 deletions docs/architecture.md
@@ -2,7 +2,7 @@

### Fondant architecture overview

![data explorer](art/architecture.png)
![fondant architecture](art/architecture.png)

At a high level, Fondant consists of three main parts:

@@ -13,7 +13,7 @@ At a high level, Fondant consists of three main parts:
specifications mainly include the component image location, arguments, columns it consumes and
produces.
* `manifest.py` Describes dataset content, facilitating reference passing between components.
It evolves during pipeline execution and aids static evaluation.
It evolves during dataset materialization and aids static evaluation.
* `schema.py` Defines the Type class, used for dataset data type definition.
* `/schema` Directory Containing JSON schema specifications for the component spec and manifest.

@@ -37,25 +37,25 @@ At a high level, Fondant consists of three main parts:
component type.


* The `/dataset` directory which contains the modules for implementing a Fondant pipeline.
* `dataset.py`: Defines the `Dataset` class which is used to define the graph. The
implemented class is then consumed by the compiler to compile to a specific runner.
This module also implements the
`ComponentOp` class which is used to define the component operation in the pipeline graph.
* The `/dataset` directory which contains the modules for implementing a Fondant dataset.
* `dataset.py`: Defines the `Dataset` class which is used to define the workflow graph to
materialize the dataset. The implemented class is then consumed by the compiler to compile
to a specific workflow runner.
This module also implements the `ComponentOp` class which is used to define the component
operation in the workflow graph.
* `compiler.py`: Defines the `Compiler` class which is used to define the compiler that
compilers the pipeline graph for a specific
runner.
compiles the workflow graph for a specific runner.
* `runner.py`: Defines the `Runner` class which is used to define the runner that executes the
compiled pipeline graph.
compiled workflow graph.
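
For orientation, here is a minimal sketch of how these pieces fit together at the API level. The component names and arguments are illustrative, and the compile and run steps are shown only as comments:

```python
from fondant.dataset import Dataset

# Each `create`/`apply` call adds a `ComponentOp` node to the workflow graph
dataset = Dataset.create(
    "load_from_csv",                       # illustrative component reference
    arguments={"dataset_uri": "path/to/dataset.csv"},
    dataset_name="example-dataset",
)
dataset = dataset.apply("caption_images")  # illustrative follow-up operation

# A runner-specific `Compiler` turns this graph into a workflow spec,
# and the matching `Runner` executes it to materialize the dataset.
```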

### Additional modules

Additional modules in Fondant include:

* `cli.py`: Defines the CLI for interacting with Fondant. This includes the `fondant` command line
tool which is used to build components,
compile and run pipelines and explore datasets.
compile and run workflows to materialize and explore datasets.
* `explore.py`: Runs the explorer which is a web application that allows the user to explore the
content of a dataset.
* `build.py`: Defines the `build` command which is used to build and publish a component.
* `testing.py`: Contains common testing utilities for testing components and pipelines.
* `testing.py`: Contains common testing utilities for testing components and datasets.
Binary file modified docs/art/architecture.png
27 changes: 13 additions & 14 deletions docs/caching.md
@@ -1,20 +1,20 @@
## What is caching?

Fondant supports caching of pipeline executions. If a certain component and its arguments
Fondant supports caching of workflow executions. If a certain component and its arguments
are exactly the same as in some previous execution, then its execution can be skipped and the output
dataset of the previous execution can be used instead.

Caching offers the following benefits:
1) **Reduced costs.** Skipping the execution of certain components can help avoid unnecessary costly computations.
2) **Faster pipeline runs.** Skipping the execution of certain components results in faster pipeline runs.
3) **Faster pipeline development.** Caching allows you develop and test your pipeline faster.
4) **Reproducibility.** Caching allows you to reproduce the results of a pipeline run by reusing
the outputs of a previous pipeline run.
2) **Faster workflow runs.** Skipping the execution of certain components results in faster workflow execution.
3) **Faster dataset development.** Caching allows you to develop and test your datasets faster.
4) **Reproducibility.** Caching allows you to reproduce the results of a run by reusing
the outputs of a previous run.

!!! note "IMPORTANT"

The cached runs are tied to the base path which stores the caching key of previous component runs.
Changing the base path will invalidate the cache of previous executed pipelines.
The cached runs are tied to the working directory which stores the caching key of previous component runs.
Changing the working directory will invalidate the cache of previously materialized datasets.

The caching feature is **enabled** by default.

@@ -23,7 +23,7 @@ The caching feature is **enabled** by default.
You can turn off execution caching at component level by setting the following:

```python
from fondant.pipeline.pipeline import ComponentOp
from fondant.dataset.dataset import ComponentOp

caption_images_op = ComponentOp(
component_dir="...",
@@ -35,12 +35,12 @@ caption_images_op = ComponentOp(
```
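
If the same switch is available on the dataset-first interface, it would presumably be passed when applying a component. The sketch below is an assumption: it supposes `apply` forwards a `cache` flag to the underlying `ComponentOp` and that a reusable `caption_images` component exists.

```python
# Hypothetical dataset-first equivalent; component name and arguments are illustrative
dataset = dataset.apply(
    "caption_images",
    arguments={"model_id": "..."},
    cache=False,  # assumption: disables cache lookups for this operation only
)
```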

## How caching works
When Fondant runs a pipeline, it checks to see whether an execution exists in the base path based on
When Fondant materializes a dataset, it checks to see whether an execution exists in the working directory based on
the cache key of each component.

The cache key is defined as the combination of the following:

1) The **pipeline step's inputs.** These inputs include the input arguments' value (if any).
1) The **operation step's inputs.** These inputs include the input arguments' value (if any).

2) **The component's specification.** This specification includes the image tag and the fields
consumed and produced by each component.
@@ -51,11 +51,10 @@ The cache key is defined as the combination of the following:
If there is a matching execution in the base path (checked based on the output manifests),
the outputs of that execution are used and the step computation is skipped.

Additionally, only the pipelines with the same pipeline name will share the cache. Caching for
Additionally, only datasets with the same dataset name will share the cache. Caching for
components
with the `latest` image tag is disabled by default. This is because using `latest` image tags can
lead to unpredictable behavior due to
image updates. Moreover, if one component in the pipeline is not cached then caching will be
disabled for all
subsequent components.
image updates. Moreover, if one component in the dataset is not cached then caching will be
disabled for all subsequent components.
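
Conceptually, the cache key behaves like a digest over the operation's inputs and the component specification. The snippet below is only an illustrative sketch of that idea, not Fondant's actual implementation:

```python
import hashlib
import json

def cache_key(arguments: dict, component_spec: dict) -> str:
    # Any change to the argument values or to the component spec
    # (image tag, consumed/produced fields) yields a different key,
    # which is why such changes invalidate previously cached executions.
    payload = json.dumps(
        {"arguments": arguments, "component_spec": component_spec},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```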

8 changes: 4 additions & 4 deletions docs/components/component_spec.md
@@ -165,7 +165,7 @@ in the component specification, so we will need to specify the schema of the
fields when defining the components

```python
dataset = Dataset.read(
dataset = Dataset.create(
"load_from_csv",
arguments={
"dataset_uri": "path/to/dataset.csv",
@@ -196,7 +196,7 @@ by the next component. We can either load the `image` field:

```python

dataset = Dataset.read(
dataset = Dataset.create(
"load_from_csv",
arguments={
"dataset_uri": "path/to/dataset.csv",
@@ -219,7 +219,7 @@ or the `embedding` field:

```python

dataset = Dataset.read(
dataset = Dataset.create(
"load_from_csv",
arguments={
"dataset_uri": "path/to/dataset.csv",
@@ -268,7 +268,7 @@ These arguments are passed in when the component is instantiated.
If an argument is not explicitly provided, the default value will be used instead if available.

```python
dataset = Dataset.read(
dataset = pipeline.read(
"custom_component",
arguments={
"custom_argument": "foo"
22 changes: 11 additions & 11 deletions docs/components/components.md
@@ -2,7 +2,7 @@ from distributed import Client

# Components

Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
Fondant makes it easy to build datasets collaboratively, leveraging reusable components. Fondant
provides a lot of components out of the box
([overview](https://fondant.ai/en/latest/components/hub/)), but you can also define your
own custom components.
@@ -20,9 +20,9 @@ The logic should be implemented as a class, inheriting from one of the base `Com
offered by Fondant.
There are three large types of components:

- **`LoadComponent`**: Load data into your pipeline from an external data source
- **`TransformComponent`**: Implement a single transformation step in your pipeline
- **`WriteComponent`**: Write the results of your pipeline to an external data sink
- **`LoadComponent`**: Load data and initialise a dataset from an external data source
- **`TransformComponent`**: Implement a single transformation step to transform data in your dataset
- **`WriteComponent`**: Write your dataset to an external data sink
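
As an illustration of the third type, a minimal write component might look roughly like the sketch below. It assumes the Dask-based `DaskWriteComponent` base class with a `write` method and arguments passed via the constructor; the `WriteToParquet` name and `path` argument are made up:

```python
from fondant.component import DaskWriteComponent
import dask.dataframe as dd


class WriteToParquet(DaskWriteComponent):
    def __init__(self, path: str, **kwargs):
        # Arguments declared in the component spec arrive as constructor arguments
        self.path = path

    def write(self, dataframe: dd.DataFrame) -> None:
        # Persist the final dataset to the external sink
        dataframe.to_parquet(self.path)
```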

The easiest way to implement a `TransformComponent` is to subclass the provided
`PandasTransformComponent`. This component streams your data and offers it in memory-sized
@@ -124,7 +124,7 @@ implements the logic of your component.

```python
from fondant.component import PandasTransformComponent
from fondant.pipeline import lightweight_component
from fondant.dataset import lightweight_component
import pandas as pd
import pyarrow as pa

@@ -138,10 +138,10 @@ class AddNumber(PandasTransformComponent):
return dataframe
```

You can add a custom component to your pipeline by passing in the reference to the component class containing
You can apply a custom component to your dataset by passing in the reference to the component class containing
your script.

```python title="pipeline.py"
```python title="dataset.py"
_ = dataset.apply(
ref=AddNumber,
produces={"x": pa.int32()},
@@ -167,7 +167,7 @@ A typical file structure for a custom component looks like this:
| |- Dockerfile
| |- fondant_component.yaml
| |- requirements.txt
|- pipeline.py
|- dataset.py
```

The `Dockerfile` is used to build the code into a docker image, which is then referred to in the
@@ -179,10 +179,10 @@ description: This is a custom component
image: custom_component:latest
```

You can add a custom component to your pipeline by passing in the path to the directory containing
You can apply a custom component to your dataset by passing in the path to the directory containing
your `fondant_component.yaml`.

```python title="pipeline.py"
```python title="dataset.py"

dataset = dataset.apply(
component_dir="components/custom_component",
@@ -198,7 +198,7 @@ See our [best practices on creating a containerized component](../components/con
### Reusable components

Reusable components are out of the box containerized components from the Fondant Hub that you can easily add
to your pipeline:
to your dataset:

```python

4 changes: 2 additions & 2 deletions docs/components/containerized_components.md
@@ -1,6 +1,6 @@
# Creating containerized components

Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
Fondant makes it easy to build datasets collaboratively, leveraging reusable components. Fondant
provides a lot
of [components out of the box](https://fondant.ai/en/latest/components/hub/), but you can also
define your own containerized components.
@@ -79,6 +79,6 @@ transformers==4.29.2
```

Refer to this [section](publishing_components.md) to find out how to build and publish your components to use them in
your own pipelines.
your own dataset workflows.


41 changes: 18 additions & 23 deletions docs/components/lightweight_components.md
@@ -1,17 +1,17 @@
# Creating lightweight components

Lightweight components are a great way to implement custom data processing steps in your pipeline.
They are easy to implement and can be reused across different pipelines. If you want to
Lightweight components are a great way to implement custom data processing steps in your dataset workflows.
They are easy to implement and can be reused across different datasets. If you want to
build more complex components that require additional dependencies (e.g. GPU support), you can
also build a containerized component. See the [containerized component guide](../components/containerized_components.md) for more info.

To implement a lightweight component, you simply need to create a python script that implements
the component logic. Here is an example of a pipeline composed of two custom components,
the component logic. Here is an example of a dataset composed of two custom components,
one that creates a dataset and one that adds a number to a column of the dataset:

```python title="pipeline.py"
```python title="dataset.py"
from fondant.component import DaskLoadComponent, PandasTransformComponent
from fondant.pipeline import lightweight_component
from fondant.dataset import lightweight_component
import dask.dataframe as dd
import pandas as pd
import pyarrow as pa
@@ -42,31 +42,26 @@ Notice that we use the `@lightweight_component` decorator to define our componen
is used to package the component into a containerized component and can also be used to
define additional functionalities.

To register those components to a pipeline, we can use the `read` and `apply` method for the
To register those components to a dataset, we can use the `create` and `apply` method for the
first and second component respectively:

```python title="pipeline.py"
from fondant.pipeline import Pipeline
```python title="datast.py"
from fondant.dataset import Dataset

pipeline = Pipeline(
name="dummy-pipeline",
base_path="./data",
)

dataset = Dataset.read(
dataset = Dataset.create(
ref=CreateData,
dataset_name="dummy-pipeline",
)

_ = dataset.apply(
ref=AddNumber,
arguments={"n": 1},
)
```

Here we are creating a pipeline that reads data from the `CreateData` component and then applies
Here we are creating a dataset workflow that reads data from the `CreateData` component and then applies
the `AddNumber` component to it. The `produces` argument is used to define the schema of the output
of the component. This is used to validate the output of the component and to define the schema
of the next component in the pipeline.
of the next component in the dataset.

Behind the scenes, Fondant will automatically package the component into a containerized component that
uses a base image with the current installed Fondant and python version.
@@ -77,15 +72,15 @@ If you want to install additional requirements for your component, you can do so
package to the `extra_requires` argument of the `@lightweight_component` decorator. This will
install the package in the containerized component.

```python title="pipeline.py"
```python title="dataset.py"
@lightweight_component(extra_requires=["numpy"])
```

Under the hood, we are injecting the source into a docker container. If you want to use additional
dependencies, you have to make sure to import the libraries inside a function directly.

For example:
```python title="pipeline.py"
```python title="dataset.py"
...
def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
import numpy as np
@@ -100,7 +95,7 @@ If you want to change the base image of the containerized component, you can do
image instead of the default one. Make sure you install Fondant in the base image or list it
in the `extra_requires` argument.

```python title="pipeline.py"
```python title="dataset.py"
@lightweight_component(base_image="python:3.10-slim")
```

@@ -111,7 +106,7 @@ of the decorator.
If we take the previous example, we can restrict the columns that are loaded by the `AddNumber` component
by specifying the `x` column in the `consumes` argument:

```python title="pipeline.py"
```python title="dataset.py"
@lightweight_component(
consumes={
"x": pa.int32()
@@ -136,7 +131,7 @@ it to containerized component. See the [containerized component guide](../compon

You can also choose to load in dynamic fields by setting the `additionalProperties` argument to `True` in the `consumes` argument.

This will allow you to define an arbitrary number of columns to be loaded when applying your component to the pipeline.
This will allow you to define an arbitrary number of columns to be loaded when applying your component to the dataset.

This can be useful in scenarios when we want to dynamically load in fields from a dataset. For example, if we want to aggregate results
from multiple columns, we can define a component that loads in specific column from the previous component and then aggregates them.
@@ -147,7 +142,7 @@ the `x` and `z` columns into a new column `score`:
```python
import dask.dataframe as dd
from fondant.component import PandasTransformComponent
from fondant.pipeline import lightweight_component
from fondant.dataset import lightweight_component

@lightweight_component(
consumes={
2 changes: 1 addition & 1 deletion docs/components/publishing_components.md
@@ -31,7 +31,7 @@ component is located.

The tag argument is used to specify the Docker container tag. When specified, the tag in the
referenced component specification yaml will also be
updated, ensuring that the next pipeline run correctly references the image.
updated, ensuring that the next dataset workflow run correctly references the image.


!!! note "IMPORTANT"