Update roadmap in readme (#462)
Co-authored-by: Matthias Richter <matthias.r1092@gmail.com>
GeorgesLorre and mrchtr authored Sep 25, 2023
1 parent 0eaf874 commit f13da17
Showing 5 changed files with 34 additions and 39 deletions.
4 changes: 1 addition & 3 deletions README.md
@@ -7,7 +7,7 @@
<a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
<br>
<br>
<a href="https://discord.gg/HnTdWhydGp"><img alt="Discord" src="https://dcbadge.vercel.app/api/server/HnTdWhydGp?style=flat-square"></a>
<a href="https://discord.gg/HnTdWhydGp"><img alt="Hello" src="https://dcbadge.vercel.app/api/server/HnTdWhydGp?style=flat-square"></a>
<a href="https://pypi.org/project/fondant/"><img alt="PyPI version" src="https://img.shields.io/pypi/v/fondant?color=brightgreen&style=flat-square"></a>
<a href="https://fondant.readthedocs.io/en/latest/license/"><img alt="License" src="https://img.shields.io/github/license/ml6team/fondant?style=flat-square&color=brightgreen"></a>
<a href="https://github.com/ml6team/fondant/actions/workflows/pipeline.yaml"><img alt="GitHub Workflow Status" src="https://img.shields.io/github/actions/workflow/status/ml6team/fondant/pipeline.yaml?style=flat-square"></a>
@@ -307,10 +307,8 @@ expect to run into rough edges, the foundations are ready and Fondant should alr
speed up your data preparation work.

**The following topics are on our roadmap**
- Local pipeline execution
- Non-linear pipeline DAGs
- LLM-focused example pipelines and reusable components
- Static validation, caching, and partial execution of pipelines
- Data lineage and experiment tracking
- Distributed execution, both on and off cluster
- Support other dataframe libraries such as HF Datasets, Polars, Spark
49 changes: 22 additions & 27 deletions docs/getting_started.md
@@ -1,46 +1,41 @@
# Getting started

Have a look at this page to learn how to run your first Fondant pipeline. It provides instructions for installing, executing a sample pipeline, and visually exploring the pipeline results using Fondant on your local machine.
Note: To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git installed on your system.

## Prerequisite
In this example, we will utilise Fondant's LocalRunner, which leverages docker compose for the pipeline execution. Therefore, it's important to ensure that docker compose is correctly installed.
Note: For Apple M1/M2 ship users: - Make sure that Docker uses linux/amd64 platform and not arm64. - In Docker Dashboards’ Settings<Features in development, make sure to uncheck Use containerid for pulling and storing images .

## Some things to pay attention to

For M1/M2 ship users:
- Make sure that Docker uses linux/amd64 platform and not arm64.
- In Docker Dashboards’ Settings<Features in development, make sure to uncheck `Use containerid for pulling and storing images` .
For demonstration purposes, we provide sample pipelines in the Fondant GitHub repository. A great starting point is the pipeline that loads and filters creative commons images. To follow along with the upcoming instructions, you can clone the [repository](https://github.com/ml6team/fondant) and navigate to the `examples/pipelines/filter-cc-25m` folder.

## Installation
We suggest that you use a virtual environment for your project. Fondant supports Python >=3.8.
To install Fondant via Pip, run:
This pipeline loads an image dataset and reduces the dataset to png files. For more details on how you can build this pipeline from scratch, check out our [guide](/docs/guides/build_a_simple_pipeline.md).

Install Fondant by running:
```
pip install fondant
```

You can validate the installation of fondant by running its root CLI command:

Clone the Fondant GitHub repository
```
fondant --help
git clone https://github.com/ml6team/fondant.git
```

## Demo
For demonstration purposes, we provide sample pipelines in the Fondant GitHub repository. A great starting point is the pipeline that loads and filters creative commons images. To follow along with the upcoming instructions, you can clone the [repository](https://github.com/ml6team/fondant) and navigate to the `examples/pipelines/filter-cc-25m` folder.

This pipeline loads an image dataset and reduces the dataset to png files. For more details on how you can build this pipeline from scratch, check out our [guide](/docs/guides/build_a_simple_pipeline.md).

## Running the sample pipeline and explore the data
After navigating to the pipeline directory, we can run the pipeline by using the LocalRunner as follow:
Make sure that Docker Compose is running, navigate to fondant/examples/pipelines/filter-cc-25m, and initiate the pipeline by executing:
```
fondant run pipeline --local
```

The sample pipeline will run and execute three steps, which you can monitor in the logs. It will load data from HuggingFace, filter out images, and then download those images. The pipeline results will be saved to parquet files. If you wish to visually explore the results, you can use the data explorer.
The following command will start the data explorer:
Note: For local testing purposes, the pipeline will only download the first 100,000 images. If you want to download the full dataset, you will need to modify the component arguments in the pipeline.py file, specifically the following part:
```python
load_from_hf_hub = ComponentOp(
    component_dir="components/load_from_hf_hub",
    arguments={
        "dataset_name": "fondant-ai/fondant-cc-25m",
        "column_name_mapping": load_component_column_mapping,
        "n_rows_to_load": <HERE INSERT THE NUMBER OF IMAGES YOU WANT TO DOWNLOAD>
    },
)
```
To visually inspect the results quickly, you can use:
```
fondant explore --base_path <base_path_dir>
fondant explore --base_path ./data
```

### Custom pipelines
Fondant enables you to leverage existing reusable components and integrate them with custom components. To delve deeper into creating your own pipelines, please explore our [guide](/docs/guides/build_a_simple_pipeline.md). There, you will gain insights into components, various component types, and how to effectively utilise them.
Fondant enables you to leverage existing reusable components and integrate them with custom components. To delve deeper into creating your own pipelines, please explore our [guide](/docs/guides/build_a_simple_pipeline.md). There, you will gain insights into components, various component types, and how to effectively utilise them.
10 changes: 6 additions & 4 deletions docs/guides/build_a_simple_pipeline.md
@@ -2,9 +2,9 @@

We present a walkthrough to build by yourself the pipeline presented in the Getting Started section. Have fun!

**Level**: Beginner
**Time**: 20min
**Goal**: After completing this tutorial with Fondant, you will be able to understand the different elements of a pipeline, build, and execute your first pipeline by using existing components.
**Level**: Beginner </br>
**Time**: 20min </br>
**Goal**: After completing this tutorial with Fondant, you will be able to understand the different elements of a pipeline, build, and execute your first pipeline by using existing components. </br>

**Prerequisite**: Make sure docker compose is installed on your local system

@@ -30,6 +30,7 @@ base_path="./data" # The directory that will be used to store the data


All you need to initialise a Fondant pipeline are two key parameters:

- **pipeline_name**: This is a name you can use to reference your pipeline. In this example, we've named it after the creative commons-licensed dataset used in the pipeline.
- **base_path**: This is the base path that Fondant should use for storing artifacts and data. In our case, it's a local directory path. However, it can also be a path to a remote storage bucket provided by a cloud service. Please note that the directory you reference must exist; if it doesn't, make sure to create it.
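
As a minimal sketch of the note on `base_path` above, and assuming the local `./data` directory used in this walkthrough, the directory could be created up front with the Python standard library. This only illustrates the "directory must exist" requirement; it is not part of Fondant's API.

```python
from pathlib import Path

# Assumed value: mirrors the local base_path used in the walkthrough.
base_path = Path("./data")

# Create the directory and any missing parents; do nothing if it already exists.
base_path.mkdir(parents=True, exist_ok=True)
```

Running this before initialising the pipeline is safe to repeat, since `exist_ok=True` makes the call a no-op when the directory is already there.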

@@ -38,6 +39,7 @@ All you need to initialise a Fondant pipeline are two key parameters:
Now it's time to incrementally build our pipeline by adding different execution steps. We refer to these steps as `Components`. Components are executable elements of a pipeline that consume and produce dataframes. The components are defined by a component specification. The component specification is a YAML file that outlines the input and output data structures, along with the arguments utilised by the component and a reference the the docker image used to run the component.

Fondant offers three distinct component types:

- **Reusable components**: These can be readily used without modification.
- **Generic components**: They provide the business logic but may require adjustments to the component spec.
- **Custom components**: The component implementation is user-dependent.
@@ -165,7 +167,7 @@ Finally, we add the component to the pipeline using the `add_op` method. Notably

Now, you can proceed to execute your pipeline once more and explore the results. In the explorer, you will be able to view the images that have been downloaded.

![explorer](/docs/art/guides/explorer.png)
![explorer](/art/guides/explorer.png)



8 changes: 4 additions & 4 deletions docs/guides/implement_custom_components.md
@@ -1,8 +1,8 @@
# Guide - Implement custom components

**Level**: Beginner
**Time**: 20min
**Goal**: After completing this tutorial with Fondant, you will be able to build your own custom component and integrate it into a fondant pipeline.
**Level**: Beginner </br>
**Time**: 20min </br>
**Goal**: After completing this tutorial with Fondant, you will be able to build your own custom component and integrate it into a fondant pipeline. </br>

**Prerequisite**: Make sure docker compose is installed on your local system.
We recommend completing the [first tutorial](/docs/guides/build_a_simple_pipeline.md) before proceeding with this one, as this tutorial builds upon the knowledge gained in the previous one.
@@ -22,7 +22,7 @@ This pipeline is an extension of the one introduced in the first tutorial. After

A component comprises several key elements. First, there's the ComponentSpec YAML file, serving as a blueprint for the component. It defines crucial aspects such as input and output dataframes, along with component arguments.

![component architecture](/docs/art/guides/component.png)
![component architecture](/art/guides/component.png)

The second essential part is a python class, which encapsulates the business logic that operates on the input dataframe.

2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -41,7 +41,6 @@ nav:
- Data explorer: data_explorer.md
- Infrastructure: infrastructure.md
- Manifest: manifest.md
- Contributing: contributing.md

plugins:
- mkdocstrings
@@ -61,3 +60,4 @@ markdown_extensions:
emoji_index: !!python/name:materialx.emoji.twemoji
emoji_generator: !!python/name:materialx.emoji.to_svg
- admonition
- def_list
