Update hub links in documentation after adding new components #720

Merged · 4 commits · Dec 12, 2023
Changes from all commits
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -68,5 +68,5 @@ repos:
name: Generate component READMEs
language: python
entry: python scripts/component_readme/generate_readme.py
-files: ^components/.*/fondant_component.yaml
+files: ^components/[^/]*/fondant_component.yaml
additional_dependencies: ["fondant@git+https://github.com/ml6team/fondant@main", "Jinja2==3.1.2"]
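
The tightened `files` pattern matters because this PR adds `components/load_with_llamahub/tests/fondant_component.yaml`: the README generator should only run on top-level component specs. A minimal sketch of the difference, assuming Python `re` semantics (which pre-commit uses):

```python
import re

OLD = r"^components/.*/fondant_component.yaml"
NEW = r"^components/[^/]*/fondant_component.yaml"

top_level = "components/load_with_llamahub/fondant_component.yaml"
nested = "components/load_with_llamahub/tests/fondant_component.yaml"

# The old pattern matches both paths, so the README generator would also run
# on the test fixture; `[^/]*` cannot cross a "/", so the new pattern only
# matches specs sitting directly under components/.
assert re.match(OLD, top_level) and re.match(OLD, nested)
assert re.match(NEW, top_level) and not re.match(NEW, nested)
```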
2 changes: 1 addition & 1 deletion components/download_images/fondant_component.yaml
@@ -10,7 +10,7 @@ description: |

image: fndnt/download_images:dev
tags:
-  - Image processing
+  - Data retrieval

consumes:
image_url:
2 changes: 1 addition & 1 deletion components/load_from_csv/README.md
@@ -1,4 +1,4 @@
-# Load from csv file
+# Load from csv

### Description
Component that loads a dataset from a csv file
2 changes: 1 addition & 1 deletion components/load_from_csv/fondant_component.yaml
@@ -1,4 +1,4 @@
-name: Load from csv file
+name: Load from csv
description: Component that loads a dataset from a csv file
image: fndnt/load_from_csv:dev
tags:
2 changes: 1 addition & 1 deletion components/load_from_hf_hub/README.md
@@ -1,4 +1,4 @@
-# Load from hub
+# Load from Hugging Face hub

### Description
Component that loads a dataset from the hub
2 changes: 1 addition & 1 deletion components/load_from_hf_hub/fondant_component.yaml
@@ -1,4 +1,4 @@
-name: Load from hub
+name: Load from Hugging Face hub
description: Component that loads a dataset from the hub
image: fndnt/load_from_hf_hub:dev
tags:
29 changes: 29 additions & 0 deletions components/load_with_llamahub/Dockerfile
@@ -0,0 +1,29 @@
FROM --platform=linux/amd64 python:3.8-slim as base

# System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/

FROM base as test
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]

56 changes: 56 additions & 0 deletions components/load_with_llamahub/README.md
@@ -0,0 +1,56 @@
# Load with LlamaHub

### Description
Load data using a LlamaHub loader. For available loaders, check the
[LlamaHub](https://llamahub.ai/).


### Inputs / outputs

**This component consumes no data.**

**This component produces additional fields dynamically (`additionalProperties: true`); the exact fields depend on the chosen loader.**

### Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| loader_class | str | The name of the LlamaIndex loader class to use. Make sure to provide the name and not the id. The name is passed to `llama_index.download_loader` to download the specified loader. | / |
| loader_kwargs | str | Keyword arguments to pass when instantiating the loader class. Check the documentation of the loader to see which arguments it accepts. | / |
| load_kwargs | str | Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of the loader to see which arguments it accepts. | / |
| additional_requirements | list | Some loaders require additional dependencies to be installed. You can specify those here. Use a format accepted by `pip install`, e.g. "pypdf" or "pypdf==3.17.1". Unfortunately, additional requirements for LlamaIndex loaders are not well documented, but if a dependency is missing, a clear error message will be thrown. | / |
| n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
| index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |

### Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
"load_with_llamahub",
arguments={
# Add arguments
# "loader_class": ,
# "loader_kwargs": ,
# "load_kwargs": ,
# "additional_requirements": [],
# "n_rows_to_load": 0,
# "index_column": ,
}
)
```
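
For a concrete starting point, here is a hypothetical configuration based on the ArxivReader test shipped with this component (`tests/component_test.py`); the loader and argument values are taken from that test, the rest follows the template above:

```python
from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_with_llamahub",
    arguments={
        "loader_class": "ArxivReader",  # the loader name, not its id
        "loader_kwargs": {},  # ArxivReader needs no constructor arguments
        "load_kwargs": {
            "search_query": "jeff dean",  # forwarded to the loader's load method
            "max_results": 5,
        },
        "additional_requirements": ["pypdf"],  # ArxivReader parses PDFs
        "n_rows_to_load": 5,
    },
)
```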

### Testing

You can run the tests using Docker with BuildKit. From this directory, run:
```
docker build . --target test
```
47 changes: 47 additions & 0 deletions components/load_with_llamahub/fondant_component.yaml
@@ -0,0 +1,47 @@
name: Load with LlamaHub
description: |
Load data using a LlamaHub loader. For available loaders, check the
[LlamaHub](https://llamahub.ai/).
image: fndnt/load_with_llamahub:dev
tags:
- Data loading

produces:
additionalProperties: true

args:
loader_class:
description: |
The name of the LlamaIndex loader class to use. Make sure to provide the name and not the
id. The name is passed to `llama_index.download_loader` to download the specified loader.
type: str
loader_kwargs:
description: |
Keyword arguments to pass when instantiating the loader class. Check the documentation of
the loader to see which arguments it accepts.
type: str
load_kwargs:
description: |
Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of
the loader to see which arguments it accepts.
type: str
additional_requirements:
description: |
Some loaders require additional dependencies to be installed. You can specify those here.
Use a format accepted by `pip install`, e.g. "pypdf" or "pypdf==3.17.1". Unfortunately,
additional requirements for LlamaIndex loaders are not well documented, but if a dependency
is missing, a clear error message will be thrown.
type: list
default: []
n_rows_to_load:
description: |
Optional argument that defines the number of rows to load. Useful for testing pipeline runs
on a small scale
type: int
default: None
index_column:
description: |
Column to set index to in the load component, if not specified a default globally unique
index will be set
type: str
default: None
1 change: 1 addition & 0 deletions components/load_with_llamahub/requirements.txt
@@ -0,0 +1 @@
llama-index==0.9.9
110 changes: 110 additions & 0 deletions components/load_with_llamahub/src/main.py
@@ -0,0 +1,110 @@
import logging
import subprocess
import sys
import typing as t
from collections import defaultdict

import dask.dataframe as dd
import pandas as pd
from fondant.component import DaskLoadComponent
from fondant.core.component_spec import ComponentSpec
from llama_index import download_loader

logger = logging.getLogger(__name__)


class LlamaHubReader(DaskLoadComponent):
def __init__(
self,
spec: ComponentSpec,
*,
loader_class: str,
loader_kwargs: dict,
load_kwargs: dict,
additional_requirements: t.List[str],
n_rows_to_load: t.Optional[int] = None,
index_column: t.Optional[str] = None,
) -> None:
"""
Args:
spec: the component spec
loader_class: The name of the LlamaIndex loader class to use
loader_kwargs: Keyword arguments to pass when instantiating the loader class
load_kwargs: Keyword arguments to pass to the `.load()` method of the loader
additional_requirements: Additional Python requirements to install
n_rows_to_load: optional argument that defines the number of rows to load.
Useful for testing pipeline runs on a small scale.
index_column: Column to set index to in the load component, if not specified a default
globally unique index will be set.
"""
self.n_rows_to_load = n_rows_to_load
self.index_column = index_column
self.spec = spec

self.install_additional_requirements(additional_requirements)

loader_cls = download_loader(loader_class)
self.loader = loader_cls(**loader_kwargs)
self.load_kwargs = load_kwargs

@staticmethod
def install_additional_requirements(additional_requirements: t.List[str]):
for requirement in additional_requirements:
subprocess.check_call( # nosec
[sys.executable, "-m", "pip", "install", requirement],
)

def set_df_index(self, dask_df: dd.DataFrame) -> dd.DataFrame:
if self.index_column is None:
logger.info(
"Index column not specified, setting a globally unique index",
)

def _set_unique_index(dataframe: pd.DataFrame, partition_info=None):
"""Function that sets a unique index based on the partition and row number."""
dataframe["id"] = 1
dataframe["id"] = (
str(partition_info["number"])
+ "_"
+ (dataframe.id.cumsum()).astype(str)
)
dataframe.index = dataframe.pop("id")
return dataframe

def _get_meta_df() -> pd.DataFrame:
meta_dict = {"id": pd.Series(dtype="object")}
for field_name, field in self.spec.produces.items():
meta_dict[field_name] = pd.Series(
dtype=pd.ArrowDtype(field.type.value),
)
return pd.DataFrame(meta_dict).set_index("id")

meta = _get_meta_df()
dask_df = dask_df.map_partitions(_set_unique_index, meta=meta)
else:
logger.info(f"Setting `{self.index_column}` as index")
dask_df = dask_df.set_index(self.index_column, drop=True)

return dask_df

def load(self) -> dd.DataFrame:
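        # Prefer the loader's lazy (generator-based) API and fall back to eager
        # loading for loaders that don't implement it.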
try:
documents = self.loader.lazy_load_data(**self.load_kwargs)
except NotImplementedError:
documents = self.loader.load_data(**self.load_kwargs)

        doc_dict = defaultdict(list)
        for d, document in enumerate(documents):
            # Stop before collecting more than the requested number of rows
            if d == self.n_rows_to_load:
                break

            for column in self.spec.produces:
                if column == "text":
                    doc_dict["text"].append(document.text)
                else:
                    doc_dict[column].append(document.metadata.get(column))

dask_df = dd.from_dict(doc_dict, npartitions=1)

dask_df = self.set_df_index(dask_df)
return dask_df
35 changes: 35 additions & 0 deletions components/load_with_llamahub/tests/component_test.py
@@ -0,0 +1,35 @@
from pathlib import Path

import yaml
from fondant.core.component_spec import ComponentSpec

from src.main import LlamaHubReader


def test_arxiv_reader():
"""Test the component with the ArxivReader.

This test requires a stable internet connection, both to download the loader, and to download
the papers from Arxiv.
"""
with open(Path(__file__).with_name("fondant_component.yaml")) as f:
spec = yaml.safe_load(f)
spec = ComponentSpec(spec)

component = LlamaHubReader(
spec=spec,
loader_class="ArxivReader",
loader_kwargs={},
load_kwargs={
"search_query": "jeff dean",
"max_results": 5,
},
additional_requirements=["pypdf"],
n_rows_to_load=None,
index_column=None,
)

output_dataframe = component.load().compute()

assert len(output_dataframe) > 0
assert output_dataframe.columns.tolist() == ["text", "URL", "Title of this paper"]
50 changes: 50 additions & 0 deletions components/load_with_llamahub/tests/fondant_component.yaml
@@ -0,0 +1,50 @@
name: Load with LlamaHub
description: |
Load data using a LlamaHub loader. For available loaders, check the
[LlamaHub](https://llamahub.ai/).
image: ghcr.io/ml6team/load_with_llamahub:dev

produces:
text:
type: string
URL:
type: string
Title of this paper:
type: string

args:
loader_class:
description: |
The name of the LlamaIndex loader class to use. Make sure to provide the name and not the
id. The name is passed to `llama_index.download_loader` to download the specified loader.
type: str
loader_kwargs:
description: |
Keyword arguments to pass when instantiating the loader class. Check the documentation of
the loader to see which arguments it accepts.
type: str
load_kwargs:
description: |
Keyword arguments to pass to the `.load()` method of the loader. Check the documentation of
the loader to see which arguments it accepts.
type: str
additional_requirements:
description: |
Some loaders require additional dependencies to be installed. You can specify those here.
Use a format accepted by `pip install`, e.g. "pypdf" or "pypdf==3.17.1". Unfortunately,
additional requirements for LlamaIndex loaders are not well documented, but if a dependency
is missing, a clear error message will be thrown.
type: list
default: []
n_rows_to_load:
description: |
Optional argument that defines the number of rows to load. Useful for testing pipeline runs
on a small scale
type: int
default: None
index_column:
description: |
Column to set index to in the load component, if not specified a default globally unique
index will be set
type: str
default: None
2 changes: 2 additions & 0 deletions components/load_with_llamahub/tests/pytest.ini
@@ -0,0 +1,2 @@
[pytest]
pythonpath = ../src
1 change: 1 addition & 0 deletions components/load_with_llamahub/tests/requirements.txt
@@ -0,0 +1 @@
pytest==7.4.2
2 changes: 1 addition & 1 deletion components/write_to_hf_hub/README.md
@@ -1,4 +1,4 @@
-# Write to hub
+# Write to Hugging Face hub

### Description
Component that writes a dataset to the hub
2 changes: 1 addition & 1 deletion components/write_to_hf_hub/fondant_component.yaml
@@ -1,4 +1,4 @@
-name: Write to hub
+name: Write to Hugging Face hub
description: Component that writes a dataset to the hub
image: fndnt/write_to_hf_hub:dev
tags: