Use cleaner field names in reusable components (#679)
This PR cleans up the field names in the reusable components. The old names (e.g. `images_data`, `captions_text`) were just concatenations of the former subset and field names, left over from the initial migration away from subsets.

I ran the tests for all components, fixed the outdated ones, and standardized the test directory structure. Each `tests` directory now has a `pytest.ini` so the `PYTHONPATH` is set correctly both inside and outside of Docker, `test_requirements.txt` was moved into the `tests` directory as `requirements.txt`, and the `Dockerfile` was updated accordingly.
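As a rough sketch of the renaming pattern (using the `caption_images` field names shown in the diff below; other components follow the same scheme), a `fondant_component.yaml` changes along these lines:

```yaml
# Before: field names concatenated from the old subset and field names
consumes:
  images_data:
    type: binary
produces:
  captions_text:
    type: utf8
---
# After: cleaner, standalone field names
consumes:
  image:
    type: binary
produces:
  caption:
    type: utf8
```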
RobbeSneyders authored Nov 28, 2023
1 parent 6a84677 commit 197ac59
Showing 122 changed files with 305 additions and 298 deletions.
4 changes: 1 addition & 3 deletions components/caption_images/Dockerfile
@@ -17,12 +17,10 @@ RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team
# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/
ENV PYTHONPATH "${PYTHONPATH}:./src"

FROM base as test
COPY test_requirements.txt .
RUN pip3 install --no-cache-dir -r test_requirements.txt
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
4 changes: 2 additions & 2 deletions components/caption_images/README.md
@@ -7,11 +7,11 @@ This component captions images using a BLIP model from the Hugging Face hub

**This component consumes:**

- images_data: binary
- image: binary

**This component produces:**

- captions_text: string
- caption: string

### Arguments

4 changes: 2 additions & 2 deletions components/caption_images/fondant_component.yaml
@@ -5,11 +5,11 @@ tags:
- Image processing

consumes:
images_data:
image:
type: binary

produces:
captions_text:
caption:
type: utf8

args:
4 changes: 2 additions & 2 deletions components/caption_images/src/main.py
@@ -90,7 +90,7 @@ def __init__(
self.max_new_tokens = max_new_tokens

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
images = dataframe["images_data"]
images = dataframe["image"]

results: t.List[pd.Series] = []
for batch in np.split(
@@ -112,4 +112,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
).T
results.append(captions)

return pd.concat(results).to_frame(name=("captions_text"))
return pd.concat(results).to_frame(name="caption")
File renamed without changes.
4 changes: 2 additions & 2 deletions components/caption_images/tests/test_caption_images.py
@@ -10,11 +10,11 @@ def test_image_caption_component():
"https://cdn.pixabay.com/photo/2023/07/19/18/56/japanese-beetle-8137606_1280.png",
]
input_dataframe = pd.DataFrame(
{"images": {"data": [requests.get(url).content for url in image_urls]}},
{"image": [requests.get(url).content for url in image_urls]},
)

expected_output_dataframe = pd.DataFrame(
data={("captions", "text"): {0: "a motorcycle", 1: "a beetle"}},
data={"caption": {0: "a motorcycle", 1: "a beetle"}},
)

component = CaptionImagesComponent(
6 changes: 2 additions & 4 deletions components/chunk_text/Dockerfile
@@ -17,14 +17,12 @@ RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team
# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/
ENV PYTHONPATH "${PYTHONPATH}:./src"

FROM base as test
COPY test_requirements.txt .
RUN pip3 install --no-cache-dir -r test_requirements.txt
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
ENTRYPOINT ["fondant", "execute", "main"]
6 changes: 3 additions & 3 deletions components/chunk_text/README.md
@@ -11,12 +11,12 @@ consists of the id of the original document followed by the chunk index.

**This component consumes:**

- text_data: string
- text: string

**This component produces:**

- text_data: string
- text_original_document_id: string
- text: string
- original_document_id: string

### Arguments

6 changes: 3 additions & 3 deletions components/chunk_text/fondant_component.yaml
@@ -10,13 +10,13 @@ tags:
- Text processing

consumes:
text_data:
text:
type: string

produces:
text_data:
text:
type: string
text_original_document_id:
original_document_id:
type: string

args:
4 changes: 2 additions & 2 deletions components/chunk_text/src/main.py
@@ -38,7 +38,7 @@ def __init__(
def chunk_text(self, row) -> t.List[t.Tuple]:
# Multi-index df has id under the name attribute
doc_id = row.name
text_data = row[("text_data")]
text_data = row["text"]
docs = self.text_splitter.create_documents([text_data])
return [
(doc_id, f"{doc_id}_{chunk_id}", chunk.page_content)
@@ -59,7 +59,7 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
# Turn into dataframes
results_df = pd.DataFrame(
results,
columns=["text_original_document_id", "id", "text_data"],
columns=["original_document_id", "id", "text"],
)
results_df = results_df.set_index("id")

6 changes: 3 additions & 3 deletions components/chunk_text/tests/chunk_text_test.py
@@ -7,7 +7,7 @@ def test_transform():
"""Test chunk component method."""
input_dataframe = pd.DataFrame(
{
("text_data"): [
"text": [
"Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo",
"ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis",
"parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec,",
@@ -25,8 +25,8 @@ def test_transform():

expected_output_dataframe = pd.DataFrame(
{
("text_original_document_id"): ["a", "a", "a", "b", "b", "c", "c"],
("text_data"): [
"original_document_id": ["a", "a", "a", "b", "b", "c", "c"],
"text": [
"Lorem ipsum dolor sit amet, consectetuer",
"amet, consectetuer adipiscing elit. Aenean",
"elit. Aenean commodo",
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -26,9 +26,9 @@ right side is border-cropped image.

**This component produces:**

- images_data: binary
- images_width: int32
- images_height: int32
- image: binary
- image_width: int32
- image_height: int32

### Arguments

@@ -47,14 +47,14 @@ You can add this component to your pipeline using the following code:
from fondant.pipeline import ComponentOp


image_cropping_op = ComponentOp.from_registry(
name="image_cropping",
crop_images_op = ComponentOp.from_registry(
name="crop_images",
arguments={
# Add arguments
# "cropping_threshold": -30,
# "padding": 10,
}
)
pipeline.add_op(image_cropping_op, dependencies=[...]) #Add previous component as dependency
pipeline.add_op(crop_images_op, dependencies=[...]) #Add previous component as dependency
```

@@ -24,11 +24,11 @@ consumes:
type: binary

produces:
images_data:
image:
type: binary
images_width:
image_width:
type: int32
images_height:
image_height:
type: int32

args:
File renamed without changes.
File renamed without changes.
@@ -46,12 +46,12 @@ def __init__(

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
# crop images
dataframe["images_data"] = dataframe["images_data"].apply(
dataframe["image"] = dataframe["image"].apply(
lambda image: remove_borders(image, self.cropping_threshold, self.padding),
)

# extract width and height
dataframe["images_width", "images_height"] = dataframe["images_data"].apply(
dataframe["image_width", "image_height"] = dataframe["image"].apply(
extract_dimensions,
axis=1,
result_type="expand",
6 changes: 2 additions & 4 deletions components/download_images/Dockerfile
@@ -17,14 +17,12 @@ RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team
# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/
ENV PYTHONPATH "${PYTHONPATH}:./src"

FROM base as test
COPY test_requirements.txt .
RUN pip3 install --no-cache-dir -r test_requirements.txt
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
ENTRYPOINT ["fondant", "execute", "main"]
8 changes: 4 additions & 4 deletions components/download_images/README.md
@@ -14,13 +14,13 @@ from the img2dataset library.

**This component consumes:**

- images_url: string
- image_url: string

**This component produces:**

- images_data: binary
- images_width: int32
- images_height: int32
- image: binary
- image_width: int32
- image_height: int32

### Arguments

8 changes: 4 additions & 4 deletions components/download_images/fondant_component.yaml
@@ -13,15 +13,15 @@ tags:
- Image processing

consumes:
images_url:
image_url:
type: string

produces:
images_data:
image:
type: binary
images_width:
image_width:
type: int32
images_height:
image_height:
type: int32

args:
4 changes: 2 additions & 2 deletions components/download_images/src/main.py
@@ -119,14 +119,14 @@ async def download_dataframe() -> None:
images = await asyncio.gather(
*[
self.download_and_resize_image(id_, url, semaphore=semaphore)
for id_, url in zip(dataframe.index, dataframe["images_url"])
for id_, url in zip(dataframe.index, dataframe["image_url"])
],
)
results.extend(images)

asyncio.run(download_dataframe())

columns = ["id", "data", "width", "height"]
columns = ["id", "image", "image_width", "image_height"]
if results:
results_df = pd.DataFrame(results, columns=columns)
else:
2 changes: 2 additions & 0 deletions components/download_images/tests/pytest.ini
@@ -0,0 +1,2 @@
[pytest]
pythonpath = ../src
2 changes: 2 additions & 0 deletions components/download_images/tests/requirements.txt
@@ -0,0 +1,2 @@
pytest==7.4.0
respx==0.20.2
8 changes: 4 additions & 4 deletions components/download_images/tests/test_component.py
@@ -45,7 +45,7 @@ def test_transform(respx_mock):

input_dataframe = pd.DataFrame(
{
"images_url": urls,
"image_url": urls,
},
index=pd.Index(ids, name="id"),
)
@@ -55,9 +55,9 @@ def test_transform(respx_mock):
resized_images = [component.resizer(io.BytesIO(image))[0] for image in images]
expected_dataframe = pd.DataFrame(
{
"images_data": resized_images,
"images_width": [image_size] * len(ids),
"images_height": [image_size] * len(ids),
"image": resized_images,
"image_width": [image_size] * len(ids),
"image_height": [image_size] * len(ids),
},
index=pd.Index(ids, name="id"),
)
4 changes: 2 additions & 2 deletions components/embed_images/README.md
@@ -7,11 +7,11 @@ Component that generates CLIP embeddings from images

**This component consumes:**

- images_data: binary
- image: binary

**This component produces:**

- embeddings_data: list<item: float>
- embedding: list<item: float>

### Arguments

4 changes: 2 additions & 2 deletions components/embed_images/fondant_component.yaml
@@ -5,11 +5,11 @@ tags:
- Image processing

consumes:
images_data:
image:
type: binary

produces:
embeddings_data:
embedding:
type: array
items:
type: float32
4 changes: 2 additions & 2 deletions components/embed_images/src/main.py
@@ -90,7 +90,7 @@ def __init__(
self.batch_size = batch_size

def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
images = dataframe["images_data"]
images = dataframe["image"]

results: t.List[pd.Series] = []
for batch in np.split(
@@ -110,4 +110,4 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
).T
results.append(embeddings)

return pd.concat(results).to_frame(name=("embeddings_data"))
return pd.concat(results).to_frame(name="embedding")
6 changes: 2 additions & 4 deletions components/embed_text/Dockerfile
@@ -17,14 +17,12 @@ RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team
# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/
ENV PYTHONPATH "${PYTHONPATH}:./src"

FROM base as test
COPY test_requirements.txt .
RUN pip3 install --no-cache-dir -r test_requirements.txt
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
ENTRYPOINT ["fondant", "execute", "main"]
5 changes: 2 additions & 3 deletions components/embed_text/README.md
@@ -7,12 +7,11 @@ Component that generates embeddings of text passages.

**This component consumes:**

- text_data: string
- text: string

**This component produces:**

- text_data: string
- text_embedding: list<item: float>
- embedding: list<item: float>

### Arguments
