Run Datacomp at scale #319

Closed
35 commits (changes shown from 32 commits)
f1dbf76
More fixes
NielsRogge Jul 25, 2023
cd1c7fc
More improvements
NielsRogge Jul 25, 2023
a4fb134
More improvements
NielsRogge Jul 25, 2023
dc83431
Add logging
NielsRogge Jul 25, 2023
ee4a8e0
Update dockerfile
NielsRogge Jul 25, 2023
f3c040f
Fix dtype
NielsRogge Jul 25, 2023
18d7f9a
Update Dockerfile
NielsRogge Jul 25, 2023
2b31bfc
More updates
NielsRogge Jul 26, 2023
471dc43
Update logging
NielsRogge Jul 26, 2023
8fa0218
More improvements
NielsRogge Jul 26, 2023
d3165ac
Update specs
NielsRogge Jul 26, 2023
3a5179d
Improve load_from_hf_hub component
NielsRogge Jul 26, 2023
f436c72
Update specs
NielsRogge Jul 26, 2023
e4986ca
Add task graph
NielsRogge Jul 26, 2023
b09abac
Add graphviz to the dependencies
NielsRogge Jul 26, 2023
ed18fb9
Update Dockerfile
NielsRogge Jul 26, 2023
ddc2ca7
Add more
NielsRogge Jul 26, 2023
68122c8
Add visualize
NielsRogge Jul 26, 2023
a6f6498
More improvements
NielsRogge Jul 26, 2023
5938106
Fix visualization
NielsRogge Jul 26, 2023
4fdf320
Remove line
NielsRogge Jul 26, 2023
57f210e
More improvements
NielsRogge Jul 26, 2023
01b1cd2
Add print statements
NielsRogge Jul 26, 2023
7a055d1
More improvements
NielsRogge Jul 27, 2023
bb08810
More improvements
NielsRogge Jul 27, 2023
eb7550e
Comment out code
NielsRogge Jul 27, 2023
eba1214
More improvements
NielsRogge Jul 27, 2023
4704b88
Remove print statements
NielsRogge Jul 27, 2023
9db9511
Fix repartioning
NielsRogge Jul 28, 2023
85ca6b6
More improvements
NielsRogge Jul 28, 2023
9195f94
More improvements
NielsRogge Jul 28, 2023
3c9ea91
Add download images component
NielsRogge Aug 1, 2023
06b316c
Update script
NielsRogge Aug 1, 2023
7db2865
Remove graphviz
NielsRogge Aug 1, 2023
a366ee0
More improvements
NielsRogge Aug 1, 2023
5 changes: 3 additions & 2 deletions components/filter_image_resolution/Dockerfile
Original file line number Diff line number Diff line change
@@ -3,15 +3,16 @@ FROM --platform=linux/amd64 python:3.8-slim
 # System dependencies
 RUN apt-get update && \
     apt-get upgrade -y && \
-    apt-get install git -y
+    apt-get install git -y && \
+    apt-get install graphviz -y
NielsRogge marked this conversation as resolved.

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
-ARG FONDANT_VERSION=main
+ARG FONDANT_VERSION=f3f3925b8e8f634e2978e5c7fcefa72c53baba7c
Contributor:
@GeorgesLorre do we still need to revert the docker image to main or do they get automatically tagged as dev during merge?

RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
6 changes: 3 additions & 3 deletions components/filter_image_resolution/fondant_component.yaml
@@ -1,14 +1,14 @@
 name: Filter image resolution
 description: Component that filters images based on minimum size and max aspect ratio
-image: ghcr.io/ml6team/filter_image_resolution:latest
+image: ghcr.io/ml6team/filter_image_resolution:f3f3925b8e8f634e2978e5c7fcefa72c53baba7c

 consumes:
   image:
     fields:
       width:
-        type: int16
+        type: int64
Contributor:
int16 maxes out at 32,767, which is why you may run into issues if an image's width is 36,000.
However, I think moving to int32 should resolve this.

       height:
-        type: int16
+        type: int64

 args:
   min_image_dim:
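
To make the review comment on the `width` type concrete, here is a small sketch in plain Python of the signed integer ranges involved. The helper name `fits_signed` is made up for illustration and is not part of Fondant; the sample widths are arbitrary.

```python
def fits_signed(value: int, bits: int) -> bool:
    """Return True if value is representable as a signed integer of the given bit width."""
    lower = -(2 ** (bits - 1))
    upper = 2 ** (bits - 1) - 1
    return lower <= value <= upper


# Largest value a signed 16-bit integer can hold.
INT16_MAX = 2 ** 15 - 1

for width in (1024, 32_767, 36_000):
    # Print whether each width fits in int16 and in int32.
    print(width, fits_signed(width, 16), fits_signed(width, 32))
```

Under these ranges a width of 36,000 does not fit in int16 but does fit in int32, which is consistent with the reviewer's suggestion that int32 would already be enough.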
5 changes: 3 additions & 2 deletions components/load_from_hf_hub/Dockerfile
@@ -3,15 +3,16 @@ FROM --platform=linux/amd64 python:3.8-slim
 # System dependencies
 RUN apt-get update && \
     apt-get upgrade -y && \
-    apt-get install git -y
+    apt-get install git -y && \
+    apt-get install graphviz -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
-ARG FONDANT_VERSION=main
+ARG FONDANT_VERSION=f3f3925b8e8f634e2978e5c7fcefa72c53baba7c
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
9 changes: 7 additions & 2 deletions components/load_from_hf_hub/fondant_component.yaml
@@ -1,6 +1,6 @@
 name: Load from hub
 description: Component that loads a dataset from the hub
-image: ghcr.io/ml6team/load_from_hf_hub:dev
+image: ghcr.io/ml6team/load_from_hf_hub:f3f3925b8e8f634e2978e5c7fcefa72c53baba7c

 produces:
   dummy_variable: #TODO: fill in here
@@ -23,4 +23,9 @@ args:
   n_rows_to_load:
     description: Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale
     type: int
-    default: None
+    default: None
+  dataset_length:
Contributor:
I would avoid having the user specify the length of the dataset and instead read it directly from the HF metadata.

+    description: Optional argument that defines the length of the dataset. Required in case `n_rows_to_load` is specified.
+    type: int
+    default: None
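
As a rough sketch of the reviewer's suggestion, the dataset length could be read from the Hugging Face Hub metadata and then used to estimate how many partitions cover `n_rows_to_load`. This is an illustration under assumptions, not the component's implementation: the function names and the `"train"` split default are hypothetical, partitions are assumed to be roughly equal in size, and `dataset_length_from_hub` needs the `datasets` package and network access.

```python
import math
from typing import Optional


def dataset_length_from_hub(dataset_name: str, split: str = "train") -> Optional[int]:
    """Read the number of rows from the dataset metadata, without downloading the data.

    Returns None when the metadata does not list the requested split.
    """
    # Imported lazily so the helper below works without `datasets` installed.
    from datasets import load_dataset_builder

    splits = load_dataset_builder(dataset_name).info.splits
    if not splits or split not in splits:
        return None
    return splits[split].num_examples


def partitions_to_read(n_rows_to_load: int, dataset_length: int, total_partitions: int) -> int:
    """Estimate how many partitions cover n_rows_to_load, assuming equal-sized partitions."""
    rows_per_partition = max(1, math.ceil(dataset_length / total_partitions))
    needed = math.ceil(n_rows_to_load / rows_per_partition)
    # Always read at least one partition and never more than exist.
    return min(max(needed, 1), total_partitions)
```

Rounding up and clamping to at least one partition is a deliberate choice in this sketch: plain floor division yields zero partitions whenever `n_rows_to_load` is smaller than a single partition.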

25 changes: 22 additions & 3 deletions components/load_from_hf_hub/src/main.py
@@ -16,6 +16,7 @@ def __init__(self, *_,
                  column_name_mapping: dict,
                  image_column_names: t.Optional[list],
                  n_rows_to_load: t.Optional[int],
+                 dataset_length: int,
                  ) -> None:
         """
         Args:
@@ -25,11 +26,14 @@ def __init__(self, *_,
                 format the image from HF hub format to a byte string
             n_rows_to_load: optional argument that defines the number of rows to load. Useful for
                 testing pipeline runs on a small scale.
+            dataset_length: optional argument that specifies the length of the entire dataset. Only
+                required in case n_rows_to_load is specified.
         """
         self.dataset_name = dataset_name
         self.column_name_mapping = column_name_mapping
         self.image_column_names = image_column_names
         self.n_rows_to_load = n_rows_to_load
+        self.dataset_length = dataset_length

     def load(self) -> dd.DataFrame:
         # 1) Load data, read as Dask dataframe
@@ -44,12 +48,27 @@ def load(self) -> dd.DataFrame:
         )

         # 3) Rename columns
+        logger.info("Renaming columns...")
         dask_df = dask_df.rename(columns=self.column_name_mapping)

         # 4) Optional: only return specific amount of rows
-        if self.n_rows_to_load:
-            dask_df = dask_df.head(self.n_rows_to_load)
-            dask_df = dd.from_pandas(dask_df, npartitions=1)
+        if self.n_rows_to_load is not None:
+            if self.dataset_length is None:
+                raise ValueError("""Make sure to also specify the length of the entire
+                                    dataset. This is required as otherwise only the first
+                                    partition can be loaded""")
+            logger.info(f"Loading approximately {self.n_rows_to_load} rows...")
+            partition_length = self.dataset_length // dask_df.npartitions
+            npartitions = self.n_rows_to_load // partition_length
+            dask_df = dask_df.head(self.n_rows_to_load, npartitions=npartitions)
+            dask_df = dd.from_pandas(dask_df, npartitions=npartitions)
+        # .reset_index(drop=True) # will reset it from 0 for every partition
+
+        # Set monotonically increasing index
+        logger.info("Setting the index...")
+        dask_df["id"] = 1
+        dask_df["id"] = dask_df.id.cumsum()
+        dask_df = dask_df.set_index("id", sort=True)
Comment on lines +68 to +72
Contributor Author

return dask_df
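
The diff sets the index by filling an `id` column with ones and taking a cumulative sum, so that ids keep increasing across partitions instead of restarting at 0 in each one. The same idea can be sketched without Dask: here partitions are modelled as plain lists, and `global_ids` is a hypothetical helper written only for this illustration.

```python
from itertools import accumulate
from typing import List, Sequence


def global_ids(partitions: Sequence[Sequence[object]]) -> List[List[int]]:
    """Assign 1-based ids that keep increasing across partition boundaries."""
    lengths = [len(partition) for partition in partitions]
    # Number of rows that precede each partition: 0 for the first, then running totals.
    offsets = [0] + list(accumulate(lengths))[:-1]
    return [
        [offset + position for position in range(1, length + 1)]
        for offset, length in zip(offsets, lengths)
    ]


partitions = [["a", "b", "c"], ["d", "e"], ["f"]]
print(global_ids(partitions))  # [[1, 2, 3], [4, 5], [6]]
```

A per-partition reset, by contrast, would give every partition ids starting from the same value, which is the problem noted in the commented-out `.reset_index(drop=True)` line.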

@@ -11,7 +11,7 @@ RUN pip3 install --no-cache-dir -r requirements.txt

 # Install Fondant
 # This is split from other requirements to leverage caching
-ARG FONDANT_VERSION=main
+ARG FONDANT_VERSION=79df895e9d62d2010ccb8d40ee7e4fd4c68f117d
 RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

 # Set the working directory to the component folder
@@ -16,7 +16,7 @@
 class ClusterImageEmbeddingsComponent(DaskTransformComponent):
     """Component that clusters images based on embeddings."""

-    def __init__(self, sample_ratio: float, num_clusters: int) -> None:
+    def __init__(self, *_, sample_ratio: float, num_clusters: int) -> None:
         self.sample_ratio = sample_ratio
         self.num_clusters = num_clusters
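
The only change in this hunk is the added `*_` in the constructor signature. A minimal sketch of what that does in plain Python follows; `ExampleComponent` is a hypothetical stand-in and does not inherit from any Fondant class.

```python
class ExampleComponent:
    """Hypothetical stand-in for a component class; not part of Fondant."""

    def __init__(self, *_, sample_ratio: float, num_clusters: int) -> None:
        # `*_` swallows any positional arguments, so the named parameters
        # after it can only be passed by keyword.
        self.sample_ratio = sample_ratio
        self.num_clusters = num_clusters


# Extra positional arguments are accepted and ignored.
component = ExampleComponent("ignored", 42, sample_ratio=0.1, num_clusters=8)

try:
    # Passing the values positionally no longer binds them to the parameters.
    ExampleComponent(0.1, 8)
except TypeError as error:
    print(f"TypeError: {error}")
```

One plausible motivation is that the caller may pass additional positional arguments that the component does not need, but the diff itself does not state the reason.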

23 changes: 23 additions & 0 deletions examples/pipelines/datacomp/components/download_images/Dockerfile
@@ -0,0 +1,23 @@
FROM --platform=linux/amd64 python:3.8-slim

# System dependencies
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
COPY src/ .

ENTRYPOINT ["python", "main.py"]
12 changes: 12 additions & 0 deletions examples/pipelines/datacomp/components/download_images/README.md
@@ -0,0 +1,12 @@
# download_images

### Description
This component takes in image URLs as input and downloads the images, along with some metadata (like their height and width).
The images are stored in a new column as bytes objects. This component also resizes the images using the [resizer](https://github.com/rom1504/img2dataset/blob/main/img2dataset/resizer.py) function from the img2dataset library.

If the component is unable to retrieve the image at a URL (for any reason), it will return `None` for that particular URL.
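
The "return `None` on failure" behaviour described above can be sketched with the standard library alone. This is an assumption-laden illustration rather than the component's actual code: `download_image` is a hypothetical name, the `timeout` and `retries` parameters only mirror the argument names in the component spec, and the User-Agent string is arbitrary.

```python
import urllib.request
from typing import Optional


def download_image(url: str, timeout: int = 10, retries: int = 0) -> Optional[bytes]:
    """Return the raw bytes at `url`, or None if every attempt fails."""
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(url, headers={"User-Agent": "example-downloader"})
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.read()
        except Exception:
            # Any failure (malformed URL, timeout, HTTP error) counts as a failed attempt.
            continue
    return None


# A malformed URL fails immediately, so the function returns None.
print(download_image("not-a-valid-url"))
```

Catching every exception keeps one bad URL from failing a whole partition, at the cost of hiding the reason for the failure; logging the exception would be a natural extension.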

### **Inputs/Outputs**

See [`fondant_component.yaml`](fondant_component.yaml) for a more detailed description of all the input/output parameters.

@@ -0,0 +1,49 @@
name: Download images
description: Component that downloads images based on URLs
image: ghcr.io/ml6team/download_images:dev

consumes:
image:
fields:
url:
type: string

produces:
image:
fields:
data:
type: binary
width:
type: int16
height:
type: int16

args:
timeout:
description: Maximum time (in seconds) to wait when trying to download an image
type: int
default: 10
retries:
description: Number of times to retry downloading an image if it fails.
type: int
default: 0
image_size:
description: Size of the images after resizing.
type: int
default: 256
resize_mode:
description: Resize mode to use. One of "no", "keep_ratio", "center_crop", "border".
type: str
default: 'border'
resize_only_if_bigger:
description: If True, resize only if image is bigger than image_size.
type: bool
default: 'False'
min_image_size:
description: Minimum size of the images.
type: int
default: 0
max_aspect_ratio:
description: Maximum aspect ratio of the images.
type: float
default: 'inf'
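
To show how `min_image_size` and `max_aspect_ratio` might interact, here is a small sketch of a size filter. The exact rules are assumptions: the minimum is applied to the shorter side, and the aspect ratio is taken as longer side divided by shorter side. The component may define them differently, and `keep_image` is a hypothetical helper.

```python
def keep_image(
    width: int,
    height: int,
    min_image_size: int = 0,
    max_aspect_ratio: float = float("inf"),
) -> bool:
    """Return True if the image passes both the size and the aspect ratio check."""
    if width <= 0 or height <= 0:
        return False
    # Assumption: the minimum size applies to the shorter side.
    if min(width, height) < min_image_size:
        return False
    # Assumption: aspect ratio is longer side divided by shorter side.
    aspect_ratio = max(width, height) / min(width, height)
    return aspect_ratio <= max_aspect_ratio


print(keep_image(512, 512))
print(keep_image(100, 512, min_image_size=128))
print(keep_image(2000, 500, max_aspect_ratio=3.0))
```

With the defaults from the spec (`0` and infinity), every image with positive dimensions passes, so the filter only has an effect once one of the two arguments is set.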
@@ -0,0 +1,2 @@
albumentations==1.3.0
opencv-python-headless>=4.5.5.62,<5