Run Datacomp at scale #319

Closed · wants to merge 35 commits from the fix_pipeline branch

Conversation

NielsRogge (Contributor):

This PR includes updates made to the Fondant codebase to make it run at scale for Datacomp.

Comment on lines +67 to +71
# Set monotonically increasing index
logger.info("Setting the index...")
dask_df["id"] = 1
dask_df["id"] = dask_df.id.cumsum()
dask_df = dask_df.set_index("id", sort=True)
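
For context, a minimal runnable sketch (not from the PR) of this indexing trick: cumulatively summing a constant column of ones yields a global, monotonically increasing 1..N id that spans all partitions.

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": list("abcdef")}), npartitions=3)
ddf["id"] = 1
ddf["id"] = ddf.id.cumsum()  # the cumulative sum runs across partition boundaries
ddf = ddf.set_index("id", sort=True)
print(ddf.compute())  # index is 1..6, monotonically increasing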

PhilippeMoussalli (Contributor) left a comment:

Thanks Niels! I'll try to reproduce the steps and investigate the merging issue.


consumes:
  image:
    fields:
      width:
-        type: int16
+        type: int64

Contributor:

int16 maxes out at 32,767, which is why you may be having issues if your image's width was 36,000. However, I think this should be resolved if you move to int32.
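
A quick numpy sketch (not from the PR) illustrating the overflow; the width value is the example from this thread:

import numpy as np

width = np.array([36_000])
print(width.astype(np.int16))  # [-29536]: 36000 wraps past the int16 max of 32767
print(width.astype(np.int32))  # [36000]: fits comfortably in int32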


# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
-ARG FONDANT_VERSION=main
+ARG FONDANT_VERSION=f3f3925b8e8f634e2978e5c7fcefa72c53baba7c

Contributor:

@GeorgesLorre do we still need to revert the Docker image to main, or does it get automatically tagged as dev during merge?

default: None
dataset_length:

Contributor:

I would avoid having the user specify the length of the dataset and instead read it from the HF metadata directly.
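
A minimal sketch of that idea (the dataset name is a placeholder, and this assumes the Hugging Face datasets library):

from datasets import load_dataset_builder

builder = load_dataset_builder("some_org/some_dataset")  # placeholder dataset name
dataset_length = builder.info.splits["train"].num_examples  # length from the HF metadata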

@@ -0,0 +1,159 @@
"""

Contributor:

What are the reasons for duplicating this component and not making the changes directly to the one in the registry?

Contributor Author:

I see that the download_images component has some hardcoded column names here:

dataframe.columns = [
    "images_data",
    "images_width",
    "images_height",
    ...

However, in my case they are prefixed with "image" instead of "images".
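
A hypothetical sketch (not the component's actual fix) of how the hardcoding could be avoided, assuming the same dataframe as in the snippet above:

# Hypothetical: infer the subset prefix ("image" vs "images") from the incoming columns
prefix = dataframe.columns[0].split("_")[0]
dataframe.columns = [f"{prefix}_{suffix}" for suffix in ("data", "width", "height")]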

components/filter_image_resolution/Dockerfile (outdated comment, marked resolved)
@@ -75,7 +75,7 @@ for dir in "${components_to_build[@]}"; do

echo "Updating the image version in the fondant_component.yaml with:"
echo "${full_image_names[0]}"
sed -i "s|^image: .*|image: ${full_image_names[0]}|" fondant_component.yaml
sed -i '' "s|^image: .*|image: ${full_image_names[0]}|" fondant_component.yaml

Contributor:

What is the extra '' for?

Contributor Author:

On macOS, BSD sed requires an explicit backup-suffix argument after -i; the empty string '' means edit in place without creating a backup file. (GNU sed on Linux doesn't accept a separate '' argument there, so this change makes the script macOS-specific.)

@@ -92,6 +93,9 @@ def _load_subset(self, subset_name: str, fields: t.List[str]) -> dd.DataFrame:

subset_df = dd.read_parquet(remote_path, columns=fields)

logger.info(f"First few rows of subset {subset_name}:")

Contributor:

I guess this is only for debugging and can be omitted.

for name, subset in self.component_spec.consumes.items():
    fields = list(subset.fields.keys())
    subset_df = self._load_subset(name, fields)
    # left joins -> filter on index
    # make sure that the dataframe has the same number of partitions as the subset
    dataframe = dataframe.repartition(npartitions=subset_df.npartitions)
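
For context, a minimal runnable sketch (my assumption about the motivation, not the PR's code) of the pattern: when both frames share the same index and partitioning, Dask can perform the index-based left join partition by partition instead of shuffling.

import pandas as pd
import dask.dataframe as dd

base = dd.from_pandas(pd.DataFrame({"a": range(6)}, index=range(6)), npartitions=2)
subset = dd.from_pandas(pd.DataFrame({"b": range(6)}, index=range(6)), npartitions=3)

base = base.repartition(npartitions=subset.npartitions)  # align partition counts
joined = base.merge(subset, how="left", left_index=True, right_index=True)
print(joined.compute())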

Contributor:

I'll have a look at this; eventually we could even flag it as an issue on the Dask repo.

@@ -196,6 +206,12 @@ def write_dataframe(self, dataframe: dd.DataFrame) -> None:

dataframe.index = dataframe.index.rename("id").astype("string")

# logging.info("Visualizing task graph...")

Contributor:

If I have custom code that I use for testing but don't want to commit, I usually use

git add -p <filename>

which lets you commit only selected chunks of the file; this can also be done in the IDE.

Then, if I want to switch to another branch without losing the testing code, I stash the changes under a descriptive name:

git stash save datacomp-scale-test

You can then fetch it back using

git stash apply <stash_id>

@@ -7,6 +7,7 @@
"int8",
"int16",
"int32",
"int64",

Contributor:

This can be kept; I would just check whether the previous component can run at int32 to reduce the memory footprint.
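
As a rough illustration (the array size is arbitrary), int32 halves the per-value memory relative to int64:

import numpy as np

widths = np.arange(1_000_000, dtype="int64")
print(widths.nbytes)                  # 8000000 bytes (8 per value)
print(widths.astype("int32").nbytes)  # 4000000 bytes (4 per value)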

NielsRogge (Contributor Author):

Closing this PR in favor of a more up-to-date branch.

@NielsRogge closed this on Aug 8, 2023
@NielsRogge mentioned this pull request on Aug 8, 2023
@janvanlooyml6 deleted the fix_pipeline branch on January 9, 2024