Run Datacomp at scale #319

Closed · wants to merge 35 commits from the fix_pipeline branch

Conversation

NielsRogge (Contributor):

This PR includes updates made to the Fondant codebase to make it run at scale for Datacomp.

Comment on lines +67 to +71
# Set monotonically increasing index
logger.info("Setting the index...")
dask_df["id"] = 1
dask_df["id"] = dask_df.id.cumsum()
dask_df = dask_df.set_index("id", sort=True)
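
For context, a minimal runnable sketch (not from the PR) of this indexing trick: cumulatively summing a constant column of ones yields a global, monotonically increasing 1..N id that spans all partitions.

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": list("abcdef")}), npartitions=3)
ddf["id"] = 1
ddf["id"] = ddf.id.cumsum()  # the cumulative sum runs across partition boundaries
ddf = ddf.set_index("id", sort=True)
print(ddf.compute())  # index is 1..6, monotonically increasing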

PhilippeMoussalli (Contributor) left a comment:

Thanks Niels! I'll try to reproduce the steps and investigate the merging issue.


consumes:
  image:
    fields:
      width:
-        type: int16
+        type: int64

Contributor:

int16 maxes out at 32,767, which is why you may be having issues if your image's width was 36,000. However, I think this should be resolved if you move to int32.
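
A quick numpy sketch (not from the PR) illustrating the overflow; the width value is the example from this thread:

import numpy as np

width = np.array([36_000])
print(width.astype(np.int16))  # [-29536]: 36000 wraps past the int16 max of 32767
print(width.astype(np.int32))  # [36000]: fits comfortably in int32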


# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
-ARG FONDANT_VERSION=main
+ARG FONDANT_VERSION=f3f3925b8e8f634e2978e5c7fcefa72c53baba7c

Contributor:

@GeorgesLorre do we still need to revert the Docker image to main, or does it get automatically tagged as dev during merge?

default: None
dataset_length:

Contributor:

I would avoid having the user specify the length of the dataset and instead read it from the HF metadata directly.
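
A minimal sketch of that idea (the dataset name is a placeholder, and this assumes the Hugging Face datasets library):

from datasets import load_dataset_builder

builder = load_dataset_builder("some_org/some_dataset")  # placeholder dataset name
dataset_length = builder.info.splits["train"].num_examples  # length from the HF metadata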

@@ -0,0 +1,159 @@
"""

Contributor:

What are the reasons for duplicating this component and not making the changes directly to the one in the registry?

Contributor Author:

I see that the download_images component has some hardcoded column names here:

dataframe.columns = [
    "images_data",
    "images_width",
    "images_height",
    ...

However, in my case they are prefixed with "image" instead of "images".
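
A hypothetical sketch (not the component's actual fix) of how the hardcoding could be avoided, assuming the same dataframe as in the snippet above:

# Hypothetical: infer the subset prefix ("image" vs "images") from the incoming columns
prefix = dataframe.columns[0].split("_")[0]
dataframe.columns = [f"{prefix}_{suffix}" for suffix in ("data", "width", "height")]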

components/filter_image_resolution/Dockerfile (outdated comment, marked resolved)
@@ -75,7 +75,7 @@ for dir in "${components_to_build[@]}"; do

echo "Updating the image version in the fondant_component.yaml with:"
echo "${full_image_names[0]}"
sed -i "s|^image: .*|image: ${full_image_names[0]}|" fondant_component.yaml
sed -i '' "s|^image: .*|image: ${full_image_names[0]}|" fondant_component.yaml

Contributor:

What is the extra '' for?

Contributor Author:

On macOS, BSD sed requires an explicit backup-suffix argument after -i; the empty string '' means edit in place without creating a backup file. (GNU sed on Linux doesn't accept a separate '' argument there, so this change makes the script macOS-specific.)

@@ -92,6 +93,9 @@ def _load_subset(self, subset_name: str, fields: t.List[str]) -> dd.DataFrame:

subset_df = dd.read_parquet(remote_path, columns=fields)

logger.info(f"First few rows of subset {subset_name}:")

Contributor:

I guess this is only for debugging and can be omitted.

for name, subset in self.component_spec.consumes.items():
    fields = list(subset.fields.keys())
    subset_df = self._load_subset(name, fields)
    # left joins -> filter on index
    # make sure that the dataframe has the same number of partitions as the subset
    dataframe = dataframe.repartition(npartitions=subset_df.npartitions)
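
For context, a minimal runnable sketch (my assumption about the motivation, not the PR's code) of the pattern: when both frames share the same index and partitioning, Dask can perform the index-based left join partition by partition instead of shuffling.

import pandas as pd
import dask.dataframe as dd

base = dd.from_pandas(pd.DataFrame({"a": range(6)}, index=range(6)), npartitions=2)
subset = dd.from_pandas(pd.DataFrame({"b": range(6)}, index=range(6)), npartitions=3)

base = base.repartition(npartitions=subset.npartitions)  # align partition counts
joined = base.merge(subset, how="left", left_index=True, right_index=True)
print(joined.compute())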

Contributor:

I'll have a look at this; eventually we could even flag it as an issue on the Dask repo.

@@ -196,6 +206,12 @@ def write_dataframe(self, dataframe: dd.DataFrame) -> None:

dataframe.index = dataframe.index.rename("id").astype("string")

# logging.info("Visualizing task graph...")

Contributor:

If I have custom code that I use for testing but don't want to commit, I usually use

git add -p <filename>

which lets you commit only selected chunks of the file; this can also be done in the IDE.

Then, if I want to switch to another branch without losing the testing code, I stash the changes under a descriptive name:

git stash save datacomp-scale-test

You can then fetch it back using

git stash apply <stash_id>

@@ -7,6 +7,7 @@
"int8",
"int16",
"int32",
"int64",

Contributor:

This can be kept; I would just check whether the previous component can run at int32 to reduce the memory footprint.
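
As a rough illustration (the array size is arbitrary), int32 halves the per-value memory relative to int64:

import numpy as np

widths = np.arange(1_000_000, dtype="int64")
print(widths.nbytes)                  # 8000000 bytes (8 per value)
print(widths.astype("int32").nbytes)  # 4000000 bytes (4 per value)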

NielsRogge (Contributor Author):

Closing this PR in favor of a more up-to-date branch.

@NielsRogge closed this on Aug 8, 2023
@NielsRogge mentioned this pull request on Aug 8, 2023
@janvanlooyml6 deleted the fix_pipeline branch on January 9, 2024