-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run Datacomp at scale #319
Conversation
# Set monotonically increasing index | ||
logger.info("Setting the index...") | ||
dask_df["id"] = 1 | ||
dask_df["id"] = dask_df.id.cumsum() | ||
dask_df = dask_df.set_index("id", sort=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7be7e53
to
9195f94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Niels!
I'll try to reproduce the steps and investigate the merging issue
|
||
consumes: | ||
image: | ||
fields: | ||
width: | ||
type: int16 | ||
type: int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int16 maxes out at 32.767 which is why you may be having issues if your image's width was 36.000
However, I think this should be resolved if you move to int32
|
||
# Install requirements | ||
COPY requirements.txt / | ||
RUN pip3 install --no-cache-dir -r requirements.txt | ||
|
||
# Install Fondant | ||
# This is split from other requirements to leverage caching | ||
ARG FONDANT_VERSION=main | ||
ARG FONDANT_VERSION=f3f3925b8e8f634e2978e5c7fcefa72c53baba7c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GeorgesLorre do we still need to revert the docker image to main
or do they get automatically tagged as dev
during merge?
default: None | ||
dataset_length: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid having the user specify the length of the dataset and read it from the hf metadata directly
@@ -0,0 +1,159 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the reasons for duplicating this component and not making the changes directly to the one in the registry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that the download_images
component has some hardcoded column names here:
fondant/components/download_images/src/main.py
Lines 146 to 149 in 4161b19
dataframe.columns = [ | |
"images_data", | |
"images_width", | |
"images_height", |
@@ -75,7 +75,7 @@ for dir in "${components_to_build[@]}"; do | |||
|
|||
echo "Updating the image version in the fondant_component.yaml with:" | |||
echo "${full_image_names[0]}" | |||
sed -i "s|^image: .*|image: ${full_image_names[0]}|" fondant_component.yaml | |||
sed -i '' "s|^image: .*|image: ${full_image_names[0]}|" fondant_component.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the extra ''
for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was required to run the script on MacOS: https://stackoverflow.com/questions/29081799/sed-1-invalid-command-code-f
@@ -92,6 +93,9 @@ def _load_subset(self, subset_name: str, fields: t.List[str]) -> dd.DataFrame: | |||
|
|||
subset_df = dd.read_parquet(remote_path, columns=fields) | |||
|
|||
logger.info(f"First few rows of subset {subset_name}:") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is only for debugging and can be omitted
for name, subset in self.component_spec.consumes.items(): | ||
fields = list(subset.fields.keys()) | ||
subset_df = self._load_subset(name, fields) | ||
# left joins -> filter on index | ||
# make sure that dataframe has same number of partitions | ||
# as subset | ||
dataframe = dataframe.repartition(npartitions=subset_df.npartitions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have a look at this, eventually we can maybe even flag it as an issue to the Dask
repo
src/fondant/data_io.py
Outdated
@@ -196,6 +206,12 @@ def write_dataframe(self, dataframe: dd.DataFrame) -> None: | |||
|
|||
dataframe.index = dataframe.index.rename("id").astype("string") | |||
|
|||
# logging.info("Visualizing task graph...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I have custom code that I use for testing but don't want to commit I usually use
git add -p <filename>
which allows you to only commit chunks of the file, can be also done in the IDE.
Then if I want to switch to another branch and not lose the testing code, I stash the changes under a descriptive name
git stash save datacomp-scale-test
You can then fetch it back using
git stash apply <stash_id>
@@ -7,6 +7,7 @@ | |||
"int8", | |||
"int16", | |||
"int32", | |||
"int64", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be kept, I would just check that previous component can run at int32
to reduce memory footprint
Closing this PR in favor of a more up-to-date branch. |
This PR includes updates made to the Fondant code base to make it run at scale for Datacomp