Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make download_component concurrent #354

Merged
merged 9 commits into from
Aug 17, 2023

Conversation

RobbeSneyders
Copy link
Member

This PR makes the download_images component concurrent.

This is just a quick fix, ideally we rewrite the component to use an async http client like httpx. I will pick this up as a separate PR.

Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Robbe! Left some minor comments
What's the noticed speed increase?

return img_str, width, height
return None, None, None
return id_, img_str, width, height
return id_, None, None, None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we drop the columns that are None before writing the final dataframe using ddf.dropna()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have dropped nan columns in a different component as well. Isn't it something we can handle in the base components?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only made it concurrent. Didn't want to change anything about the behavior. But I think it makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a ticket for it

"width": "images_width",
"height":"images_height"})
async def async_download():
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we utilizing the maximumx capacity if we set max_workers=20 should we set it to by based on the cores in case we go for one of the heavy node pools?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is per core. I haven't done any optimization on this yet.

@RobbeSneyders
Copy link
Member Author

Thanks Robbe! Left some minor comments What's the noticed speed increase?

Haven't done any benchmarking. I wanted to ask @NielsRogge to rerun his pipeline with this updated component.

@RobbeSneyders RobbeSneyders force-pushed the feature/concurrent-download-images branch from 5e95118 to 3c2dc31 Compare August 17, 2023 09:10
@NielsRogge NielsRogge force-pushed the feature/concurrent-download-images branch from 3c2dc31 to debb07d Compare August 17, 2023 09:48
@RobbeSneyders RobbeSneyders force-pushed the feature/concurrent-download-images branch from debb07d to c226814 Compare August 17, 2023 10:57
Copy link
Contributor

@mrchtr mrchtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RobbeSneyders! I have come across two small nitpicks.

@@ -1,6 +1,6 @@
name: Download images
description: Component that downloads images based on URLs
image: ghcr.io/ml6team/download_images:dev
image: ghcr.io/ml6team/download_images:e807f246f2f76a004522a55915c650f6af3a884d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to revert this line before merging to main.

)
user_agent_string += " (compatible; +https://github.com/ml6team/fondant)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think that this string is used anymore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will re-add it.

@RobbeSneyders RobbeSneyders force-pushed the feature/concurrent-download-images branch from c226814 to 5c3ffec Compare August 17, 2023 11:35
@RobbeSneyders RobbeSneyders force-pushed the feature/concurrent-download-images branch from 5c3ffec to f76a7c7 Compare August 17, 2023 11:37
import pandas as pd
from httpx import Response

from src.main import DownloadImagesComponent
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not using the abstract test class ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed the test setup in #335 and decided to not use it anymore. We should probably remove the abstract class and update the tests that use it.

@RobbeSneyders RobbeSneyders merged commit 970409e into main Aug 17, 2023
5 checks passed
@RobbeSneyders RobbeSneyders deleted the feature/concurrent-download-images branch August 17, 2023 13:31
GeorgesLorre pushed a commit that referenced this pull request Aug 18, 2023
This PR aligns the test setup of the `text_normalization` component with
the one of the `download_images` component introduced in #354
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This PR makes the `download_images` component concurrent.

This is just a quick fix, ideally we rewrite the component to use an
async http client like httpx. I will pick this up as a separate PR.
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This PR aligns the test setup of the `text_normalization` component with
the one of the `download_images` component introduced in #354
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants