Make download_component concurrent #354
Thanks Robbe! Left some minor comments
What's the noticed speed increase?
return img_str, width, height
return None, None, None
return id_, img_str, width, height
return id_, None, None, None
Should we drop the columns that are None before writing the final dataframe, using ddf.dropna()?
I have dropped nan columns in a different component as well. Isn't it something we can handle in the base components?
I only made it concurrent. Didn't want to change anything about the behavior. But I think it makes sense.
I'll make a ticket for it
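For reference, a minimal sketch of what dropping the failed downloads could look like. It uses plain pandas rather than Dask so it runs standalone; the helper name `drop_failed_downloads` and the `images_data` column are assumptions for illustration, while `images_width` and `images_height` come from the diff. Note that `dropna()` removes rows containing missing values by default, which is what a failed download (`id_, None, None, None`) produces.

```python
import pandas as pd


def drop_failed_downloads(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop rows where the download returned None.

    dropna() works row-wise by default, so any row with a missing
    value in one of its columns is removed.
    """
    return df.dropna()


# Toy dataframe: the row with id "b" represents a failed download.
example = pd.DataFrame(
    {
        "images_data": [b"abc", None, b"def"],  # column name is an assumption
        "images_width": [10, None, 20],
        "images_height": [5, None, 8],
    },
    index=["a", "b", "c"],
)
cleaned = drop_failed_downloads(example)
```

Whether this belongs in each component or in the base components is the open question for the ticket.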
"width": "images_width",
"height":"images_height"})
async def async_download():
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
Are we utilizing the maximum capacity if we set max_workers=20?
Should we set it based on the number of cores, in case we go for one of the heavy node pools?
This is per core. I haven't done any optimization on this yet.
Haven't done any benchmarking. I wanted to ask @NielsRogge to rerun his pipeline with this updated component.
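To make the discussion concrete, here is a rough sketch of the thread-pool pattern with `max_workers` exposed as a tunable parameter. It is not the component's actual code: `download_all` and the injected `fetch` callable are hypothetical names, and the real component returns image data plus width and height rather than raw bytes. The ID travels with each task because results arrive in completion order, not submission order.

```python
import concurrent.futures
from typing import Callable, Dict, Iterable, Optional, Tuple


def download_all(
    items: Iterable[Tuple[str, str]],
    fetch: Callable[[str], bytes],
    max_workers: int = 20,
) -> Dict[str, Optional[bytes]]:
    """Hypothetical helper: run `fetch` for every (id, url) pair in a thread pool.

    Returns a dict mapping id -> downloaded bytes, or None when the
    download raised an exception.
    """
    results: Dict[str, Optional[bytes]] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which id belongs to which future, since futures
        # complete in arbitrary order.
        future_to_id = {executor.submit(fetch, url): id_ for id_, url in items}
        for future in concurrent.futures.as_completed(future_to_id):
            id_ = future_to_id[future]
            try:
                results[id_] = future.result()
            except Exception:
                results[id_] = None
    return results
```

Since downloading is I/O-bound, the thread count does not have to match the core count; the right value would have to come out of benchmarking.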
5e95118 to 3c2dc31
3c2dc31 to debb07d
debb07d to c226814
Thanks @RobbeSneyders! I have come across two small nitpicks.
@@ -1,6 +1,6 @@
name: Download images
description: Component that downloads images based on URLs
image: ghcr.io/ml6team/download_images:dev
image: ghcr.io/ml6team/download_images:e807f246f2f76a004522a55915c650f6af3a884d
We have to revert this line before merging to main.
)
user_agent_string += " (compatible; +https://github.com/ml6team/fondant)"
Don't think that this string is used anymore.
Good catch, will re-add it.
c226814 to 5c3ffec
5c3ffec to f76a7c7
import pandas as pd
from httpx import Response

from src.main import DownloadImagesComponent
Not using the abstract test class?
We discussed the test setup in #335 and decided to not use it anymore. We should probably remove the abstract class and update the tests that use it.
This PR aligns the test setup of the `text_normalization` component with that of the `download_images` component introduced in #354
This PR makes the `download_images` component concurrent. This is just a quick fix; ideally we rewrite the component to use an async HTTP client like httpx. I will pick this up as a separate PR.