-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Commoncrawl pipeline] Add component extract free-to-use images #282
[Commoncrawl pipeline] Add component extract free-to-use images #282
Conversation
def setup(self, *args, **kwargs): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to add this method if you don't implement it.
results = [] | ||
|
||
for _, row in df.iterrows(): | ||
try: | ||
webpage_url = row[("webpage", "url")] | ||
webpage_html = row[("webpage", "html")] | ||
|
||
image_info = get_image_info_from_webpage(webpage_url, webpage_html) | ||
if image_info is not None: | ||
results.append(image_info) | ||
|
||
except Exception as e: | ||
logger.error(f"Error parsing HTML: {e}") | ||
continue | ||
|
||
flattened_results = [item for sublist in results for item in sublist] | ||
logger.info(f"Length of flattened_results: {len(flattened_results)}") | ||
|
||
df = pd.DataFrame( | ||
flattened_results, | ||
columns=[ | ||
("image", "image_url"), | ||
("image", "alt_text"), | ||
("image", "webpage_url"), | ||
("image", "license_type"), | ||
("image", "license_location"), | ||
], | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rewrite this using the .apply()
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you been able to have a look at this @shayorshay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, its done. could you take a look? @RobbeSneyders
A list of image urls and license metadata. | ||
""" | ||
|
||
soup = BeautifulSoup(webpage_html, "html.parser") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps this should go in the setup()
method? Seems like you're now initializing this on every function call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't look like BeautifulSoup
has a a parser function without the constructor so I'm leaving it as it is
@@ -0,0 +1,25 @@ | |||
name: Extract image url and license from commoncrawl | |||
description: Component that extracts image url and license metadata from a dataframe of webpage urls and html codes | |||
image: ghcr.io/ml6team/extract_image_licenses:latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rename the component folder to also be extract_image_licenses
for consistency?
result_type="expand", | ||
) | ||
.explode(0) | ||
.apply(pd.Series) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_image_info_from_webpage
returns a nested list of image urls per webpage. this flattens both and transforms the elements into rows. not sure if there's a cleaner way to do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check the typing of that function and the one it depends on? It says it returns a List[str]
, which I don't think is correct then.
result_type="expand", | ||
) | ||
.explode(0) | ||
.apply(pd.Series) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check the typing of that function and the one it depends on? It says it returns a List[str]
, which I don't think is correct then.
.apply(pd.Series) | ||
) | ||
|
||
df = df.dropna().reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you resetting the index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final dataframe keeps the id of the wepbpage but on a second thought, I think this is not needed since the image_url
is the id we want anyway. I'll remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shayorshay! Would be great if you could add a readme for the different components in your pipeline as well.
…eam#282) This 3rd component extracts the image url, alt text and license metadata from the webpage url and html code.
This 3rd component extracts the image url, alt text and license metadata from the webpage url and html code.
This 3rd component extracts the image url, alt text and license metadata from the webpage url and html code.