[Commoncrawl pipeline] Add component extract free-to-use images #282

Conversation

shayorshay (Contributor)

This third component extracts the image URL, alt text, and license metadata from the webpage URL and HTML code.

Comment on lines 50 to 51
def setup(self, *args, **kwargs):
    pass
Member:

I don't think you need to add this method if you don't implement it.

Comment on lines 60 to 87
results = []

for _, row in df.iterrows():
    try:
        webpage_url = row[("webpage", "url")]
        webpage_html = row[("webpage", "html")]

        image_info = get_image_info_from_webpage(webpage_url, webpage_html)
        if image_info is not None:
            results.append(image_info)

    except Exception as e:
        logger.error(f"Error parsing HTML: {e}")
        continue

flattened_results = [item for sublist in results for item in sublist]
logger.info(f"Length of flattened_results: {len(flattened_results)}")

df = pd.DataFrame(
    flattened_results,
    columns=[
        ("image", "image_url"),
        ("image", "alt_text"),
        ("image", "webpage_url"),
        ("image", "license_type"),
        ("image", "license_location"),
    ],
)
Member:

Can we rewrite this using the .apply() method?
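
A rough sketch of what such a rewrite could look like, reusing the names from the snippet above (get_image_info_from_webpage, logger, and the subset column names are taken from the component; this is not necessarily the final implementation, which is discussed further down):

import pandas as pd

def extract_image_info(row: pd.Series):
    # Return the list of image records for one webpage, or None on parse errors.
    try:
        return get_image_info_from_webpage(
            row[("webpage", "url")], row[("webpage", "html")]
        )
    except Exception as e:
        logger.error(f"Error parsing HTML: {e}")
        return None

# One entry per webpage; each entry is a list of image records (or None).
results = df.apply(extract_image_info, axis=1).dropna()
flattened_results = [item for sublist in results for item in sublist]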

Member:

Have you been able to have a look at this @shayorshay?

shayorshay (Contributor Author) on Jul 12, 2023:

Yes, it's done. Could you take a look? @RobbeSneyders

        A list of image urls and license metadata.
    """

    soup = BeautifulSoup(webpage_html, "html.parser")
Contributor:

Perhaps this should go in the setup() method? It seems like you're now initializing this on every function call.

Contributor Author:

It doesn't look like BeautifulSoup has a parser function without the constructor, so I'm leaving it as it is.
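
(For illustration only, not the component's code: in bs4 the parsing itself happens inside the BeautifulSoup constructor, and since every call receives different HTML there is no webpage-independent parser object to move into setup().)

from bs4 import BeautifulSoup

# Made-up HTML, just to show that parsing goes through the constructor.
html = '<img src="https://example.com/a.png" alt="example">'
soup = BeautifulSoup(html, "html.parser")
print([(img.get("src"), img.get("alt")) for img in soup.find_all("img")])
# [('https://example.com/a.png', 'example')]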

@@ -0,0 +1,25 @@
name: Extract image url and license from commoncrawl
description: Component that extracts image url and license metadata from a dataframe of webpage urls and html codes
image: ghcr.io/ml6team/extract_image_licenses:latest
Contributor:

Can you rename the component folder to also be extract_image_licenses for consistency?

        result_type="expand",
    )
    .explode(0)
    .apply(pd.Series)
Member:

What does this do?

shayorshay (Contributor Author) on Jul 12, 2023:

get_image_info_from_webpage returns a nested list of image URLs per webpage. This flattens both levels of nesting and turns the elements into rows. Not sure if there's a cleaner way to do this.
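
A toy example with made-up values, illustrating what the explode + apply(pd.Series) combination does to such nested lists (shown here on a Series for brevity; the component applies the same idea to column 0 of the expanded dataframe):

import pandas as pd

# One entry per webpage, each holding a list of [image_url, alt_text, ...] records.
nested = pd.Series([
    [["http://a/1.png", "logo"], ["http://a/2.png", "banner"]],
    [["http://b/3.png", "photo"]],
])

flat = nested.explode()        # outer level: one record (list) per row
table = flat.apply(pd.Series)  # inner level: each record becomes columns 0, 1, ...
# table now has three rows, one per image, and keeps the webpage index (0, 0, 1).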

Member:

Can you check the typing of that function and the one it depends on? It says it returns a List[str], which I don't think is correct then.
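
If the helper returns one [image_url, alt_text, webpage_url, license_type, license_location] record per image, as the column list above suggests, the annotation would presumably need to be something like the following sketch rather than List[str] (to be confirmed against the actual function):

from typing import List, Optional

def get_image_info_from_webpage(
    webpage_url: str, webpage_html: str
) -> Optional[List[List[str]]]:
    """Return one [image_url, alt_text, webpage_url, license_type,
    license_location] record per image found, or None if nothing could be extracted."""
    ...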

    .apply(pd.Series)
)

df = df.dropna().reset_index(drop=True)
Member:

Why are you resetting the index?

Contributor Author:

The final dataframe keeps the id of the webpage, but on second thought, I think this is not needed since the image_url is the id we want anyway. I'll remove it.
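
Roughly, that would leave just:

# Keep the dropna, skip the reset_index: the ("image", "image_url") column is what
# identifies each row downstream, so the positional index does not matter here.
df = df.dropna()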

RobbeSneyders (Member) left a comment:

Thanks @shayorshay! Would be great if you could add a readme for the different components in your pipeline as well.

RobbeSneyders merged commit 79df895 into ml6team:main on Jul 20, 2023
satishjasthi pushed a commit to satishjasthi/fondant that referenced this pull request on Jul 21, 2023
Hakimovich99 pushed a commit that referenced this pull request on Oct 16, 2023