[Commoncrawl pipeline] Add component extract free-to-use images #282

Conversation

shayorshay (Contributor)

This third component extracts the image URL, alt text, and license metadata from the webpage URL and HTML code.

Comment on lines 50 to 51
def setup(self, *args, **kwargs):
    pass
Member:

I don't think you need to add this method if you don't implement it.

Comment on lines 60 to 87
results = []

for _, row in df.iterrows():
    try:
        webpage_url = row[("webpage", "url")]
        webpage_html = row[("webpage", "html")]

        image_info = get_image_info_from_webpage(webpage_url, webpage_html)
        if image_info is not None:
            results.append(image_info)

    except Exception as e:
        logger.error(f"Error parsing HTML: {e}")
        continue

flattened_results = [item for sublist in results for item in sublist]
logger.info(f"Length of flattened_results: {len(flattened_results)}")

df = pd.DataFrame(
    flattened_results,
    columns=[
        ("image", "image_url"),
        ("image", "alt_text"),
        ("image", "webpage_url"),
        ("image", "license_type"),
        ("image", "license_location"),
    ],
)
Member:

Can we rewrite this using the .apply() method?
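
A rough sketch of what such a rewrite could look like, reusing the names from the snippet above (get_image_info_from_webpage, logger, and the subset column names are taken from the component; this is not necessarily the final implementation, which is discussed further down):

import pandas as pd

def extract_image_info(row: pd.Series):
    # Return the list of image records for one webpage, or None on parse errors.
    try:
        return get_image_info_from_webpage(
            row[("webpage", "url")], row[("webpage", "html")]
        )
    except Exception as e:
        logger.error(f"Error parsing HTML: {e}")
        return None

# One entry per webpage; each entry is a list of image records (or None).
results = df.apply(extract_image_info, axis=1).dropna()
flattened_results = [item for sublist in results for item in sublist]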

Member:

Have you been able to have a look at this @shayorshay?

shayorshay (Contributor Author) on Jul 12, 2023:

Yes, it's done. Could you take a look? @RobbeSneyders

        A list of image urls and license metadata.
    """

    soup = BeautifulSoup(webpage_html, "html.parser")
Contributor:

Perhaps this should go in the setup() method? It seems like you're now initializing this on every function call.

Contributor Author:

It doesn't look like BeautifulSoup has a parser function without the constructor, so I'm leaving it as it is.
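
(For illustration only, not the component's code: in bs4 the parsing itself happens inside the BeautifulSoup constructor, and since every call receives different HTML there is no webpage-independent parser object to move into setup().)

from bs4 import BeautifulSoup

# Made-up HTML, just to show that parsing goes through the constructor.
html = '<img src="https://example.com/a.png" alt="example">'
soup = BeautifulSoup(html, "html.parser")
print([(img.get("src"), img.get("alt")) for img in soup.find_all("img")])
# [('https://example.com/a.png', 'example')]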

@@ -0,0 +1,25 @@
name: Extract image url and license from commoncrawl
description: Component that extracts image url and license metadata from a dataframe of webpage urls and html codes
image: ghcr.io/ml6team/extract_image_licenses:latest
Contributor:

Can you rename the component folder to also be extract_image_licenses for consistency?

        result_type="expand",
    )
    .explode(0)
    .apply(pd.Series)
Member:

What does this do?

shayorshay (Contributor Author) on Jul 12, 2023:

get_image_info_from_webpage returns a nested list of image URLs per webpage. This flattens both levels of nesting and turns the elements into rows. Not sure if there's a cleaner way to do this.
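
A toy example with made-up values, illustrating what the explode + apply(pd.Series) combination does to such nested lists (shown here on a Series for brevity; the component applies the same idea to column 0 of the expanded dataframe):

import pandas as pd

# One entry per webpage, each holding a list of [image_url, alt_text, ...] records.
nested = pd.Series([
    [["http://a/1.png", "logo"], ["http://a/2.png", "banner"]],
    [["http://b/3.png", "photo"]],
])

flat = nested.explode()        # outer level: one record (list) per row
table = flat.apply(pd.Series)  # inner level: each record becomes columns 0, 1, ...
# table now has three rows, one per image, and keeps the webpage index (0, 0, 1).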

Member:

Can you check the typing of that function and the one it depends on? It says it returns a List[str], which I don't think is correct then.
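
If the helper returns one [image_url, alt_text, webpage_url, license_type, license_location] record per image, as the column list above suggests, the annotation would presumably need to be something like the following sketch rather than List[str] (to be confirmed against the actual function):

from typing import List, Optional

def get_image_info_from_webpage(
    webpage_url: str, webpage_html: str
) -> Optional[List[List[str]]]:
    """Return one [image_url, alt_text, webpage_url, license_type,
    license_location] record per image found, or None if nothing could be extracted."""
    ...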

    .apply(pd.Series)
)

df = df.dropna().reset_index(drop=True)
Member:

Why are you resetting the index?

Contributor Author:

The final dataframe keeps the id of the webpage, but on second thought, I think this is not needed since the image_url is the id we want anyway. I'll remove it.
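
Roughly, that would leave just:

# Keep the dropna, skip the reset_index: the ("image", "image_url") column is what
# identifies each row downstream, so the positional index does not matter here.
df = df.dropna()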

RobbeSneyders (Member) left a comment:

Thanks @shayorshay! Would be great if you could add a readme for the different components in your pipeline as well.

RobbeSneyders merged commit 79df895 into ml6team:main on Jul 20, 2023
satishjasthi pushed a commit to satishjasthi/fondant that referenced this pull request on Jul 21, 2023
Hakimovich99 pushed a commit that referenced this pull request on Oct 16, 2023