[Commoncrawl pipeline] Add component download_commoncrawl_segments #273
Conversation
Thanks Sharon :)
Looks good overall, but I just have a small doubt regarding the current scalability of this approach.
BASE_URL = "https://data.commoncrawl.org/" | ||
|
||
|
||
def get_records_from_warc_file(warc_file: str, n_records_to_download: int) -> List: |
n_records_to_download should be an optional argument, and below you should only break if it is not None.
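For illustration, a rough sketch of how that could look. It reuses the get_warc_file_using_requests helper and the ArchiveIterator loop from this component; the exact record-extraction details are assumptions, not the PR's code:

from typing import List, Optional
from warcio.archiveiterator import ArchiveIterator

def get_records_from_warc_file(
    warc_file: str,
    n_records_to_download: Optional[int] = None,
) -> List:
    records = []
    response = get_warc_file_using_requests(warc_file)  # existing helper in this component
    for record in ArchiveIterator(response.raw, arc2warc=True):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            records.append([url, record.content_stream().read()])
        # only stop early when a limit was actually passed
        if n_records_to_download is not None and len(records) >= n_records_to_download:
            break
    return records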
        A list of webpages.
    """
    logger.info(f"Processing WARC file from segment path: {warc_file}...")
    records = []
do we risk running into out-of-memory issues if the number of extracted records is too large? right now it seems like we're collecting them all in a list and materializing them in memory before transforming them into a Dask Dataframe.
Materializing a list and then converting it to a dataframe is the recommended approach, since appending to a dataframe is a lot more expensive. We might want to do this on a partition level though.
does that mean that we need one partition per WARC file to keep the dataframes as small as possible?
I have tested this component and ran into the expected out-of-memory error.
I tried to download 1 segment with 30k records. It seems Philippe's assumption is correct: the records list didn't fit into memory in my case.
What do you think about using dask.delayed? Basically, create delayed objects with a fixed size (e.g. 1000 records each), collect them in a list, and afterwards initialise the dataframe from the delayed objects.
from dask import delayed
import dask.dataframe as dd
from warcio.archiveiterator import ArchiveIterator

def load_record(record_counter, batch_size) -> list:
    # return one batch of records with a fixed maximum size
    offset = record_counter * batch_size  # starting point to read from
    counter = 0
    records = []
    ...
    for record in ArchiveIterator(response.raw, arc2warc=True):
        if counter >= offset:
            # read content and append to records
            ...
        counter += 1
        if len(records) >= batch_size:
            break
    return records

delayed_data = []
for segment in segments:
    total_number_of_records = xxx
    batch_size = 1000
    delayed_data.extend(
        [
            delayed(load_record)(record_counter, batch_size)
            for record_counter in range(total_number_of_records // batch_size)
        ]
    )

# initialise the final dataframe; the actual download is executed lazily here
dataframe = dd.from_delayed(delayed_data)
@mrchtr That's a good idea. In fact, I'm also using chunking to create delayed objects for the load_from_files component to tackle the same issue.
url:
  type: string
html:
  type: binary
why is this not a string as well?
    Returns:
        A Dask DataFrame containing the downloaded webpages.
    """
    segment_paths = df["segment_path"].to_bag()
What is the reason to use the bag interface here?
So we can use dask map to process each WARC in parallel. Using a list caused errors when flattening the results to a dask dataframe later.
Yes, but I mean compared to the dataframe API, where we have the apply function, which behaves similarly.
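For reference, the two variants under discussion would look roughly like this. This is only a sketch: df, get_records_from_warc_file, and n_records_to_download come from the snippets above, and the meta hint is an assumption:

# bag route (current code): one bag element per segment path
segment_paths = df["segment_path"].to_bag()
records = segment_paths.map(get_records_from_warc_file, n_records_to_download)

# dataframe route: apply per row; meta tells Dask the output type up front
records = df["segment_path"].apply(
    lambda path: get_records_from_warc_file(path, n_records_to_download),
    meta=("segment_path", "object"),
)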
for record in ArchiveIterator(response.raw, arc2warc=True):
    if record.rec_type == "response":
        url = record.rec_headers.get_header("WARC-Target-URI")
        content = (
Do you think it would be possible to add an option to return plain text instead of the html?
We could add an additional parameter that controls whether html or plain text is returned. I think you don't need it for your use case, but for the large language model use case it would be super useful! :)
For the plain text transformation you could use BeautifulSoup. Some time ago I did something similar using this function:
from bs4 import BeautifulSoup

def _convert_to_plain_text(html):
    """Convert html body into plain text, making sure table rows are on separate lines."""
    soup = BeautifulSoup(html, "html.parser")
    body_content = soup.find("body")
    text = ""
    for e in body_content.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name in ['br', 'h1', 'h2', 'h3', 'h4', 'tr', 'th']:
            text += '\n'
        elif e.name in ['td']:
            text += ' '
        elif e.name == 'p' and not any(parent.name == 'table' for parent in e.parents):
            text += '\n'
        elif e.name == 'li':
            text += '\n- '
    return text
Is soup.get_text() not sufficient?
Works as well. The custom code handles tables a bit differently; basically, it tries to keep the table structure line by line.
I see the custom code keeps the CSS content in the <style> tag. I'm assuming you don't want this. Will soup.get_text() be ok for you @mrchtr? Otherwise, we'll have a long list of tags to parse.
soup.get_text() works! Even better than my custom code.
I updated the PR with the following changes:
Thanks @shayorshay, left some more comments.
@PhilippeMoussalli can you follow up on this PR?
use_s3:
  description: Whether to use S3 to download the commoncrawl segment file. Set to True if you are running this component on an AWS cluster.
  type: bool
  default: 'False'
Suggested change: default: 'False' -> default: false
Aren't you passing a string here instead of a boolean? Same for the argument below.
component_spec.json currently allows only strings and integers as default values
Let's update the schema then :)
logger = logging.getLogger(__name__)

BASE_URL = "https://data.commoncrawl.org/"
I don't think this is used.
def get_warc_file_using_requests(warc_file: str) -> requests.Response:
    retry = 0
FYI, requests provides built-in retry functionality:
https://requests.readthedocs.io/en/latest/user/advanced/?highlight=retry#example-automatic-retries
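For reference, a minimal sketch of that approach; the retry count, backoff, and status codes here are arbitrary placeholders:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

def get_warc_file_using_requests(warc_file: str) -> requests.Response:
    # the session transparently retries failed requests, so no manual retry counter is needed
    return session.get(BASE_URL + warc_file, stream=True)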
if __name__ == "__main__":
    component = DownloadCommoncrawlSegments.from_args()
This is outdated since #302. Can you rebase on main and update this?
def transform(
    self,
    df: dd.DataFrame,
    use_s3: Optional[bool] = False,
Since #302, arguments are passed to __init__ instead of transform. Can you rebase on main and update this?
if partition_size:
    df = df.repartition(partition_size=f"{partition_size}MB")

df = df.reset_index(drop=True)
Why are you resetting the index here?
"webpage_html", | ||
] | ||
|
||
if partition_size: |
I don't think we need an argument for this. This can be 250MB at all times. I think it's mainly the partition size before the apply that might be useful to provide an argument for. Then the user can define a smaller size so the data will still fit into memory after the apply.
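A rough sketch of that idea; input_partition_size is a hypothetical argument name, not something from this PR:

import dask.dataframe as dd

def repartition_for_download(df: dd.DataFrame, input_partition_size: int) -> dd.DataFrame:
    # hypothetical helper to illustrate the idea
    if input_partition_size:
        # user-controlled partition size *before* the download/apply step, so each
        # partition still fits in memory once the webpage HTML has been added
        df = df.repartition(partition_size=f"{input_partition_size}MB")
    return df

# ...after the apply, a fixed target size is enough:
# dataframe = dataframe.repartition(partition_size="250MB")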
""" | ||
try: | ||
soup = BeautifulSoup(html, "html.parser") | ||
return soup.get_text() |
Currently there seems to be a performance bottleneck in converting the page into plain text.
I have found a different library (specifically designed for this use case) that seems to be faster and more robust.
Can you check out trafilatura? https://trafilatura.readthedocs.io/en/latest/
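For reference, a minimal sketch of what using trafilatura could look like; the function name is just illustrative:

import trafilatura

def convert_to_plain_text(html: str) -> str:
    # trafilatura.extract returns the main text content, or None if nothing could be extracted
    text = trafilatura.extract(html)
    return text if text is not None else ""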
done. I'll update the PR to reflect the change.
I updated the PR to reflect the following changes:
Thanks @shayorshay. Two final comments and we can merge.
- reverted to using a dask bag for downloading webpages. compared to dask.dataframe.apply(), this is much faster when I tested them. I think this has to do with the flatten step at the end where we apply pd.Series.
Can you explain this a bit more? I wouldn't really expect a bag to be faster here. Maybe it's due to the different schedulers they use.
Can you pin all the versions here?
I retested both approaches since there have been updates on how we handle repartitioning, and you're right, the apply approach is faster than the bag. Here are the time comparisons:
- without repartition on 10k webpages: bag - 01:12, apply - 00:56
- with repartition on 10k webpages: bag - 01:19, apply - 01:11
We can use the apply approach, so I'll update the PR.
dataframe = dataframe.reset_index(drop=True)

logger.info(f"Downloaded {len(dataframe)} webpages from Commoncrawl.")
len() triggers a compute here. We might be able to use .size here instead, which is evaluated lazily, but I'm not sure how we can combine it with logging. Also fine for me to just remove this.
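For illustration, a small self-contained sketch of the difference, using made-up data rather than the component's dataframe:

import logging

import dask.dataframe as dd
import pandas as pd

logger = logging.getLogger(__name__)
dataframe = dd.from_pandas(
    pd.DataFrame({"url": ["a", "b"], "html": ["<p>x</p>", "<p>y</p>"]}), npartitions=1
)

n_rows = len(dataframe)   # eager: triggers a compute of the whole graph right away
n_cells = dataframe.size  # lazy Scalar (rows * columns); nothing runs yet
# the value still has to be computed once it is needed for the log message:
logger.info(f"Downloaded {n_cells.compute() // len(dataframe.columns)} webpages.")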
Ok, I've removed it. I'll look into .size for local testing.
Thanks for the update @shayorshay! I have tested the component in depth and tried to make some changes to improve the performance slightly.
Maybe we could merge this PR already. I think it would be easier to incorporate new changes and improve the performance stepwise.
This PR adds the second component of the Commoncrawl pipeline. The component downloads the WARC segment files and extracts the webpage URLs and HTML code, which are returned as a Dask DataFrame.