feat: enable web scraping to parse and save pdf content #474

Rob-Powell · 2024-04-25T07:22:09Z

Issue #, if available:

Description of changes:
The web crawler can now crawl and parse PDFs, in addition to its existing support for text/html content.

This change adds a content types parameter to the GUI so users can choose whether to scrape only text/html content or to also include 'application/pdf' files. It also makes it easier to add other content types in the future if desired.
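
As a rough illustration of the content-type selection described above, the sketch below filters a response by its `Content-Type` header against a user-chosen list. It is not the code from this PR: the function name `should_process` and the constant `DEFAULT_CONTENT_TYPES` are hypothetical, and it assumes the crawler only needs the media type and can ignore parameters such as `charset`.

```python
from typing import Iterable

# Hypothetical default: only HTML is crawled unless the user opts in to more.
DEFAULT_CONTENT_TYPES = ("text/html",)


def should_process(content_type_header: str,
                   allowed_types: Iterable[str] = DEFAULT_CONTENT_TYPES) -> bool:
    """Return True if the response's media type is in the allowed list.

    Parameters after ';' (for example 'charset=utf-8') are ignored and the
    comparison is case-insensitive.
    """
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in {t.strip().lower() for t in allowed_types}


if __name__ == "__main__":
    selected = ["text/html", "application/pdf"]  # e.g. chosen in the GUI
    for header in ("text/html; charset=utf-8", "application/pdf", "image/png"):
        print(header, "->", should_process(header, selected))
```

Keeping the allowed types as a plain list means supporting another content type later would mostly be a matter of adding a parser for it and extending the list.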

I also bumped the pydantic versions as suggested by Dependabot and tested them, since I was already testing this code.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Rob-Powell force-pushed the feat-crawl-pdfs branch from 2053210 to 24f77bd on May 2, 2024 03:03

feat: enable web scraping to parse and save pdf content

599c59e

Rob-Powell force-pushed the feat-crawl-pdfs branch from 24f77bd to 599c59e on May 12, 2024 09:16

Merge branch 'main' into feat-crawl-pdfs

dd1db5a

bigadsoleiman approved these changes Jun 10, 2024

bigadsoleiman merged commit d3e4336 into aws-samples:main Jun 10, 2024
1 check passed
