feat: enable web scraping to parse and save pdf content #474

Merged
merged 2 commits into aws-samples:main on Jun 10, 2024

Conversation

Rob-Powell
Contributor

Issue #, if available:

Description of changes:
Adds the ability for the web crawler to crawl and parse PDFs, in addition to the existing capability to crawl text/html files.

This change adds a content-types parameter to the GUI so users can decide whether only text/html content is scraped or whether 'application/pdf' files are included as well. It also makes it easier to add other content types in the future if desired.
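The content-type filter described above could be sketched as follows. This is an illustrative example, not the PR's actual implementation; the function and constant names are hypothetical, and the stdlib `email.message.Message` parser is used here only to strip parameters such as `charset` from the `Content-Type` header.

```python
from email.message import Message

# Hypothetical defaults; the PR's GUI parameter would extend this list.
DEFAULT_CONTENT_TYPES = ["text/html"]

def should_scrape(content_type_header: str, enabled_types: list[str]) -> bool:
    """Return True if the response's Content-Type is one the user enabled."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    media_type = msg.get_content_type()  # drops parameters like "; charset=utf-8"
    return media_type in enabled_types

# With PDFs enabled via the new parameter:
enabled = DEFAULT_CONTENT_TYPES + ["application/pdf"]
print(should_scrape("application/pdf", enabled))            # PDF now included
print(should_scrape("text/html; charset=utf-8", enabled))   # parameters ignored
print(should_scrape("image/png", enabled))                  # still skipped
```

Dispatching on the normalized media type (rather than the raw header string) keeps the check robust to charset parameters and makes adding further types a one-line change.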

I also bumped and tested the Pydantic versions per Dependabot, since I was already testing this code.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@bigadsoleiman bigadsoleiman merged commit d3e4336 into aws-samples:main Jun 10, 2024
1 check passed