Web Scraper and Cache #4565
BradKML
started this conversation in
Suggestion
Replies: 2 comments 2 replies
-
I don't think automatically building a RAG dataset out of a website would be a perfect way to get maximum knowledge accuracy. Each site is different, and manual work is still required to remove noise that we do not want in the knowledge base.
1 reply
-
With Firecrawl? In Mendable they can scrape entire sites with ease.
1 reply
-
Let's say I want to create a dataset out of a famous blog. The current solution is to use WGET, or scraper libraries like Scrapy, Selenium, Puppeteer, or Playwright.
Also, for extracting article text, Trafilatura is currently taking the lead (see also Readability and ExtractNet): https://github.com/adbar/trafilatura https://github.com/scrapinghub/article-extraction-benchmark
Can this whole function be turned into its own block such that RAG dataset creation can be less cumbersome?
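The block described above would chain two steps: crawl/fetch pages, then strip boilerplate down to the article text and emit a dataset. As a rough illustration of that shape, here is a stdlib-only sketch; the names (`ArticleTextExtractor`, `page_to_record`, `build_dataset`) are hypothetical, and a real implementation would swap in Scrapy or Playwright for fetching and Trafilatura for extraction rather than this naive `<p>`-tag parser.

```python
# Hypothetical sketch of a "scrape -> extract -> RAG dataset" block.
# Stand-ins: the extractor below only keeps <p> text; a production
# version would call e.g. trafilatura.extract() on the fetched HTML.
import json
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Naive main-content extractor: keeps text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and data.strip():
            self.chunks.append(data.strip())


def page_to_record(url: str, html: str) -> dict:
    """Turn one fetched page into a dataset record (url + cleaned text)."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    return {"url": url, "text": " ".join(parser.chunks)}


def build_dataset(pages, path="dataset.jsonl"):
    """Write (url, html) pairs as one JSONL record per page."""
    with open(path, "w", encoding="utf-8") as f:
        for url, html in pages:
            f.write(json.dumps(page_to_record(url, html)) + "\n")
```

The point of packaging this as a single block is that the cache/fetch layer, the extractor, and the output format each become swappable parameters instead of glue code rewritten per site.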