Web Scraper and Cache #4565
BradKML
started this conversation in
Suggestion
Replies: 2 comments 2 replies
-
I don't think automatically building a RAG dataset out of a website would be a perfect way to get maximum knowledge accuracy. Each site is different, and manual work is still required to remove noise that we do not want in the knowledge base.
1 reply
-
With Firecrawl? In Mendable they can scrape entire sites with ease.
1 reply
-
Let's say I want to create a dataset out of a famous blog. The current solution is to use WGET, or scraper libraries like Scrapy, Selenium, Puppeteer, or Playwright.
Also, for extracting article text, Trafilatura is currently taking the lead (see also Readability and ExtractNet): https://github.com/adbar/trafilatura https://github.com/scrapinghub/article-extraction-benchmark
Can this whole function be turned into its own block such that RAG dataset creation can be less cumbersome?
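The block described above would chain two steps: crawl/fetch pages, then strip boilerplate down to the article text and emit a dataset. As a rough illustration of that shape, here is a stdlib-only sketch; the names (`ArticleTextExtractor`, `page_to_record`, `build_dataset`) are hypothetical, and a real implementation would swap in Scrapy or Playwright for fetching and Trafilatura for extraction rather than this naive `<p>`-tag parser.

```python
# Hypothetical sketch of a "scrape -> extract -> RAG dataset" block.
# Stand-ins: the extractor below only keeps <p> text; a production
# version would call e.g. trafilatura.extract() on the fetched HTML.
import json
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Naive main-content extractor: keeps text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and data.strip():
            self.chunks.append(data.strip())


def page_to_record(url: str, html: str) -> dict:
    """Turn one fetched page into a dataset record (url + cleaned text)."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    return {"url": url, "text": " ".join(parser.chunks)}


def build_dataset(pages, path="dataset.jsonl"):
    """Write (url, html) pairs as one JSONL record per page."""
    with open(path, "w", encoding="utf-8") as f:
        for url, html in pages:
            f.write(json.dumps(page_to_record(url, html)) + "\n")
```

The point of packaging this as a single block is that the cache/fetch layer, the extractor, and the output format each become swappable parameters instead of glue code rewritten per site.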