[Proposal]: use case for SG AI #524

Open
DiTo97 opened this issue May 18, 2024 · 1 comment
Labels
proposal You want to address a specific problem? Let us know about your idea.

Comments


DiTo97 commented May 18, 2024

Problem statement

A lot of manual work and tuning goes into every publisher that is currently maintained, and each one still requires constant monitoring in case anything changes in the supported news outlets or web sources.

Solution

Replace the manual, labour-intensive scraping code with SG AI, whose you-only-scrape-once (YOSO) concept serves exactly that purpose: you write the scraping pipeline once and rely on LLMs (open-source or closed-source) to extract the articles in the desired format, regardless of the web source or of changes to its HTML over time.

Write a single smart scraper graph tailored to crawling news and articles in the desired relational format, common to all available publishers and outlets.
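
To make the idea concrete, here is a minimal sketch of what "one pipeline, one schema, any publisher" could look like. It deliberately does not use SG AI's actual API: the `Article` schema, `build_prompt`, `extract_article`, and the `llm` callable are all hypothetical names for illustration, and `fake_llm` is a stub standing in for whatever model or scraper graph would really be plugged in.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Callable, List, Optional


@dataclass
class Article:
    """Hypothetical common schema shared by every publisher."""
    title: str
    body: str
    authors: List[str] = field(default_factory=list)
    publishing_date: Optional[str] = None  # e.g. an ISO 8601 string, if found


def build_prompt(html: str) -> str:
    """Build one publisher-agnostic extraction prompt from the schema."""
    keys = ", ".join(f.name for f in fields(Article))
    return (
        "Extract the news article from the HTML below. "
        f"Answer with a single JSON object with the keys: {keys}.\n\n"
        f"{html}"
    )


def extract_article(html: str, llm: Callable[[str], str]) -> Article:
    """Same steps for any source: prompt the model, parse JSON, validate."""
    data = json.loads(llm(build_prompt(html)))
    known = {f.name for f in fields(Article)}
    # Unexpected keys are dropped; a missing required key raises TypeError.
    return Article(**{k: v for k, v in data.items() if k in known})


def fake_llm(prompt: str) -> str:
    """Stub model (assumption): returns a fixed JSON answer for any prompt."""
    return json.dumps(
        {"title": "Example", "body": "Text.", "authors": ["A. Writer"], "extra": 1}
    )


article = extract_article("<h1>Example</h1><p>Text.</p>", fake_llm)
```

The point of the design is that nothing in `extract_article` is specific to a publisher: only the HTML changes between sources, while the schema and prompt stay fixed. Whether an LLM-backed extractor fills that schema reliably enough is exactly what a benchmark would have to show.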

Draft

Open Questions

No response

@DiTo97 DiTo97 added the proposal You want to address a specific problem? Let us know about your idea. label May 18, 2024
@DiTo97 DiTo97 changed the title [Proposal]: perfect use case for SG AI [Proposal]: use case for SG AI May 18, 2024
Collaborator

MaxDall commented May 20, 2024

Hey @DiTo97, thanks for the proposal :)

Fundus uses manually written heuristics to optimize for accuracy and recall. Our library aims to yield artifact-free extractions for every supported publisher. I will give SG AI a shot and see how it scores on our benchmark.
