
Marthym/scraphead


Scraphead

Scraphead scrapes the HTML behind a URL to retrieve OpenGraph, Twitter Card, and other metadata from the HTML head tag.

Description

Scraphead is divided into two modules, core and netty. The core module contains all the logic: parsing the HTML head and mapping it into the OpenGraph and Twitter Card models. The netty module is one of several possible implementations of the web client.
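To give an idea of what mapping head content into an OpenGraph model involves, here is a minimal, self-contained sketch. It is not Scraphead's implementation: the `OpenGraphSketch` class and its `extract` method are hypothetical names, and the regular expression only handles double-quoted attributes written in `property` then `content` order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of the Scraphead API.
final class OpenGraphSketch {
    // Matches <meta property="og:xxx" content="yyy"> (double quotes, this attribute order only).
    private static final Pattern OG_META = Pattern.compile(
            "<meta\\s+property=\"(og:[^\"]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    /** Collects og:* properties from the head markup, keeping the first value of each property. */
    static Map<String, String> extract(String head) {
        Map<String, String> metas = new LinkedHashMap<>();
        Matcher matcher = OG_META.matcher(head);
        while (matcher.find()) {
            metas.putIfAbsent(matcher.group(1), matcher.group(2));
        }
        return metas;
    }

    public static void main(String[] args) {
        String head = "<head>"
                + "<meta property=\"og:title\" content=\"SSE vs WebSocket\">"
                + "<meta property=\"og:type\" content=\"article\" />"
                + "<meta name=\"description\" content=\"not an OpenGraph tag\">"
                + "</head>";
        System.out.println(extract(head));
    }
}
```

A real HTML parser is more robust than a regular expression here; the sketch only shows the shape of the data that ends up in the model.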

Main features

  • Non-blocking
  • Downloads only the <head/>, not the entire HTML file
  • Multiple web client implementations available
  • Detects file encoding
  • Reads OpenGraph, Twitter Card, and more
  • Allows plugins for specific processing (depending on the domain, for example)
  • Built for Java 17 and modules
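The "download only the head" feature can be pictured as reading the response in chunks and stopping as soon as the closing `</head>` tag shows up. The sketch below illustrates that idea on a plain `InputStream`. It is an assumption about the general technique, not Scraphead's actual code: `HeadOnlyReader` and `readHead` are hypothetical names, and the bytes are decoded as ISO-8859-1 for simplicity, whereas a real implementation would detect the encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of the Scraphead API.
final class HeadOnlyReader {
    private static final String END_OF_HEAD = "</head>";

    /** Reads the stream chunk by chunk and stops once the closing head tag has been seen. */
    static String readHead(InputStream in, int chunkSize) {
        StringBuilder html = new StringBuilder();
        byte[] buffer = new byte[chunkSize];
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                // Re-scan a few characters back so a tag split across two chunks is still found.
                int searchFrom = Math.max(0, html.length() - END_OF_HEAD.length());
                // ISO-8859-1 maps one byte to one char, so a chunk boundary never splits a character.
                html.append(new String(buffer, 0, read, StandardCharsets.ISO_8859_1));
                int end = indexOfIgnoreCase(html.toString(), END_OF_HEAD, searchFrom);
                if (end >= 0) {
                    return html.substring(0, end + END_OF_HEAD.length());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html.toString(); // no closing head tag: return everything that was read
    }

    private static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = from; i <= text.length() - needle.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><title>Hi</title></head><body>large body</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readHead(new ByteArrayInputStream(page), 8));
    }
}
```

With a network stream, returning early means the rest of the body never has to be read, which is where the bandwidth saving comes from.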

Installation

<dependency>
    <groupId>fr.ght1pc9kc</groupId>
    <artifactId>scraphead-core</artifactId>
    <version>${scraphead.version}</version>
</dependency>

<dependency>
    <groupId>fr.ght1pc9kc</groupId>
    <artifactId>scraphead-netty</artifactId>
    <version>${scraphead.version}</version>
</dependency>

Usage

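With all collectors:

// Netty-based implementation of the web client (scraphead-netty)
ScrapClient scrapHttpClient = new NettyScrapClient();
// No collector is selected explicitly, so all of them are used
HeadScraper scraper = HeadScrapers.builder(scrapHttpClient).build();
// Non-blocking: the metadata is delivered asynchronously to the subscriber
scraper.scrap(URI.create("https://blog.ght1pc9kc.fr/2021/server-sent-event-vs-websocket-avec-spring-webflux.html"))
    .map(doWhatEverYouWantWithMeta)
    .subscribe();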

With a limited set of collectors:

ScrapClient scrapClient = new NettyScrapClient();
// Enable only the meta title/description and OpenGraph collectors
HeadScraper scraper = HeadScrapers.builder(scrapClient)
    .useMetaTitleAndDescr()
    .useOpengraph()
    .build();
scraper.scrap(URI.create("https://blog.ght1pc9kc.fr/2021/server-sent-event-vs-websocket-avec-spring-webflux.html"))
    .map(doWhatEverYouWantWithMeta)
    .subscribe();
