A Scala library to extract content from an article's HTML: title, full text, favicon, image, etc.
This project is a Scala port of Mozilla's Readability.js with a few tweaks and improvements. It targets Scala 2.12.
Import the project with Maven as follows:
<dependency>
  <groupId>com.github.ghostdogpr</groupId>
  <artifactId>readability4s</artifactId>
  <version>1.0.9</version>
</dependency>
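For sbt builds, the same coordinates can presumably be declared as shown here. This is a sketch that assumes the artifact is published under exactly the artifactId given in the Maven snippet, with no Scala-version suffix, which is why it uses a single `%` rather than `%%`:

```scala
// build.sbt
// Assumption: the artifactId is exactly "readability4s" (no _2.12 suffix),
// so the plain % operator is used instead of the cross-versioning %%.
libraryDependencies += "com.github.ghostdogpr" % "readability4s" % "1.0.9"
```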
To parse a document, you must create a new Readability object from a URI string and an HTML string, and then call parse(). Here's an example:
val article = Readability(url, htmlString).parse()
It returns an Option[Article]. It is either None when the article could not be processed, or an Article with the following properties:
uri: original URI string that was passed to the constructor
title: article title
byline: author metadata
content: HTML string of processed article content
textContent: text of processed article content
length: length of article, in characters
excerpt: article description, or short excerpt from content
faviconUrl: URL of the favicon image
imageUrl: URL of an image representing the article
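Because parse() returns an Option[Article], callers need to handle both outcomes. The sketch below shows one way to do that with pattern matching. To stay runnable without the library on the classpath, it uses a hypothetical stand-in Article case class that mirrors the properties listed above; the field types and the names ArticleSummary and summarize are assumptions made for illustration, not part of the library.

```scala
// Hypothetical stand-in mirroring the properties listed above. The field
// types are assumptions; the real class is provided by readability4s.
final case class Article(
  uri: String,
  title: String,
  byline: String,
  content: String,
  textContent: String,
  length: Int,
  excerpt: String,
  faviconUrl: String,
  imageUrl: String
)

object ArticleSummary {
  // Turn an Option[Article] into a one-line summary, covering both cases.
  def summarize(result: Option[Article]): String = result match {
    case Some(a) => s"${a.title} (${a.length} characters): ${a.excerpt}"
    case None    => "article could not be processed"
  }

  def main(args: Array[String]): Unit = {
    // With the library available, this value would come from
    // Readability(url, htmlString).parse() rather than being built by hand.
    val sample = Article(
      uri = "https://example.com/post",
      title = "Example post",
      byline = "Jane Doe",
      content = "<p>Hello</p>",
      textContent = "Hello",
      length = 5,
      excerpt = "Hello",
      faviconUrl = "https://example.com/favicon.ico",
      imageUrl = "https://example.com/cover.png"
    )
    println(summarize(Some(sample)))
    println(summarize(None))
  }
}
```

Matching on the Option keeps the failure case explicit, so a page that could not be processed never leads to an unchecked access on a missing article.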