This document describes which parsable attributes of a parser class represent which semantic piece of a given news article. Consistency between publishers and parsers is a main goal, so please report any cases you deem inconsistent with this document. If you want to contribute a parser to this library, please ensure that these attributes are named consistently.
NOTE: There are certain utility functions to aid you with parsing. These can be found under `fundus/parser/utility.py`. We highly recommend using them.
The following table lists Fundus' core attributes and includes the name of the corresponding utility function. Those attributes will be validated with unit tests when used.
NOTE: If you want to bypass validation, you can set the `validate` parameter of the `attribute` decorator to `False`.
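
To illustrate the idea of a per-attribute validation switch, here is a minimal, self-contained sketch. Note that the `attribute` decorator defined below is a toy stand-in written for this example, not Fundus' actual implementation, and `ExampleParser`, `teaser`, and `validated_attribute_names` are hypothetical names. It only shows how a `validate` flag could be recorded on a method and later used to decide which attributes get checked.

```python
from typing import Any, Callable, List, Optional


# Hypothetical stand-in, NOT Fundus' real `attribute` decorator.
def attribute(func: Optional[Callable[..., Any]] = None, *, validate: bool = True):
    """Mark a method as a parsable attribute and record its validate flag."""

    def mark(f: Callable[..., Any]) -> Callable[..., Any]:
        f.is_attribute = True
        f.validate = validate
        return f

    # Support both `@attribute` and `@attribute(validate=False)`.
    return mark(func) if func is not None else mark


class ExampleParser:
    @attribute
    def title(self) -> Optional[str]:
        # Core attribute: stays subject to validation.
        return "Example headline"

    @attribute(validate=False)
    def teaser(self) -> Optional[str]:
        # Hypothetical publisher-specific attribute that opts out of validation.
        return "Publisher-specific field"


def validated_attribute_names(parser_cls: type) -> List[str]:
    """Return the names of all marked attributes that have validation enabled."""
    return sorted(
        name
        for name, member in vars(parser_cls).items()
        if getattr(member, "is_attribute", False) and member.validate
    )
```

With this sketch, `validated_attribute_names(ExampleParser)` contains `title` but not `teaser`.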
Name | Description | Type | Utility function |
---|---|---|---|
title | A string representing the headline of a given article. It does not include subheaders and aims to be as short as possible. | `Optional[str]` | |
body | An object of type `ArticleBody` representing the structural hierarchy of the article content. | `Optional[ArticleBody]` | `extract_article_body_with_selector` |
authors | A list of strings representing entities related to the creation of the article. We prefer the most precise description out of the provided information. In this context, human entities are considered most precise, but we make no promise that any particular string represents a human. Parsers are encouraged to strip strings of additional information besides the name. | `List[str]` | `generic_author_parsing` |
publishing_date | The earliest release date provided by the publisher. It is not required to be timezone-aware. The date must at least include year, month, day, hours and minutes. | `Optional[datetime]` | `generic_date_parsing` |
topics | A list of unique strings representing keywords provided by the publisher to describe the article content. Stripping of whitespace etc. is encouraged, but formatting is not. | `List[str]` | `generic_topic_parsing` |
free_access | A boolean which is set to `False` if the article is restricted to users with a subscription. This usually indicates that the article cannot be crawled completely. This attribute is implemented by default. | `bool` | |
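
As a rough illustration of the `topics` guideline above (unique strings, whitespace stripped, no further formatting), the following sketch may help. The helper `normalize_topics` is a hypothetical name introduced for this example and does not reproduce the behaviour of `generic_topic_parsing`. It only demonstrates the stated rule on a plain list of strings.

```python
from typing import Iterable, List


def normalize_topics(raw_topics: Iterable[str]) -> List[str]:
    """Strip whitespace, drop empty strings, and remove exact duplicates.

    First-seen order is kept and casing is left untouched, since
    reformatting the publisher's keywords is not encouraged.
    """
    seen = set()
    topics: List[str] = []
    for raw in raw_topics:
        topic = raw.strip()
        if topic and topic not in seen:
            seen.add(topic)
            topics.append(topic)
    return topics


print(normalize_topics(["  Politics ", "Economy", "Politics", "", "economy"]))
# prints ['Politics', 'Economy', 'economy']
```

Because casing is preserved, `Economy` and `economy` are treated as distinct keywords in this sketch. Whether a parser should merge such variants is a design decision this example deliberately leaves open.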