attribute_guidelines.md

This document describes which parsable attribute of a parser class represents which semantic part of a news article. Consistency between publishers and parsers is a main goal, so please report any case you consider inconsistent with this document. If you want to contribute a parser to this library, please ensure that its attributes are named according to the table below.

NOTE: There are certain utility functions to aid you with parsing. These can be found under fundus/parser/utility.py. We highly recommend using them.

The following table lists Fundus' core attributes and includes the name of the corresponding utility function. Those attributes will be validated with unit tests when used.

NOTE: If you want to bypass validation, you can set the `validate` parameter of the attribute decorator to `False`.
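To illustrate the idea behind the `validate` parameter, here is a minimal, self-contained sketch. It does not import Fundus and is not Fundus' implementation: the `attribute` function below is a hypothetical stand-in, and `ExampleParser`, `teaser` and `validated_attributes` are names made up for this example. It only shows how a decorator flag can exclude an attribute from validation.

```python
from typing import Any, Callable, List, Optional


def attribute(func: Optional[Callable[..., Any]] = None, *, validate: bool = True):
    """Hypothetical stand-in (not Fundus' code): tag a method as a parser attribute."""

    def mark(f: Callable[..., Any]) -> Callable[..., Any]:
        # Remember on the function object whether it should be validated.
        f.__validate__ = validate
        return f

    # Support both the bare form `@attribute` and the call form `@attribute(validate=False)`.
    return mark(func) if func is not None else mark


class ExampleParser:
    @attribute
    def title(self) -> Optional[str]:
        return "Example headline"

    @attribute(validate=False)  # opted out of validation
    def teaser(self) -> Optional[str]:
        return "Publisher-specific field"


def validated_attributes(parser_cls: type) -> List[str]:
    """Return the names of all methods that are tagged for validation."""
    return sorted(
        name
        for name, member in vars(parser_cls).items()
        if getattr(member, "__validate__", False)
    )
```

In this sketch, `validated_attributes(ExampleParser)` lists only `title`, because `teaser` was declared with `validate=False`.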

Attributes table

| Name | Description | Type | Utility function |
| --- | --- | --- | --- |
| `title` | A string representing the headline of a given article. It does not include subheaders and aims to be as short as possible. | `Optional[str]` | |
| `body` | An object of type `ArticleBody` representing the structural hierarchy of the article content. | `Optional[ArticleBody]` | `extract_article_body_with_selector` |
| `authors` | A list of strings representing entities related to the creation of the article. We prefer the most precise description available in the provided information. In this context human entities are considered most precise, but we make no promise that any particular string represents a human. Parsers are encouraged to strip the strings of any information besides the name. | `List[str]` | `generic_author_parsing` |
| `publishing_date` | The earliest release date provided by the publisher. It is not required to be timezone-aware. The date must include at least year, month, day, hours and minutes. | `Optional[datetime]` | `generic_date_parsing` |
| `topics` | A list of unique strings representing keywords provided by the publisher to describe the article content. Stripping whitespace and similar clean-up is encouraged, but reformatting is not. | `List[str]` | `generic_topic_parsing` |
| `free_access` | A boolean that is set to `False` if the article is restricted to users with a subscription. This usually indicates that the article cannot be crawled completely. This attribute is implemented by default. | `bool` | |
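
The contracts for `topics`, `authors` and `publishing_date` can be sketched with plain standard-library Python. The helpers below are hypothetical illustrations written for this document, not the utility functions from `fundus/parser/utility.py`. The author prefixes and the ISO-like date format are assumptions chosen for the example.

```python
from datetime import datetime
from typing import Iterable, List


def unique_topics(raw: Iterable[str]) -> List[str]:
    """Strip whitespace and drop duplicates, keeping order and original casing."""
    seen = set()
    result = []
    for topic in raw:
        cleaned = topic.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


def strip_author_prefix(raw: str) -> str:
    """Remove a leading byline word so that only the name remains."""
    cleaned = raw.strip()
    # Assumed prefixes for illustration; real publishers use many more variants.
    for prefix in ("By ", "by ", "Von ", "von "):
        if cleaned.startswith(prefix):
            return cleaned[len(prefix):].strip()
    return cleaned


def has_minute_precision(raw: str) -> bool:
    """Check that a string starts with year, month, day, hours and minutes.

    Assumes the ISO-like layout `YYYY-MM-DDTHH:MM`; anything after the
    minutes (seconds, UTC offset) is ignored.
    """
    try:
        datetime.strptime(raw[:16], "%Y-%m-%dT%H:%M")
    except ValueError:
        return False
    return True
```

For instance, `unique_topics([" Politics ", "Politics", "EU"])` keeps `Politics` and `EU` in their original casing, and `has_minute_precision("2023-05-01")` is `False` because the time component is missing.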