claimreview-data

This repository contains a dataset of claims and their corresponding fact-checks.

The data is updated automatically every day with the latest ClaimReviews.

Collection

The data collection is performed in 6 steps:

1. Collection of candidate ClaimReview URLs: using DataCommons and the Google Fact-Check API, we get all the URLs where ClaimReviews are published
2. Collection of ClaimReviews from fact-checkers: we recollect the ClaimReviews from the fact-checkers' websites
3. Validation and Cleaning: we fix and clean the metadata
4. Ratings Mapping: we normalise the labels to credible, mostly credible, uncertain, mostly non-credible, non-credible and unknown
5. Occurrences Extraction and Unshortening: we extract the URLs where the claims occur and resolve the links that use shortening services (e.g., bit.ly) or archives (e.g., archive.is)
6. Misinformation Database and Snapshot: we build the output files described below
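
As an illustration of the Ratings Mapping step, here is a minimal sketch of how free-text fact-checker labels could be normalised to the six categories listed above. The lookup table and the function name normalise_rating are hypothetical and are not taken from the collector; the real mapping is likely larger and may also use numeric rating values.

```python
# The six normalised categories named in the Ratings Mapping step.
CREDIBILITY_LABELS = (
    "credible",
    "mostly credible",
    "uncertain",
    "mostly non-credible",
    "non-credible",
    "unknown",
)

# Hypothetical lookup table, for illustration only.
TEXT_TO_LABEL = {
    "true": "credible",
    "mostly true": "mostly credible",
    "half true": "uncertain",
    "mixture": "uncertain",
    "mostly false": "mostly non-credible",
    "false": "non-credible",
    "pants on fire": "non-credible",
}


def normalise_rating(raw_label):
    """Map a fact-checker's free-text label to one of CREDIBILITY_LABELS."""
    if not raw_label:
        return "unknown"
    # Lower-case and collapse whitespace so "  Mostly  True " matches "mostly true".
    key = " ".join(raw_label.lower().split())
    # Anything not in the table falls back to "unknown".
    return TEXT_TO_LABEL.get(key, "unknown")
```

Labels that the table does not recognise fall back to unknown rather than being guessed, which keeps the mapping conservative.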

The collection process is run by claimreview_collector_full from the MisinfoMe project.
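
The unshortening part of the Occurrences Extraction and Unshortening step can be sketched as follows. This is only an assumption about how such a step might work, not the collector's code: the domain lists, link_kind, follow_redirects and unshorten are hypothetical names, and archive links are left untouched here because recovering the original URL from an archive usually needs service-specific handling.

```python
import urllib.request
from urllib.parse import urlparse

# Hypothetical domain lists; the real collector may use different ones.
SHORTENERS = {"bit.ly", "t.co", "tinyurl.com"}
ARCHIVES = {"archive.is", "archive.ph", "web.archive.org"}


def link_kind(url):
    """Classify a URL as 'shortener', 'archive' or 'direct' by its host."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in SHORTENERS:
        return "shortener"
    if host in ARCHIVES:
        return "archive"
    return "direct"


def follow_redirects(url, timeout=10):
    """Return the final URL after HTTP redirects (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


def unshorten(url, resolver=follow_redirects):
    """Resolve shortened links; return every other URL unchanged."""
    if link_kind(url) == "shortener":
        return resolver(url)
    return url
```

Passing the resolver as a parameter makes it possible to try the logic without network access, for example unshorten("https://bit.ly/abc", resolver=lambda u: "https://example.org/a").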

Each archive contains the following files:

ifcn_sources.json: details of the fact-checkers present in the dataset, such as website, country and language. They also include the IFCN compliance details: date of issue, expiration date, and adherence to each of the skills (e.g. transparency of sourcing, transparency of methodology).
claim_reviews_raw.json: the ClaimReviews collected in the first step from DataCommons and Google Fact-Check Tools. This file is bigger than the final cleaned dataset (because of recollection issues) but contains uncleaned data (the appearance and firstAppearance fields).
claim_reviews_recollected.json: the dataset recollected from the fact-checkers' websites in the second step.
claim_reviews.json: the final cleaned dataset with ratings mapping and unshortening.
claim_labels_mapping.json: statistics on how the original labels have been mapped to the normalised ratings.
disagreeing_reviews.json: cases where the same URL has received multiple disagreeing ratings.
not_ifcn_sources.json: a list of domains that publish ClaimReviews but are not in the IFCN list.
links_all_full.json: details of the reviewed URLs. For each URL that has been reviewed, we show the reviews and the normalised ratings.
stats.json: statistics of data collection.
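
To make the idea behind disagreeing_reviews.json concrete, here is a minimal sketch of how reviews could be grouped by reviewed URL to find disagreeing ratings. The record layout is an assumption: the keys "url" and "label" are hypothetical and may not match the field names in the actual files, and ignoring unknown ratings is a choice made for this sketch only.

```python
from collections import defaultdict


def find_disagreements(reviews):
    """Return {url: sorted labels} for URLs with more than one distinct rating.

    Each review is assumed to be a dict with hypothetical keys "url" and
    "label", where "label" is one of the normalised ratings.
    """
    labels_by_url = defaultdict(set)
    for review in reviews:
        # Assumption of this sketch: "unknown" does not contradict other ratings.
        if review["label"] != "unknown":
            labels_by_url[review["url"]].add(review["label"])
    return {
        url: sorted(labels)
        for url, labels in labels_by_url.items()
        if len(labels) > 1
    }


# Small in-memory sample with made-up URLs.
sample_reviews = [
    {"url": "https://example.org/a", "label": "credible"},
    {"url": "https://example.org/a", "label": "non-credible"},
    {"url": "https://example.org/b", "label": "credible"},
    {"url": "https://example.org/b", "label": "credible"},
    {"url": "https://example.org/c", "label": "uncertain"},
    {"url": "https://example.org/c", "label": "unknown"},
]
disagreements = find_disagreements(sample_reviews)
```

With the sample above, only https://example.org/a is reported, because it is the only URL that received two different known ratings.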

This data is currently used by MisinfoMe.
