claimskg/claim_topics_dataset

claim_topics_dataset

This repository contains the claim topics dataset extracted from ClaimsKG. It documents the procedure used to extract the topic sample from ClaimsKG, to transform the samples into individual annotator files, and to reconcile divergences between annotators to produce the final dataset.

We provide the original annotations of each annotator and the reconciled gold standard dataset.

Extraction from ClaimsKG

Normalization of keywords with thesauri

The linking of claims, headlines and reviews to DBpedia entities is performed in ClaimsKG (with TagMe) during the extraction step from fact-checking sites. Although this provides some degree of normalization, the coverage is fairly low and the entities are not of a consistent granularity. For this reason, an additional normalization step annotates the keywords with high-level social science thesauri (TheSoz and the UNESCO thesaurus). This makes it possible to overlay a concept hierarchy on top of the keywords, in order to distinguish between higher-level and lower-level concepts and to normalize similar keywords into single entities.

The annotation is performed with a dictionary-matching approach that is insensitive to minor surface morphological variation and to word order within compounds. It is very similar to the approach of MGREP, which is used in the NCBO BioPortal Annotator. The implementation of this reconciliation procedure is integrated directly in the upstream ClaimsKG generator program and is enabled with the
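The matching idea described above can be illustrated with a minimal Python sketch. It is not the ClaimsKG or MGREP implementation: the normalization rules (lowercasing, crude plural stripping, token sorting), the function names and the thesaurus entries are all assumptions made for illustration.

```python
import re


def normalize(term):
    """Lowercase, strip a plural-like trailing 's' from longer tokens, sort tokens."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    # Crude stand-in for morphological normalization (assumption, not MGREP's rules).
    stems = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    # Sorting the tokens makes the key independent of word order in compounds.
    return tuple(sorted(stems))


def build_index(thesaurus):
    """Map each normalized thesaurus label to its concept identifier."""
    return {normalize(label): concept for label, concept in thesaurus.items()}


def annotate(keyword, index):
    """Return the concept matching the keyword, or None if there is no match."""
    return index.get(normalize(keyword))


# Hypothetical labels and identifiers, for illustration only.
thesaurus = {"health policy": "concept:1", "election": "concept:2"}
index = build_index(thesaurus)
```

With this index, `annotate("Policy, Health", index)` and `annotate("Elections", index)` both resolve to a concept, while a keyword with no corresponding label returns `None`.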

We provide a Turtle version of this normalized ClaimsKG, which can be used to reproduce the dataset. The easiest way to load the dataset and run queries against it is to use the Virtuoso Docker image and to place the Turtle file in the toLoad directory of the mounted base data directory, as described here: https://hub.docker.com/r/tenforce/virtuoso/.

Target Concepts

On the basis of topic_counts.html, we selected the most frequent top-level concepts from the thesauri. When concepts overlapped, the concept from the TheSoz thesaurus was preferred.
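The selection rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format, the function name `select_concepts`, and the definition of "overlapping" as "same label, ignoring case" are hypothetical, and the actual selection was made from topic_counts.html.

```python
def select_concepts(counts, k):
    """Pick the k most frequent concepts, preferring TheSoz when labels overlap.

    counts: iterable of (thesaurus, label, frequency) tuples (assumed format).
    """
    best = {}
    for thesaurus, label, freq in counts:
        key = label.lower()  # assumption: overlap means identical label, ignoring case
        current = best.get(key)
        # Keep the first entry seen, unless a TheSoz entry can replace a non-TheSoz one.
        if current is None or (thesaurus == "TheSoz" and current[0] != "TheSoz"):
            best[key] = (thesaurus, label, freq)
    ranked = sorted(best.values(), key=lambda c: c[2], reverse=True)
    return ranked[:k]
```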

The concepts retained are the following:

For each concept, we can extract all the corresponding claims, with all associated meta-information, by using a SPARQL query. The query filters out some non-thematic keywords with explicit regular expressions, retains only English-language claims, and excludes Africa Check claims, for which the keywords are very noisy. The query is given below, with URI_OF_THE_CONCEPT standing for the URI of the target concept.

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX thesoz: <http://lod.gesis.org/thesoz/>
PREFIX unesco: <http://vocabularies.unesco.org/thesaurus/>
PREFIX schema: <http://schema.org/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?claim ?review_url (str(?claim_author_name) AS ?claimer) ?text (str(?headline_r) AS ?headline) ?keywords ?claim_date
WHERE {
  # Collect the keywords of each claim, dropping the labels matched by the filter.
  {
    SELECT ?claim (GROUP_CONCAT(?kwlabel; separator=",") AS ?keywords) WHERE {
      ?claim schema:keywords ?keyword .
      ?keyword schema:name ?kwlabel .
      FILTER (!regex(str(?kwlabel), "immigration|education|economy|taxes|health care|Public Health|ASP Article", "i"))
    } GROUP BY ?claim
  }

  # Keywords of the claim that are linked to a thesaurus concept.
  ?claim schema:keywords ?keyword .
  ?keyword dct:about ?kwc .
  ?keyword schema:name ?kwlabel .
  ?kwc skos:prefLabel ?kwcl_r .

  # Claim text, author and publication date.
  ?claim schema:text ?text_r .
  ?claim schema:author ?claim_author .
  ?claim_author schema:name ?claim_author_name .
  ?claim schema:datePublished ?claim_date .

  # Review of the claim: reviewing organization, headline and URL.
  ?cr schema:itemReviewed ?claim .
  ?cr schema:author ?author .
  ?cr schema:headline ?headline_r .
  ?cr schema:url ?review_url .

  BIND (str(?text_r) AS ?text)
  FILTER (lang(?kwcl_r) = 'en')
  # Replace URI_OF_THE_CONCEPT with the URI of the target concept.
  FILTER (regex(str(?kwc), "URI_OF_THE_CONCEPT", "i"))
  # Exclude claims reviewed by Africa Check.
  FILTER (!regex(str(?author), "http://data.gesis.org/claimskg/organization/africacheck"))
}

We ran this query through the Virtuoso SPARQL interface to get the results as a TSV file for each concept. The set of all seven TSV files was used as the basis for the sampling and for the generation of the annotation files. These files are in the extracted_claims directory.
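The per-concept runs can also be scripted. The following Python sketch assumes a local Virtuoso endpoint (for example http://localhost:8890/sparql) that honours the `text/tab-separated-values` Accept header; the endpoint URL, the function names and the output path are assumptions, and the original results were obtained through the web interface. Because the placeholder sits inside a regular expression, the sketch also escapes the concept URI, which the query as written does not do.

```python
import re
import urllib.parse
import urllib.request

PLACEHOLDER = "URI_OF_THE_CONCEPT"


def fill_concept(query_template, concept_uri):
    """Insert a concept URI into the query, escaped for use inside regex()."""
    # Escape regex metacharacters, then double the backslashes so that they
    # survive inside a SPARQL string literal.
    escaped = re.escape(concept_uri).replace("\\", "\\\\")
    return query_template.replace(PLACEHOLDER, escaped)


def build_request(endpoint, query):
    """Build a SPARQL protocol POST request that asks for TSV results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint, data=data, headers={"Accept": "text/tab-separated-values"}
    )


def fetch_tsv(endpoint, query, path):
    """Run the query and write the raw TSV response to path."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling `fetch_tsv` once per concept URI, with the query text above as the template, would produce one TSV file per concept.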

Annotation Protocol

The annotation files for each annotator are provided in individual_annotations. Each CSV annotation file was annotated in a local spreadsheet program, by putting any symbol in the column for each relevant concept.
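Since any symbol in a concept column counts as a positive annotation, an annotation file can be read into one label set per claim. The sketch below is a minimal Python illustration; the column names (`claim`, `elections`, `taxes`) are assumptions and may differ from the headers of the files in individual_annotations.

```python
import csv
import io


def read_annotations(handle, concept_columns, id_column="claim"):
    """Return {claim identifier: frozenset of concepts marked by the annotator}."""
    labels = {}
    for row in csv.DictReader(handle):
        # Any non-blank cell in a concept column counts as a mark.
        marked = {c for c in concept_columns if (row.get(c) or "").strip()}
        labels[row[id_column]] = frozenset(marked)
    return labels


# Hypothetical file content, for illustration only.
sample = io.StringIO("claim,elections,taxes\nc1,x,\nc2,,\nc3,1,y\n")
annotations = read_annotations(sample, ["elections", "taxes"])
```

Representing each annotation as a set keeps claims with several topics, or with none, directly comparable across annotators.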

The annotators were asked to use only the information present in the file as much as possible (claim, headline, author, date) and to use a search engine if they were unfamiliar with particular entities or acronyms. The keywords pertaining to the target topics were removed, but the other keywords were left in place as they could provide useful information.

For each concept, detailed guidelines were provided, with positive and negative examples. You may find a few examples below.

elections

This tag should be assigned if a claim deals with an ongoing or past election or the election system. It should not be assigned if the claim was only uttered in the context of an election, even if its implicit meaning is related to an election, e.g. when a candidate makes a statement about their opponent to belittle them or a candidate presents their campaign pledge.

Positive examples:

Negative examples:

taxes

This tag should be assigned if a claim is about taxes directly. If a concept is mentioned that can be related to taxes (e.g. government spending), but the connection to taxes is background knowledge rather than part of the message of the claim, this tag should not be assigned.

Positive examples:

Negative examples:

healthcare

This tag should be assigned to claims dealing with the healthcare system or health issues in general.

Positive examples:

Negative examples:

Agreement and reconciliation

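Krippendorff’s α (MASI distance), all annotators: 0.75

Pairwise agreement:

| Annotators | Agreement |
| --- | --- |
| A1, A2 | 0.84 |
| A1, A3 | 0.66 |
| A1, A4 | 0.81 |
| A1, A5 | 0.78 |
| A2, A3 | 0.67 |
| A2, A4 | 0.85 |
| A2, A5 | 0.81 |
| A3, A4 | 0.67 |
| A3, A5 | 0.65 |
| A4, A5 | 0.78 |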
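For readers who want to recompute agreement figures of this kind, the following Python sketch implements Krippendorff's α over set-valued annotations with the MASI distance. It is a generic textbook formulation, not the script used to produce the numbers above, and the function names and input format are assumptions.

```python
from itertools import permutations


def masi_distance(a, b):
    """MASI distance between two label sets: 1 - Jaccard * monotonicity weight."""
    a, b = frozenset(a), frozenset(b)
    if a == b:
        return 0.0
    intersection = a & b
    if not intersection:
        return 1.0
    # Weight 2/3 when one set contains the other, 1/3 for a partial overlap.
    weight = 2 / 3 if (a < b or b < a) else 1 / 3
    return 1.0 - (len(intersection) / len(a | b)) * weight


def krippendorff_alpha(units, distance=masi_distance):
    """units: one list of label sets per claim, one set per annotator who labelled it."""
    units = [u for u in units if len(u) >= 2]  # only pairable values count
    n = sum(len(u) for u in units)
    if n < 2:
        raise ValueError("need at least one unit with two annotations")
    # Observed disagreement: ordered pairs within each unit.
    d_o = sum(
        sum(distance(x, y) for x, y in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: ordered pairs over all values, regardless of unit.
    values = [v for u in units for v in u]
    d_e = sum(distance(x, y) for x, y in permutations(values, 2)) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e
```

Restricting each unit to the label sets of two annotators gives a pairwise figure; passing all available annotations gives the overall one.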

The reconciliation script will be made available in the final version.

The final reconciled dataset can be found in gold_updated.csv.
