75 News

75 News dataset contains 15 Turkish news articles from 5 different topics: economics, magazine, health, politics and sports. Dataset is genereated by Kemik Natural Language Processing Group.

The main goal for this dataset is text classification. However, as result of the structure of the given news, it may also be used for sentence tokenization.

Dataset Details

This dataset consists of 15 texts, each of which annotated with a string label denoting the type.

Label	Description
ekonomi	economics
magazin	maganize
sağlık	health
siyasi	politics
sport	sports

Samples

A sample instance is presented below.

Example:

{ 
    'topic': magazin,
    'news': George Clooney CIA ajanı oluyor. GEORGE Clooney, yeni filminde Suriye'de görev yapan eski CIA ajan'ı Robert Bauer'i canlandıracak. İnternetteki "imdb" sitesinin haberine göre, Clooney, Bauer'in "See No Evil: The True Story of a Foot Soldier in the CIA's War on Terror" adlı otobiyografisinden beyazperdeye aktarılacak filmde başrol gösterecek. Filmin yapımcılığını George Clooney ile Steven Soderbergh'in ortak olduğu Section 8 adlı yapım şirketi üstlenecek.'
}

Fields

Explain the fields of the instances.

field	dtype
news	string
topic	integer
file_id	integer

Splits

No train/validation/test split is provided by the authors.

Dataset Creation

Curation Rationale

This dataset is motivated by the lack of annotated data on Turkish text classification.

Data Source

The authors gathered the news from Milliyet.

Annotations

Each news article is human-annotated.

Quality

We observed that this dataset does not contain any duplications. Each article is edited by professional editors. The annotations are taken from the publisher's news agency.

Personal and Sensitive Information

All the news articles presented are already published to the public. Even though some personal information might be presented in the magazine articles, all of the present information is in a legal framework.

Considerations

Social Impact of Dataset

This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures.

Discussion of Biases

The data included here are from the news. Some of the presented articles may have been disclaimed.

Dataset Curators

Published by M. Fatih AMASYALI.

Citation Information

@article{amasyali2004otomatik,
  title={Otomatik haber metinleri sınıflandırma},
  author={Amasyalı, MF and Yıldırım, T},
  journal={SIU 2004},
  pages={224--226},
  year={2004}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TDD-C-202106-UNL-012.md

TDD-C-202106-UNL-012.md

75 News

Dataset Details

Samples

Fields

Splits

Dataset Creation

Curation Rationale

Data Source

Annotations

Quality

Personal and Sensitive Information

Considerations

Social Impact of Dataset

Discussion of Biases

Dataset Curators

Citation Information

Files

TDD-C-202106-UNL-012.md

Latest commit

History

TDD-C-202106-UNL-012.md

File metadata and controls

75 News

Dataset Details

Samples

Fields

Splits

Dataset Creation

Curation Rationale

Data Source

Annotations

Quality

Personal and Sensitive Information

Considerations

Social Impact of Dataset

Discussion of Biases

Dataset Curators

Citation Information