75 News

The 75 News dataset contains 15 Turkish news articles from 5 topics: economics, magazine, health, politics and sports. The dataset was generated by the Kemik Natural Language Processing Group.

The main purpose of this dataset is text classification. However, because of the structure of the news articles, it may also be used for sentence tokenization.

Dataset Details

This dataset consists of 15 texts, each of which is annotated with a string label denoting its topic.

Label Description
ekonomi economics
magazin magazine
sağlık health
siyasi politics
sport sports
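
The label table above can be sketched as a small lookup in Python. The label strings and English descriptions are taken from the table; the integer ids are an assumption made here for illustration (alphabetical order of the label strings) and are not defined by the dataset authors.

```python
# Label strings and English descriptions, as listed in the table above.
LABEL_DESCRIPTIONS = {
    "ekonomi": "economics",
    "magazin": "magazine",
    "sağlık": "health",
    "siyasi": "politics",
    "sport": "sports",
}

# Hypothetical integer ids: sorted order is our own choice, not the dataset's.
LABEL_TO_ID = {label: i for i, label in enumerate(sorted(LABEL_DESCRIPTIONS))}
ID_TO_LABEL = {i: label for label, i in LABEL_TO_ID.items()}


def encode_topic(topic: str) -> int:
    """Map a string topic label to its (assumed) integer id."""
    return LABEL_TO_ID[topic]


def describe_topic(topic: str) -> str:
    """Return the English description of a string topic label."""
    return LABEL_DESCRIPTIONS[topic]
```

For example, `describe_topic("magazin")` returns `"magazine"`, and `ID_TO_LABEL[encode_topic("siyasi")]` round-trips back to `"siyasi"`.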

Samples

A sample instance is presented below.

Example:

{
    'topic': 'magazin',
    'news': 'George Clooney CIA ajanı oluyor. GEORGE Clooney, yeni filminde Suriye'de görev yapan eski CIA ajan'ı Robert Bauer'i canlandıracak. İnternetteki "imdb" sitesinin haberine göre, Clooney, Bauer'in "See No Evil: The True Story of a Foot Soldier in the CIA's War on Terror" adlı otobiyografisinden beyazperdeye aktarılacak filmde başrol gösterecek. Filmin yapımcılığını George Clooney ile Steven Soderbergh'in ortak olduğu Section 8 adlı yapım şirketi üstlenecek.'
}
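
Since the articles can also be used for sentence tokenization, here is a minimal sketch of a naive rule-based sentence splitter applied to the opening of the sample above. The function name `split_sentences` and the splitting rule (sentence-final punctuation, then whitespace, then an uppercase letter or opening quote) are assumptions for illustration, not a method from the dataset authors. Abbreviations and other edge cases would need a proper tokenizer.

```python
import re

# Split after ., ! or ? when followed by whitespace and an uppercase
# (including Turkish uppercase) letter or an opening double quote.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-ZÇĞİÖŞÜ"])')


def split_sentences(text: str) -> list[str]:
    """Naively split a news text into sentences."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


# First two sentences of the sample instance above.
sample = (
    "George Clooney CIA ajanı oluyor. GEORGE Clooney, yeni filminde "
    "Suriye'de görev yapan eski CIA ajan'ı Robert Bauer'i canlandıracak."
)
sentences = split_sentences(sample)
```

Here `sentences` holds two items: the headline-like first sentence and the sentence that follows it.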

Fields

Each instance has the following fields.

field dtype
news string
topic integer
file_id integer

Splits

No train/validation/test split is provided by the authors.
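
Because no split is provided, users have to create their own. The following is a minimal sketch of a per-topic (stratified) hold-out split using only the standard library. It assumes the instances are available as a list of dicts with a `topic` key, as in the sample instance; the function name `stratified_split` and its parameters are hypothetical. With a dataset this small, cross-validation may be a more sensible choice than a single fixed split.

```python
import random
from collections import defaultdict


def stratified_split(instances, test_per_topic=1, seed=0):
    """Hold out `test_per_topic` instances from every topic.

    `instances` is assumed to be a list of dicts with a 'topic' key.
    Returns a (train, test) pair of lists.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split

    # Group the instances by their topic label.
    by_topic = defaultdict(list)
    for instance in instances:
        by_topic[instance["topic"]].append(instance)

    train, test = [], []
    for topic in sorted(by_topic):
        group = list(by_topic[topic])
        rng.shuffle(group)
        test.extend(group[:test_per_topic])
        train.extend(group[test_per_topic:])
    return train, test
```

Calling `train, test = stratified_split(instances)` keeps every topic represented in both parts, provided each topic has more than `test_per_topic` articles.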

Dataset Creation

Curation Rationale

This dataset is motivated by the lack of annotated data on Turkish text classification.

Data Source

The authors gathered the news from Milliyet.

Annotations

Each news article is human-annotated.

Quality

We observed that this dataset does not contain any duplicates. Each article was edited by professional editors. The annotations are taken from the publisher's news agency.

Personal and Sensitive Information

All of the news articles presented have already been published publicly. Even though some personal information might appear in the magazine articles, all of the information presented falls within a legal framework.

Considerations

Social Impact of Dataset

This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures.

Discussion of Biases

The data included here come from news articles. Some of the presented articles may have since been disclaimed.

Dataset Curators

Published by M. Fatih AMASYALI.

Citation Information

@article{amasyali2004otomatik,
  title={Otomatik haber metinleri sınıflandırma},
  author={Amasyalı, MF and Yıldırım, T},
  journal={SIU 2004},
  pages={224--226},
  year={2004}
}