75 News dataset contains 15 Turkish news articles from 5 different topics: economics, magazine, health, politics and sports. Dataset is genereated by Kemik Natural Language Processing Group.
The main goal for this dataset is text classification. However, as result of the structure of the given news, it may also be used for sentence tokenization.
This dataset consists of 15 texts, each of which annotated with a string label denoting the type.
Label | Description |
---|---|
ekonomi | economics |
magazin | maganize |
sağlık | health |
siyasi | politics |
sport | sports |
A sample instance is presented below.
Example:
{
'topic': magazin,
'news': George Clooney CIA ajanı oluyor. GEORGE Clooney, yeni filminde Suriye'de görev yapan eski CIA ajan'ı Robert Bauer'i canlandıracak. İnternetteki "imdb" sitesinin haberine göre, Clooney, Bauer'in "See No Evil: The True Story of a Foot Soldier in the CIA's War on Terror" adlı otobiyografisinden beyazperdeye aktarılacak filmde başrol gösterecek. Filmin yapımcılığını George Clooney ile Steven Soderbergh'in ortak olduğu Section 8 adlı yapım şirketi üstlenecek.'
}
Explain the fields of the instances.
field | dtype |
---|---|
news | string |
topic | integer |
file_id | integer |
No train/validation/test split is provided by the authors.
This dataset is motivated by the lack of annotated data on Turkish text classification.
The authors gathered the news from Milliyet.
Each news article is human-annotated.
We observed that this dataset does not contain any duplications. Each article is edited by professional editors. The annotations are taken from the publisher's news agency.
All the news articles presented are already published to the public. Even though some personal information might be presented in the magazine articles, all of the present information is in a legal framework.
This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures.
The data included here are from the news. Some of the presented articles may have been disclaimed.
Published by M. Fatih AMASYALI.
@article{amasyali2004otomatik,
title={Otomatik haber metinleri sınıflandırma},
author={Amasyalı, MF and Yıldırım, T},
journal={SIU 2004},
pages={224--226},
year={2004}
}