"Rossiya Segodnya" news dataset

RossiyaSegodnya/ria_news_dataset

This repository contains a news dataset presented in the paper:

Daniil Gavrilov, Pavel Kalaidin, and Valentin Malykh. Self-Attentive Model for Headline Generation. 41st European Conference on Information Retrieval, 2019. arXiv:1901.07786 [cs.CL]

To download the dataset, use the direct link or clone the repository with Git LFS.

Description

The full dataset contains 1,003,869 Russian-language news documents from January 2010 to December 2014.

  • ria_20.json contains the first 20 news documents from the dataset.

  • ria_1k.json contains the first 1000 news documents from the dataset.

  • ria.json.gz is the full dataset, gzip-compressed.

Dataset format: each row contains a JSON document with two fields: text is the document body and title is the news headline.
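
The format described above can be read one row at a time. The following is a minimal sketch, not part of the original repository: the helper name `iter_documents` is hypothetical, and it assumes the files are UTF-8 encoded with exactly one JSON document per line.

```python
import gzip
import json
from typing import Dict, Iterator


def iter_documents(path: str) -> Iterator[Dict[str, str]]:
    """Yield one {"text": ..., "title": ...} dict per non-empty row of `path`."""
    # Assumption: names ending in .gz are gzip-compressed, anything else is plain text.
    opener = gzip.open if path.endswith(".gz") else open
    # Assumption: the files are UTF-8 encoded.
    with opener(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank rows, if any
                yield json.loads(line)
```

For example, `for doc in iter_documents("ria_20.json"): print(doc["title"])` would print the headlines of the 20-document sample. Because the function is a generator, the same call on ria.json.gz streams the full dataset without loading it into memory at once.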

License

This data is licensed by the Rossiya Segodnya news agency (ria.ru) under the CC-BY-ND-NC license. The license text is in the LICENSE file, and the Russian version of the same license is in LICENSE.ru.

Misc

If you use the data in your research, please consider citing the paper mentioned above:

@inproceedings{gavrilov2018self,
	title={Self-Attentive Model for Headline Generation},
	author={Gavrilov, Daniil and Kalaidin, Pavel and Malykh, Valentin},
	booktitle={Proceedings of the 41st European Conference on Information Retrieval},
	year={2019}
}