Skip to content

Latest commit

 

History

History
101 lines (63 loc) · 3.55 KB

TDD-C-202106-CC-001.md

File metadata and controls

101 lines (63 loc) · 3.55 KB

TasvirET

This dataset consists of Turkish captions for Flickr8 Dataset. The dataset is collected to be used in automatically describing images for Turkish. One or multiple captions for each image are gathered and tokenized. The original page of the paper can be found at TasvirET Webpage.

Dataset Details

Annotations for this dataset is collected by an online platform created by the dataset creators. This dataset consists 12222 captures for 8000 images. Each sample has image id, split information, filename, and one or more raw texts, and tokenized texts.

Samples

{
"split":"train",
"filename":"2513260012_03d33305cf.jpg",
"imgid":0,
"sentences":
    [{"sentid":0,
        "imgid":0,
        "raw": "Sarı ve siyah tüylü iki köpek karda koşuyor.",
        "tokens": ["Sarı","ve","siyah","tüylü","iki","köpek","karda","koşuyor"]},
    {"sentid":1,
        "imgid":0,
        "raw":"Kar üstünde koşan iki köpek.",
        "tokens": ["Kar","üstünde","koşan","iki","köpek"]}],
"sentids":[0,1]
}

Fields

field dtype
split string
filename string
imgid integer
sentences dict list
sentid integer
raw string
tokens string list
sentids integer list

Splits

Indicated train/validation/test split sizes.

Training Validation Test
6000 1000 1000

Dataset Creation

Curation Rationale

Image captionin datasets are available for mostly English. Purpose of creating this dataset is to propose firts Turkish Image Captioning dataset and prompt researchers to work on this area. Therefore, the dataset is created to be used in studies about automatically generating Turkish descriptions for images.

Data Source

Images are taken from Flickr8 dataset.

Annotations

Annotations are collected from Turkish Native speakers using an online platform. This platform is created by the dataset creators. Annotators signed to this platform are examined with a qualification exam. Then they are able to create captions for images or assess the quality of the captions.

Discussion of Biases

The captions included here are crowd-sourced. Some captions might be misleading.

Additional Information

Dataset Curators

Published by M. E. Unal, B. Citamak, S. Yagcioglu, A. Erdem, E. Erdem, N. Ikizler Cinbis and R. Cakici.

Citation Information

Please cite the following paper if you found this dataset useful:

M. E. Unal, B. Citamak, S. Yagcioglu, A. Erdem, E. Erdem, N. Ikizler Cinbis and R. Cakici. TasvirEt: Görüntülerden Otomatik Türkçe Açıklama Oluşturma İçin Bir Denektaşı Veri Kümesi (TasvirEt: A Benchmark Dataset for Automatic Turkish Description Generation from Images). 24. IEEE Sinyal İşleme ve İletişim Uygulamaları Kurultayı (SIU 2016), Zonguldak, May 2016

@inproceedings{unaltasviret,
title={TasvirEt: A Benchmark Dataset for Automatic Turkish Description Generation from Images},
author={Unal, Mesut Erhan and Citamak, Begum and Yagcioglu, Semih and Erdem, Aykut and Erdem, Erkut and Ikizler Cinbis, Nazli and Cakici, Ruket},
booktitle={Signal Processing and Communications Applications Conference (SIU), 2016 24th},
pages=,
year=,
organization=}

Uploaded and documented by Arda Goktogan: ardagoktogan gmail com and Burak Kizil: mkizil19 ku edu tr