TasvirET

This dataset consists of Turkish captions for Flickr8 Dataset. The dataset is collected to be used in automatically describing images for Turkish. One or multiple captions for each image are gathered and tokenized. The original page of the paper can be found at TasvirET Webpage.

Dataset Details

Annotations for this dataset is collected by an online platform created by the dataset creators. This dataset consists 12222 captures for 8000 images. Each sample has image id, split information, filename, and one or more raw texts, and tokenized texts.

Samples

{
"split":"train",
"filename":"2513260012_03d33305cf.jpg",
"imgid":0,
"sentences":
    [{"sentid":0,
        "imgid":0,
        "raw": "Sarı ve siyah tüylü iki köpek karda koşuyor.",
        "tokens": ["Sarı","ve","siyah","tüylü","iki","köpek","karda","koşuyor"]},
    {"sentid":1,
        "imgid":0,
        "raw":"Kar üstünde koşan iki köpek.",
        "tokens": ["Kar","üstünde","koşan","iki","köpek"]}],
"sentids":[0,1]
}

Fields

field	dtype
split	string
filename	string
imgid	integer
sentences	dict list
sentid	integer
raw	string
tokens	string list
sentids	integer list

Splits

Indicated train/validation/test split sizes.

Training	Validation	Test
6000	1000	1000

Dataset Creation

Curation Rationale

Image captionin datasets are available for mostly English. Purpose of creating this dataset is to propose firts Turkish Image Captioning dataset and prompt researchers to work on this area. Therefore, the dataset is created to be used in studies about automatically generating Turkish descriptions for images.

Data Source

Images are taken from Flickr8 dataset.

Annotations

Annotations are collected from Turkish Native speakers using an online platform. This platform is created by the dataset creators. Annotators signed to this platform are examined with a qualification exam. Then they are able to create captions for images or assess the quality of the captions.

Discussion of Biases

The captions included here are crowd-sourced. Some captions might be misleading.

Additional Information

Dataset Curators

Published by M. E. Unal, B. Citamak, S. Yagcioglu, A. Erdem, E. Erdem, N. Ikizler Cinbis and R. Cakici.

Citation Information

Please cite the following paper if you found this dataset useful:

M. E. Unal, B. Citamak, S. Yagcioglu, A. Erdem, E. Erdem, N. Ikizler Cinbis and R. Cakici. TasvirEt: Görüntülerden Otomatik Türkçe Açıklama Oluşturma İçin Bir Denektaşı Veri Kümesi (TasvirEt: A Benchmark Dataset for Automatic Turkish Description Generation from Images). 24. IEEE Sinyal İşleme ve İletişim Uygulamaları Kurultayı (SIU 2016), Zonguldak, May 2016

@inproceedings{unaltasviret,
title={TasvirEt: A Benchmark Dataset for Automatic Turkish Description Generation from Images},
author={Unal, Mesut Erhan and Citamak, Begum and Yagcioglu, Semih and Erdem, Aykut and Erdem, Erkut and Ikizler Cinbis, Nazli and Cakici, Ruket},
booktitle={Signal Processing and Communications Applications Conference (SIU), 2016 24th},
pages=,
year=,
organization=}

Uploaded and documented by Arda Goktogan: ardagoktogan gmail com and Burak Kizil: mkizil19 ku edu tr

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TDD-C-202106-CC-001.md

TDD-C-202106-CC-001.md

TasvirET

Dataset Details

Samples

Fields

Splits

Dataset Creation

Curation Rationale

Data Source

Annotations

Discussion of Biases

Additional Information

Dataset Curators

Citation Information

Files

TDD-C-202106-CC-001.md

Latest commit

History

TDD-C-202106-CC-001.md

File metadata and controls

TasvirET

Dataset Details

Samples

Fields

Splits

Dataset Creation

Curation Rationale

Data Source

Annotations

Discussion of Biases

Additional Information

Dataset Curators

Citation Information