iambusra/CNLI-TR_augmentation

CNLI-TR Augmentation Pipeline

Summary

This repository contains the text augmentation pipeline for the seed data of CNLI-TR.

Introduction

CNLI-TR is a Turkish challenge dataset created to assess the natural language inference (NLI) abilities of language models. It contains sentence triplets: one sentence that is potentially ambiguous between a de re and a de dicto reading, one de re paraphrase, and one de dicto paraphrase.

All of the seed data was manually generated by trained linguists who are native speakers of Turkish.

The code and resources in this repository are used to augment the seed data to create a large, manually corrected NLI dataset.

Contents

List of Turkish given names (names.csv): This .csv file contains XX Turkish proper names scraped from the web [1]. The gender [2] and origin [3] of each name are indicated in the corresponding columns.

Turkish intensional operators list (): This .csv file contains XXX Turkish intensional operators [4].

Seed data (annotated version): The seed data was manually generated by trained linguists who are native speakers of Turkish. It consists of sentence triplets: one sentence that is potentially ambiguous between a de re and a de dicto reading, one de re paraphrase, and one de dicto paraphrase.

Unique sentence ID generator (id_generator.py): Each sentence in the seed and augmented datasets has a unique alphanumeric identifier. A sentence ID consists of three letters followed by an underscore and a 5-digit number. In seed data IDs, the initial letter indicates the contributor who wrote the sentence. The IDs were created by an algorithm that generates random numbers and strings.

Augmentation pipeline (): The augmentation pipeline uses seed data to generate sentence triplets. Details will be revealed soon.
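The sentence ID format described above (three letters, an underscore, and a 5-digit number) can be illustrated with a minimal sketch. This is not the code in id_generator.py. The function names, the use of uppercase letters, the zero-padding of the number, and the optional `contributor_initial` argument are all assumptions made for illustration.

```python
import random
import string


def make_sentence_id(rng, contributor_initial=None):
    """Build one candidate ID such as 'BQX_00417' (format assumed, see above)."""
    # Three random letters; uppercase is an assumption, the README does not specify case.
    letters = [rng.choice(string.ascii_uppercase) for _ in range(3)]
    # For seed data, the first letter identifies the contributor.
    if contributor_initial is not None:
        letters[0] = contributor_initial.upper()
    # 5-digit number, zero-padded (padding is an assumption).
    number = rng.randrange(100000)
    return f"{''.join(letters)}_{number:05d}"


def generate_unique_ids(count, contributor_initial=None, seed=None):
    """Return `count` distinct IDs, re-drawing whenever a collision occurs."""
    rng = random.Random(seed)
    seen = set()
    ids = []
    while len(ids) < count:
        candidate = make_sentence_id(rng, contributor_initial)
        if candidate not in seen:
            seen.add(candidate)
            ids.append(candidate)
    return ids
```

Calling `generate_unique_ids(3, contributor_initial="b", seed=0)` would return three distinct IDs that all start with `B`. Tracking already issued IDs in a set is one simple way to guarantee uniqueness; the actual generator may handle collisions differently.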

=== Machine-readable metadata ================================
Data available since: 11.2022
License: CC BY-SA 4.0
Includes text: yes
Contributors: Marşan, Büşra; Atlamaz, Ümit; Demirok, Ömer; Kuzgun, Aslı; 
Oksal, Ceren; Doğan, Merve; Gök, Serra; Korkmaz, Arda
Contact: busra.marsan@boun.edu.tr 
===============================================================================

Footnotes

  1. https://isimbulamadim.com/

  2. "f" for feminine, "m" for masculine, "u" for unisex.

  3. "ar" for Arabic, "ge" for Georgian, "gr" for Greek, "hb" for Hebrew, "mg" for Mongolian, "pr" for Persian, "tr" for Turkish, and "n/a" for unknown origins. Please note that some names are recorded as having two origins, e.g. ar-tr.

  4. Intensional operator: Any expression O that combines with sentences φ to form well-formed expressions (usually sentences) Oφ and whose extension [[O]]^{M,i} at an index i in a model M takes sentential intensions, i.e. functions from indices to truth values. (Wehmeier, K. F. (2018). Are quantifiers intensional operators? Inquiry.)
