Semantic textual similarity dataset for the Turkish language. It is a machine translation (Azure) of the STS English dataset.

emrecncelik/sts-benchmark-tr

STSb Turkish

A semantic textual similarity dataset for the Turkish language. It is a machine translation (Azure) of the English STSb dataset. The translations have not been reviewed by expert human translators. The dataset is also available on Hugging Face Datasets.

Download

From the repository

import io
import requests
import pandas as pd

STS_URLS = {
    "train": "https://raw.githubusercontent.com/emrecncelik/sts-benchmark-tr/main/sts-train-tr.csv",
    "dev": "https://raw.githubusercontent.com/emrecncelik/sts-benchmark-tr/main/sts-dev-tr.csv",
    "test": "https://raw.githubusercontent.com/emrecncelik/sts-benchmark-tr/main/sts-test-tr.csv",
}

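def get_github_dataset(dataset_url: str) -> pd.DataFrame:
    """Download one split from GitHub and parse it into a DataFrame."""
    response = requests.get(dataset_url, timeout=30)
    response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
    return pd.read_csv(io.StringIO(response.content.decode("utf-8")))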
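
The download helper can be applied to each split in turn. The following is a minimal sketch, assuming a hypothetical `load_splits` helper that is not part of the repository. It takes the URL mapping and a loader callable, so the download function can be swapped out (for example, in offline tests).

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def load_splits(urls: Dict[str, str], loader: Callable[[str], T]) -> Dict[str, T]:
    """Apply `loader` to every split URL and return the results keyed by split name.

    `load_splits` is a hypothetical helper for illustration only.
    """
    return {split: loader(url) for split, url in urls.items()}


# Usage with the names defined above (requires network access):
# splits = load_splits(STS_URLS, get_github_dataset)
# train_df = splits["train"]
```

Passing the loader as an argument keeps the helper independent of `requests`, which is a design choice of this sketch rather than something the repository prescribes.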

From Hugging Face Datasets

from datasets import load_dataset

dataset = load_dataset("emrecan/stsb-mt-turkish")
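
STSb similarity scores lie in the range 0 to 5, and some training setups expect targets in the range 0 to 1. The sketch below shows one way to rescale them. The `normalize_score` helper is hypothetical, and the `"score"` column name in the commented usage is an assumption that should be checked against the loaded dataset's columns.

```python
def normalize_score(score: float, max_score: float = 5.0) -> float:
    """Map a similarity score from [0, max_score] to [0, 1].

    Hypothetical helper; max_score=5.0 reflects the STSb scoring scale.
    """
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return score / max_score


# Hypothetical usage; assumes each row has a "score" column:
# dataset = dataset.map(lambda row: {"score": normalize_score(row["score"])})
```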
