cisnlp/GlotStoryBook

GlotStoryBook Corpus

Storybooks covering 180 ISO 639-3 language codes.

Usage (HF Loader)

The data is available on HuggingFace (HF) at: https://huggingface.co/datasets/cis-lmu/GlotStoryBook.

from datasets import load_dataset
dataset = load_dataset('cis-lmu/GlotStoryBook')
print(dataset['train'][0]) # First row data

Download

If you prefer not to use the HF data loader, download the CSV file directly:

! wget https://huggingface.co/datasets/cis-lmu/GlotStoryBook/resolve/main/GlotStoryBook.csv
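For environments without wget, the same file can be fetched from Python using only the standard library. This is a minimal sketch: the URL is the one from the command above, while the helper names `local_name` and `download` are illustrative and not part of this repository.

```python
import urllib.request
from pathlib import Path

# URL copied from the wget command above.
CSV_URL = (
    "https://huggingface.co/datasets/cis-lmu/GlotStoryBook"
    "/resolve/main/GlotStoryBook.csv"
)


def local_name(url: str) -> str:
    """Return the last path component of the URL, used as the local file name."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def download(url: str = CSV_URL, dest_dir: str = ".") -> Path:
    """Download the file at `url` into `dest_dir`, skipping it if already present."""
    dest = Path(dest_dir) / local_name(url)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


if __name__ == "__main__":
    # Requires network access; prints the path of the downloaded CSV.
    print(download())
```

Skipping the download when the file already exists keeps repeated runs cheap; delete the local copy to force a fresh fetch.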

License and Copyright

We do not own any of the text from which these data have been extracted. All files were collected from the repository located at https://github.com/global-asp/. The source repository for each text and file is stored in the dataset. Each file in the dataset is associated with one license from the CC family. The license labels that appear are 'CC BY', 'CC BY-NC', 'CC BY-NC-SA', 'CC-BY', 'CC-BY-NC', and 'Public Domain'. We license the code, the packaging, and the metadata of these data under CC0 1.0.
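Because the license labels come in two spellings (for example 'CC BY' and 'CC-BY'), it can help to normalize them before filtering. The sketch below is one possible approach, not the authors' tooling: the column name `License` and the helper names are assumptions, so check the actual column names in the dataset before using it. Treating every label without an NC component as commercially usable is also a simplification made for this example.

```python
def normalize_license(label: str) -> str:
    """Map spelling variants such as 'CC BY' and 'CC-BY' to one hyphenated form."""
    return "-".join(label.strip().upper().replace("-", " ").split())


def allows_commercial_use(label: str) -> bool:
    """Return True when the normalized label has no NC (NonCommercial) component."""
    return "NC" not in normalize_license(label).split("-")


def filter_commercial(rows, license_column="License"):
    """Keep rows whose license label has no NC component.

    `rows` is any iterable of dict-like records; `license_column` is a
    hypothetical column name and may differ in the real dataset.
    """
    return [row for row in rows if allows_commercial_use(row[license_column])]


if __name__ == "__main__":
    # Synthetic rows for illustration only; these are not taken from the dataset.
    sample = [
        {"License": "CC BY"},
        {"License": "CC-BY-NC"},
        {"License": "CC BY-NC-SA"},
        {"License": "Public Domain"},
    ]
    for row in filter_commercial(sample):
        print(normalize_license(row["License"]))
```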

Sources

global-asp, asp-source, lcb-source, pb-source, sbc-source, gasp-mexico, global-pb, global-lcb, sbjm-source, sbug-source, sbno-source, sbk-source, sbuk-source, lida-source, asp-raw-db, global-lida, gasp-alternates, asp-new

Citation

If you use any part of this code and data in your research, please cite it (along with https://github.com/global-asp/) using the following BibTeX entry. This work is part of the GlotLID project.

@inproceedings{
  kargaran2023glotlid,
  title={{GlotLID}: Language Identification for Low-Resource Languages},
  author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023},
  url={https://openreview.net/forum?id=dl4e3EBz5j}
}