Nan.ai English-Filipino Corpus Open Data

The Nan.ai English-Filipino Corpus is an open source initiative to build an open data repository of Filipino-English corpus data for training cross-lingual natural language models, such as those behind chatbots and topic models. The corpus is currently maintained to train Nan.ai support, the chatbot that provides customer support to Nan.ai users. Supported by UNICEF Innovations, the initiative aims to start collaboration on enriching low-resource languages such as Filipino and to jumpstart NLP applications for them.

You can participate by (1) contributing language data or (2) annotating existing datasets. We also welcome computing and linguistics experts to improve this repository's usability for various use cases.

To explore our datasets, you can use the existing NLP notebooks available here or import data by following the instructions here.

Description of the data

Language data is collected from the public domain and stored as text files. These files are grouped by use (spoken or written text) and labeled by source (e.g. reportage, conversation). Data available in this repository is annotated and, where a dataset contains PII, anonymized.
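As a minimal sketch of how text files grouped by use and labeled by source could be read, the Python below walks a directory tree and collects the files into a nested dictionary. The layout `<root>/<use>/<source>/*.txt` and the function name `load_corpus` are assumptions made for illustration; check the repository's actual folder structure and adjust the glob pattern accordingly.

```python
from pathlib import Path


def load_corpus(root):
    """Read every .txt file under root/<use>/<source>/ into a nested dict.

    The directory layout is an assumed one for illustration:
    <use> would be e.g. "spoken" or "written", and <source> would be
    e.g. "reportage" or "conversation".
    Returns {use: {source: [text, ...]}}.
    """
    corpus = {}
    # sorted() keeps the order of texts deterministic across platforms
    for path in sorted(Path(root).glob("*/*/*.txt")):
        use = path.parent.parent.name   # first folder level under root
        source = path.parent.name       # second folder level under root
        text = path.read_text(encoding="utf-8")
        corpus.setdefault(use, {}).setdefault(source, []).append(text)
    return corpus
```

Calling `load_corpus("path/to/data")` with the path to a local copy of the data would return a dictionary keyed first by use and then by source, which makes it straightforward to select, say, only written reportage for a given experiment.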

We are also creating derived datasets, annotated from the corpus data, such as stoplists, labeled sentiments, and domain-specific dictionaries.
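To illustrate what a derived dataset such as a stoplist might look like, the sketch below picks the most frequent tokens across a set of texts. This frequency heuristic is one common way to draft a stoplist and is not necessarily how this project builds its own; the function name `build_stoplist`, the tokenization rule, and the `top_n` cutoff are all assumptions.

```python
import re
from collections import Counter

# Letters (including accented ones), optionally joined by a hyphen or
# apostrophe, so that forms like "mag-aral" stay together as one token.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def build_stoplist(texts, top_n=50):
    """Return the top_n most frequent lowercase tokens across texts."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    return [word for word, _ in counts.most_common(top_n)]
```

A list drafted this way would normally be reviewed by a speaker of the language before use, since very frequent words are not always safe to discard for every task.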

Alongside our open data initiative, we are also open sourcing a related machine learning service, NAN.ai Natural Language Understanding (NLU).

Navigate this project

Resources

License

nanai-opendata-corpus is licensed under the Creative Commons Zero v1.0 Universal license.

Saphron-Asia/nan.ai-opendata-corpus