Nan.ai English-Filipino Corpus is an open-source initiative to create an open data repository of Filipino-English corpus data for training cross-lingual natural language models for chatbots, topic models, and other applications. The corpus is currently maintained to train Nan.ai support, the chatbot that provides customer support for Nan.ai users. Supported by UNICEF Innovations, this initiative aims to start collaboration on enriching low-resource languages such as Filipino and to jumpstart NLP applications.
You can participate by (1) contributing language data or (2) annotating existing datasets. We also welcome computing and linguistics experts to improve this repository's usability for various use cases.
To explore our datasets, you can use the existing NLP notebooks available here or import data by following the instructions here.
Language data is collected from the public domain and stored as text files. These text files are grouped by use (spoken or written texts) and labeled by source (e.g. reportage, conversation). Data available in this repository is annotated and, where a dataset contains PII, anonymized.
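As a rough illustration of how text files grouped this way could be read, the sketch below assumes a hypothetical layout of `<root>/<use>/<source>/*.txt` (for example `spoken/conversation/`). The folder names, the `iter_corpus` function, and the `CorpusText` record are illustrative assumptions, not the repository's actual structure or API.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class CorpusText(NamedTuple):
    """One text file plus the labels inferred from its (assumed) folder path."""
    use: str      # assumed top-level folder, e.g. "spoken" or "written"
    source: str   # assumed second-level folder, e.g. "reportage", "conversation"
    path: Path
    text: str


def iter_corpus(root: Path) -> Iterator[CorpusText]:
    """Yield every .txt file found at root/<use>/<source>/ (hypothetical layout)."""
    for path in sorted(root.glob("*/*/*.txt")):
        source_dir = path.parent
        yield CorpusText(
            use=source_dir.parent.name,
            source=source_dir.name,
            path=path,
            text=path.read_text(encoding="utf-8"),
        )
```

If the real directory layout differs, only the glob pattern and the two label lookups would need to change.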
We are also creating derived datasets annotated from the corpus data, such as stoplists, labeled sentiments, and domain-specific dictionaries.
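To make the idea of a derived dataset concrete, here is a minimal sketch of one common heuristic for drafting a stoplist: take the most frequent tokens across the corpus texts. This is not necessarily how the project's stoplists are produced. The `build_stoplist` function, its `top_n` parameter, and the simple regex tokenizer are all assumptions for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List

# Runs of letters, optionally joined by a hyphen or apostrophe
# (so forms like "mag-aral" stay together). Illustrative tokenizer only.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def build_stoplist(texts: Iterable[str], top_n: int = 50) -> List[str]:
    """Return the top_n most frequent lowercase tokens as stoplist candidates."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(token.lower() for token in TOKEN_RE.findall(text))
    return [word for word, _ in counts.most_common(top_n)]
```

A frequency list like this is only a starting point. Candidates would still need review by someone who knows the language, since frequent content words can end up in the list.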
Alongside our open data initiative, we are also open sourcing a related machine learning service, NAN.ai Natural Language Understanding (NLU).
- Documentation
- Issue tracking
- Discussion board
- How to contribute data
nanai-opendata-corpus is licensed under the Creative Commons Zero v1.0 Universal license.