This repository describes the Indic NLP resources from L3Cube.

l3cube-pune/indic-nlp

L3Cube-IndicNLP

L3Cube's IndicNLP project is an effort to improve NLP resources for Indic languages. We have created monolingual BERT models for 10 Indic languages. We have also released monolingual and multilingual (cross-lingual) Sentence BERT models. These models provide state-of-the-art results on downstream tasks.

Monolingual BERT models for Indic languages

More details about these models can be found in the paper.

| Model | Link |
|---|---|
| Marathi BERT | model |
| Hindi BERT | model |
| Dev BERT (Hindi + Marathi) | model |
| Kannada BERT | model |
| Telugu BERT | model |
| Malayalam BERT | model |
| Tamil BERT | model |
| Gujarati BERT | model |
| Oriya BERT | model |
| Bengali BERT | model |
| Punjabi BERT | model |
| Assamese BERT | model |
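
As a rough illustration of how one of the models listed above might be used, the sketch below loads a BERT checkpoint with the Hugging Face `transformers` library and returns the `[CLS]` vector for a sentence. The model links are not preserved in this text, so the identifier `l3cube-pune/marathi-bert-v2` is an assumption and should be checked against the actual model page; the helper name `embed_cls` is hypothetical. The function needs `transformers` and `torch` installed and downloads weights on first use.

```python
# Assumed Hub identifier, not confirmed by this README; verify before use.
DEFAULT_MODEL_ID = "l3cube-pune/marathi-bert-v2"


def embed_cls(text, model_id=DEFAULT_MODEL_ID):
    """Return the [CLS] token vector of `text` as a list of floats.

    Hypothetical helper: imports are done lazily so that defining the
    function does not require transformers or torch to be installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout for deterministic inference

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, tokens, hidden); take token 0 of item 0.
    return outputs.last_hidden_state[0, 0].tolist()
```

Calling `embed_cls("...")` with a sentence in the target language should return one vector whose length equals the hidden size of the model. Other checkpoints can be tried by passing a different `model_id`.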

Indic Sentence BERT models

More details about these models can be found in the paper.

| Similarity Model | Sentence BERT |
|---|---|
| Marathi Similarity | Marathi SBERT |
| Hindi Similarity | Hindi SBERT |
| Kannada Similarity | Kannada SBERT |
| Telugu Similarity | Telugu SBERT |
| Malayalam Similarity | Malayalam SBERT |
| Tamil Similarity | Tamil SBERT |
| Gujarati Similarity | Gujarati SBERT |
| Oriya Similarity | Oriya SBERT |
| Bengali Similarity | Bengali SBERT |
| Punjabi Similarity | Punjabi SBERT |
| Indic Similarity (multilingual) | Indic SBERT (multilingual) |
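
Sentence embedding models such as those listed above are commonly used by comparing embeddings with cosine similarity. The sketch below shows only that comparison step in plain Python. The vectors are toy values standing in for real sentence embeddings, and the function names are hypothetical, not part of any L3Cube release.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
query = [1.0, 0.0, 1.0]
candidates = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 1.0],  # partially aligned with the query
]
ranking = rank_by_similarity(query, candidates)
```

With real models, `query` and `candidates` would be the embeddings produced for a query sentence and a set of candidate sentences. For the multilingual model, the sentences could be in different languages.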

License

All the resources are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The datasets are released to the community for research purposes only and the group is not responsible for any misuse of these datasets.

Citing

@article{joshi2022l3cube_hind,
  title={L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages},
  author={Joshi, Raviraj},
  journal={arXiv preprint arXiv:2211.11418},
  year={2022}
}
@article{deode2023l3cube,
  title={L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT},
  author={Deode, Samruddhi and Gadre, Janhavi and Kajale, Aditi and Joshi, Ananya and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2304.11434},
  year={2023}
}

Publications

Joshi, Raviraj. "L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages." arXiv preprint arXiv:2211.11418 (2022).
Deode, Samruddhi, et al. "L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT." arXiv preprint arXiv:2304.11434 (2023).
Mirashi, Aishwarya, et al. "L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages." arXiv preprint arXiv:2401.02254 (2024).

This project is led by Raviraj Joshi under L3Cube Labs, Pune. For any queries, contact ravirajoshi@gmail.com.
