Boli-corpus-stats-website

Although speakers of Indian languages such as Hindi and Bengali make up a large share of the world's population, these languages are still considered low-resource: only the IITB Hi-En corpus has more than 1 million parallel aligned sentences. In the PIB corpus, the largest publicly available multilingual training corpus for Indian languages as of March 2021, most other language pairs do not even reach one lakh (100,000) parallel segments. That little data is not enough for data-hungry NMT models. We therefore set out to fill this gap and improve results for Indic machine translation by following the steps used to collect the IITB corpus, surveying the publicly available datasets, and creating the Boli corpus.

The website for the corpus is hosted here. The scripts used to create the corpus can be found here.

By Kaivalya and Vedant, supervised by Prof. Parag Singla.

