
Scripts used to create an interactive website displaying the stats for Boli, the Indic multilingual training corpus developed by us


Vedant2311/Boli-corpus-stats-website

 
 


Boli-corpus-stats-website

Although speakers of Indian languages such as Hindi and Bengali make up a large share of today’s population, these languages are considered low-resource: only the IITB Hi-En corpus has more than 1 million parallel aligned sentences. In the PIB corpus, the largest publicly available multilingual training corpus for Indian languages as of March 2021, most of the other language pairs did not even reach one lakh (100,000) parallel segments. Such a small amount of data is not enough for data-hungry NMT models. We therefore aimed to fill this gap and improve results for Indic machine translation by following the approach of the IITB corpus collection, surveying the publicly available datasets, and creating the Boli corpus.

The website for the corpus is hosted here. The scripts for the creation of the corpus can be found here.

By Kaivalya and Vedant, supervised by Prof. Parag Singla.

Languages

  • HTML 55.0%
  • Python 41.1%
  • CSS 3.4%
  • Shell 0.5%