
Fine-tuning Sentence-Transformers for Multi-class Language Identification Task with PyTorch Lightning

This repository performs language identification using SentenceTransformer, a family of pre-trained transformer-based models for producing sentence embeddings.

A list of pre-trained SentenceTransformer models can be found in [2].

I specifically used a task-agnostic, English pre-trained SentenceTransformer model [1] to extract features from 100 documents per language, then trained a single linear classifier on the extracted features.

Figure: Architecture of the approach. A pre-trained SentenceTransformer embeds each document, and the resulting embeddings are used to train a single linear classifier.
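
To make the pipeline concrete, here is a minimal sketch of the feature-extraction step, assuming only the sentence-transformers package is installed; the example documents are illustrative, and the repository's actual data loading may differ.

```python
# Minimal sketch of the feature-extraction step. The example documents
# are illustrative, not the repository's dataset.
from sentence_transformers import SentenceTransformer

# Load the English pre-trained encoder, used here as a frozen feature extractor.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "This is an English sentence.",
    "Ceci est une phrase en français.",
]

# encode() returns one fixed-size embedding per document
# (384 dimensions for all-MiniLM-L6-v2); these embeddings are
# the features the linear classifier is trained on.
embeddings = model.encode(documents)
print(embeddings.shape)  # (2, 384)
```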

Python version: 3.10.8

Train from Scratch

1. Clone the repo:

   git clone https://github.com/kayodeolaleye/multilang-identification.git
   cd multilang-identification

2. Install requirements:

   pip install -r requirements.txt

3. Train the model (a minimal sketch of the training module follows these steps):

   python training.py --model_name all-MiniLM-L6-v2 --epochs 1000 --batch_size 32
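
For orientation, the trainable part of the system can be expressed as a small LightningModule like the sketch below. The class name, default embedding dimension (384 for all-MiniLM-L6-v2), number of languages, and learning rate are assumptions for illustration, not the exact contents of training.py.

```python
# A hedged sketch of a Lightning module for this setup: one linear
# layer trained on pre-computed SentenceTransformer embeddings.
# All names and defaults are assumptions, not the repository's code.
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LinearLanguageClassifier(pl.LightningModule):
    def __init__(self, embed_dim: int = 384, num_languages: int = 20, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        # The entire trainable model is a single linear layer over the
        # frozen SentenceTransformer embeddings. num_languages should be
        # set to however many languages the dataset contains.
        self.classifier = torch.nn.Linear(embed_dim, num_languages)

    def forward(self, embeddings):
        return self.classifier(embeddings)

    def training_step(self, batch, batch_idx):
        embeddings, labels = batch
        loss = F.cross_entropy(self(embeddings), labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
```

Training then reduces to passing batches of (embedding, label) pairs to a standard `pl.Trainer().fit(...)` call.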

Figure: Embeddings from the pre-trained models all-MiniLM-L6-v2 and all-MiniLM-L12-v2, respectively.
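
A plot like this can be reproduced by projecting the embeddings to 2-D. The sketch below uses PCA as a stand-in for whatever projection the original figures used, and the documents are illustrative.

```python
# Hedged sketch of a 2-D embedding plot for both encoders.
# PCA is an assumption; the original figures may use a different projection.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

docs = [
    "This is an English sentence.",
    "Ceci est une phrase en français.",
    "Dies ist ein deutscher Satz.",
    "Esta es una frase en español.",
]

for name in ("all-MiniLM-L6-v2", "all-MiniLM-L12-v2"):
    embeddings = SentenceTransformer(name).encode(docs)
    coords = PCA(n_components=2).fit_transform(embeddings)  # project 384-d -> 2-d
    plt.scatter(coords[:, 0], coords[:, 1], label=name)

plt.legend()
plt.show()
```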

Figure: Learning curves for the single linear classifier.

Figure: Performance on the test set.

ToDo: Example Usage


Add code snippets for loading the model weights and assessing performance on test samples in Google Colab.
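
Until that notebook lands, here is a hedged sketch of what such usage could look like, assuming the classifier was saved as a Lightning checkpoint and reusing the hypothetical LinearLanguageClassifier class from the training sketch above; the checkpoint path is a placeholder.

```python
# Hedged inference sketch. LinearLanguageClassifier is the hypothetical
# class from the training sketch above; the checkpoint path is a placeholder.
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
classifier = LinearLanguageClassifier.load_from_checkpoint("checkpoints/last.ckpt")
classifier.eval()

texts = ["Bonjour tout le monde", "Hello world"]
with torch.no_grad():
    # Embed the test samples, then classify the frozen embeddings.
    embeddings = torch.from_numpy(encoder.encode(texts))
    predictions = classifier(embeddings).argmax(dim=-1)

# Predicted language indices; the index-to-language mapping
# depends on how the dataset's labels were encoded.
print(predictions)
```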

References

  1. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
  2. SentenceTransformers Pre-trained Models
  3. Language Identification Dataset

About

A PyTorch Lightning Implementation of Multi-Language Identification using a SentenceTransformer model pre-trained on English. Work done while interning at ByteFuse.
