Multiverb IVA MT

Generating diverse verb variants with VerbNet and Conditional Beam Search to improve the translation of Intelligent Virtual Assistant (IVA) training sets.

Installation

You can easily install multiverb_iva_mt from PyPI:

pip install iva_mt

This command will download and install the latest version of multiverb_iva_mt along with its required dependencies.

Usage

from iva_mt.iva_mt import IVAMT

translator = IVAMT(src_lang="en", tgt_lang="pl")
# single-best translation
translator.translate("set the temperature on <a>my<a> thermostat")
# multi-variant translation
translator.generate_alternative_translations("set the temperature on <a>my<a> thermostat")

Available languages (en2xx): pl, es, de, fr, pt, sv, zh, ja, tr, hi
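
Multi-variant output can be used, for example, to augment a set of annotated utterances. Below is a minimal sketch built only on the calls shown above; the utterance list is hypothetical and the exact return types are not documented here, so the results are simply printed:

from iva_mt.iva_mt import IVAMT

translator = IVAMT(src_lang="en", tgt_lang="pl")

# Hypothetical English utterances with slot annotations
utterances = [
    "set the temperature on <a>my<a> thermostat",
    "play some music in the <a>living room<a>",
]

# Collect all alternative translations for each source utterance
augmented = {utt: translator.generate_alternative_translations(utt) for utt in utterances}

for utt, variants in augmented.items():
    print(utt, "->", variants)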

To use a GPU and batching, specify the device and batch size:

IVAMT(src_lang="en", tgt_lang="pl", device="cuda:0", batch_size=16)

On a V100 GPU this translates roughly 100 sentences per minute.
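
The sketch below translates a small, hypothetical list of utterances with the GPU configuration above. It loops over single sentences using the documented translate() call; whether translate() also accepts whole lists is not shown here:

from iva_mt.iva_mt import IVAMT

translator = IVAMT(src_lang="en", tgt_lang="pl", device="cuda:0", batch_size=16)

# Hypothetical corpus of annotated source utterances
corpus = [
    "wake me up at <a>seven am<a>",
    "turn off the <a>kitchen<a> lights",
]

translations = [translator.translate(utt) for utt in corpus]
for src, tgt in zip(corpus, translations):
    print(src, "->", tgt)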

To use baseline M2M100:

IVAMT(src_lang="en", tgt_lang="pl", model_name="facebook/m2m100_418M")
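
For a quick qualitative comparison of the fine-tuned IVA model (the default) against the baseline checkpoint, both can be loaded side by side. This is only a sketch and assumes enough memory to hold two models at once:

from iva_mt.iva_mt import IVAMT

utterance = "set the temperature on <a>my<a> thermostat"

finetuned = IVAMT(src_lang="en", tgt_lang="pl")
baseline = IVAMT(src_lang="en", tgt_lang="pl", model_name="facebook/m2m100_418M")

print("fine-tuned:", finetuned.translate(utterance))
print("baseline: ", baseline.translate(utterance))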

To load a local model from an archive:

# If the model archive is located at /path/to/your/model.tgz, it will be automatically extracted
# to the ~/.cache/huggingface/hub directory. Specify this path using the `model_name` parameter.
IVAMT(src_lang="en", tgt_lang="pl", model_name="/path/to/your/model.tgz")

Note: When loading a local model, the tokenizer used will still be cartesinus/iva_mt_wslot-m2m100_418M-{src_lang}-{tgt_lang} to ensure compatibility and optimal performance.

Training M2M100 Model

This repository provides a train.py script for training M2M100 models on your translation tasks. A GPU is recommended for running the training script. When training on Google Colab, an A100 GPU is advisable, as a V100 might not have sufficient memory.

Prerequisites

  • Ensure that you have installed the necessary libraries by running the following command:
pip install transformers datasets sacrebleu

Usage

  1. Customize your training configuration by creating a JSON file (e.g., config/iva_mt_wslot-m2m100_418M-en-pl.json). In this file, specify the source language, target language, learning rate, weight decay, number of training epochs, and other relevant parameters.

  2. Execute the training script by running the following command:

python train.py --config config/iva_mt_wslot-m2m100_418M-en-pl.json

Configuration File

The configuration file should contain the following parameters:

  • src_lang: Source language code (e.g., "en" for English).
  • tgt_lang: Target language code (e.g., "pl" for Polish).
  • learning_rate: Learning rate for the optimizer.
  • weight_decay: Weight decay for the optimizer.
  • num_train_epochs: Number of training epochs.
  • model_space: The namespace for the model.
  • model_name: The name of the model.
  • dataset: The name of the dataset to be used for training.

Example Configuration:

{
    "src_lang": "en",
    "tgt_lang": "pl",
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "num_train_epochs": 3,
    "model_space": "facebook",
    "model_name": "m2m100_418M",
    "dataset": "wmt16"
}
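
A configuration like this can also be written programmatically, which is convenient when sweeping over language pairs or hyperparameters. A minimal sketch follows; the output path and values are only examples, and the resulting file is passed to train.py via --config:

import json

# Example configuration matching the parameters listed above
config = {
    "src_lang": "en",
    "tgt_lang": "pl",
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "num_train_epochs": 3,
    "model_space": "facebook",
    "model_name": "m2m100_418M",
    "dataset": "wmt16",
}

# Hypothetical output path
with open("config/iva_mt_wslot-m2m100_418M-en-pl.json", "w") as f:
    json.dump(config, f, indent=4)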

Running on Google Colab

If you are running the script on Google Colab, switch to a GPU runtime for better performance. An A100 GPU is recommended, as a V100 may run into memory limitations depending on the size of the model and the dataset.
