Generating diverse verb variants with VerbNet and Conditional Beam Search to improve the translation of Intelligent Virtual Assistant (IVA) training sets.
You can easily install multiverb_iva_mt from PyPI:
pip install iva_mt
This command will download and install the latest version of multiverb_iva_mt along with its required dependencies.
from iva_mt.iva_mt import IVAMT
translator = IVAMT(src_lang="en", tgt_lang="pl")
# for single-best translation
translator.translate("set the temperature on <a>my<a> thermostat")
# for multi-variant translation
translator.generate_alternative_translations("set the temperature on <a>my<a> thermostat")
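A minimal usage sketch, assuming translate() returns a single best translation and generate_alternative_translations() returns a list of alternative variants (these return types are an assumption, not confirmed above):
src = "set the temperature on <a>my<a> thermostat"
# single-best translation (assumed to return one string)
best = translator.translate(src)
print(best)
# multi-variant translation (assumed to return a list of strings)
for variant in translator.generate_alternative_translations(src):
    print(variant)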
Available languages (en2xx): pl, es, de, fr, pt, sv, zh, ja, tr, hi
To use a GPU and batching, provide the device (and optionally a batch size):
IVAMT(src_lang="en", tgt_lang="pl", device="cuda:0", batch_size=16)
On a V100, this allows translating roughly 100 sentences per minute.
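A short sketch of how the GPU-enabled translator might be used on several sentences (the example sentences are illustrative; batching is assumed to be handled internally according to batch_size):
from iva_mt.iva_mt import IVAMT

translator = IVAMT(src_lang="en", tgt_lang="pl", device="cuda:0", batch_size=16)

# Illustrative inputs; slot markers such as <a>...<a> are kept as in the examples above
sentences = [
    "set the temperature on <a>my<a> thermostat",
    "turn off the lights in the <a>kitchen<a>",
]
translations = [translator.translate(s) for s in sentences]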
To use baseline M2M100:
IVAMT(src_lang="en", tgt_lang="pl", model_name="facebook/m2m100_418M")
To load a local model from an archive:
# If the model archive is located at /path/to/your/model.tgz, it will be automatically extracted
# to the ~/.cache/huggingface/hub directory. Specify this path using the `model_name` parameter.
IVAMT(src_lang="en", tgt_lang="pl", model_name="/path/to/your/model.tgz")
Note: When loading a local model, the tokenizer used will still be cartesinus/iva_mt_wslot-m2m100_418M-{src_lang}-{tgt_lang}
to ensure compatibility and optimal performance.
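If you want to inspect that tokenizer directly, it can be loaded with the standard transformers API (shown here for en-pl; this is only an illustration, IVAMT loads it for you):
from transformers import M2M100Tokenizer

# Loads the slot-aware tokenizer used by IVAMT for the en-pl pair
tokenizer = M2M100Tokenizer.from_pretrained("cartesinus/iva_mt_wslot-m2m100_418M-en-pl")
print(tokenizer.tokenize("set the temperature on <a>my<a> thermostat"))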
This repository provides a train.py script for training M2M100 models on your translation tasks. A GPU is recommended for training; on Google Colab, prefer an A100, as a V100 may not have enough memory.
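Before launching training, you can quickly verify that a GPU is visible (this check assumes PyTorch is installed):
import torch

# Prints True if a CUDA GPU is available, then the device name
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))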
- Ensure that you have installed the necessary libraries:
pip install transformers datasets sacrebleu
- Customize your training configuration by creating a JSON file (e.g., config/iva_mt_wslot-m2m100_418M-en-pl.json). In this file, specify the source language, target language, learning rate, weight decay, number of training epochs, and other relevant parameters.
- Execute the training script:
python train.py --config config/iva_mt_wslot-m2m100_418M-en-pl.json
The configuration file should contain the following parameters:
- src_lang: Source language code (e.g., "en" for English).
- tgt_lang: Target language code (e.g., "pl" for Polish).
- learning_rate: Learning rate for the optimizer.
- weight_decay: Weight decay for the optimizer.
- num_train_epochs: Number of training epochs.
- model_space: The namespace of the model (e.g., "facebook").
- model_name: The name of the model (e.g., "m2m100_418M").
- dataset: The name of the dataset to be used for training.
Example Configuration:
{
"src_lang": "en",
"tgt_lang": "pl",
"learning_rate": 5e-5,
"weight_decay": 0.01,
"num_train_epochs": 3,
"model_space": "facebook",
"model_name": "m2m100_418M",
"dataset": "wmt16"
}
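For reference, here is a minimal sketch of how such a configuration could be consumed by a training script; the actual logic in train.py may differ:
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True, help="Path to the JSON training configuration")
args = parser.parse_args()

with open(args.config) as f:
    cfg = json.load(f)

# The full Hugging Face model identifier is composed of the namespace and model name,
# e.g. "facebook/m2m100_418M"
model_id = f'{cfg["model_space"]}/{cfg["model_name"]}'
print(f'Training {model_id} on {cfg["dataset"]} ({cfg["src_lang"]} -> {cfg["tgt_lang"]})')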
If you are running the script on Google Colab, make sure to switch to a GPU runtime. An A100 is recommended, as a V100 may run into memory limits depending on the size of the model and the dataset.